arxiv: v2 [cs.ds] 14 Jan 2016

Size: px
Start display at page:

Download "arxiv: v2 [cs.ds] 14 Jan 2016"

Transcription

1 On the Average-case Coplexity of Pattern Matching with Wildcards Carl Barton Blizard Institute, Bart s and the London School of Medicine and Dentistry, Queen Mary University of London, UK c.barton@qul.ac.uk arxiv: v [cs.ds] 4 Jan 06 Abstract Pattern atching with wildcards is the proble of finding all factors of a text t of length n that atch a pattern x of length, where wildcards (characters that atch everything) ay be present. In this paper we present a nuber of fast average-case algoriths for pattern atching where wildcards are restricted to either the pattern or the text, however, the results are easily adapted to the case where wildcards are allowed in both. We analyse the average-case coplexity of these algoriths and show the first non-trivial tie bounds. These are the first results on the average-case coplexity of pattern atching with wildcards which, as a by product, provide with first provable separation in tie coplexity between exact pattern atching and pattern atching with wildcards in the word RAM odel. Introduction Pattern atching with wildcards is a string atching proble where the alphabet consists of standard letters and a wildcard φ which atches every character in the alphabet. Given a text t of length n and a pattern x of length < n the proble consists of finding all factors of the text that atch the pattern. Pattern atching with wildcards naturally arises in a nuber of probles in bioinforatics, they are priarily used to odel single nucleotide polyorphiss (SNP), being used in the identification of diseases and in genoe wide association studies as arkers for gene apping. An early result in pattern atching with wildcards was the Fast Fourier Transfor (FFT) based algorith of Fischer and Paterson [9] with runtie O(n log log σ) for an alphabet of size σ. A subsequent study by Pinter [5] outlined why the intransitivity of the atch relation prevents algoriths such as KMP [9] being used or easily odified when wildcards are present. After Fischer and Patterson, uch work focused on iproving the algoriths by reoving the dependency on the alphabet size, with randoized O(n log n) and O(n log ) solutions being proposed in [6] and [7] respectively. Deterinistic O(n log ) solutions were proposed soon after, first by Cole and Hariharan [7] and then siplified by Clifford and Clifford [5]. The only known lower bound for the worst case of this proble is due to Muthukrishnan and Pale [3] who showed that in the worst case the proble is equivalent to coputing boolean convolutions. The indexing version of the proble was studied first in [5] where an index supporting queries in O( + α) tie was presented where wildcards are allowed in the pattern. A short coing of this approach is that in the worst-case α ay be Θ(hn) where h is the nuber of groups of wildcards. In [6] Cole et al. presented an index which given a text with k wildcards and an integer d, allows searching for any pattern with at ost d wildcards. For a pattern containing g d wildcards, the atching takes O( + g log k n log log n + occ) tie ; when wildcards are restricted to either the pattern or the text the query tie becoes We use the notation of [0] as it is ore understandable than [6]. licensed under Creative Coons License CC-BY Leibniz International Proceedings in Inforatics Schloss Dagstuhl Leibniz-Zentru für Inforatik, Dagstuhl Publishing, Gerany

2 On the Average-case Coplexity of Pattern Matching with Wildcards Proble Query Tie Wildcards in t O( log n + γ + occ) Wildcards in x O( + hβ) Wildcards in t and x O( log n + hβ + γ + occ) Opt wildcards in t O( log n + log n + γ log n + occ) Opt wildcards in x O( + ghβ) Opt wildcards in t and x O( log n + log n + ghβ + γ log n + occ) Table Properties of the indexes presented in [0] O( + g log log n + occ) and O( + log k n log log n + occ) respectively. A drawback of the index of Cole et al. is that once the index has been built it can only be used to search for patterns with at ost d wildcards. Very recently a nuber of indexes were presented by Bille et al. [4], where a linear space index with query tie O( + σ g log log n + occ) and a linear query tie index with space coplexity O(σ d n log d log n). The results of Billie et al. can be iproved using recent work on weighted ancestor queries [3]. In the area of linear size indexes for strings with wildcards, La et al. [0] have presented indexes for a nuber of probles shown in Table, where for a pattern x consisting of strings x 0, x,.., x j interleaved by j wildcards, t i is analogously defined for a text t of length n, occ(u, v) denotes the nuber of occurrences of u in v, γ = Σ l+ j= occ(t j, x), h is the nuber groups of consecutive wildcards in the text, g is the total nuber of wildcards in the text, β = in i h+ {occ(x i, t)}. Succinct indexes have been presented in [6] with a space usage of (( + o())n log σ + O(n) + O(h log n) + O(j log j)) bits for a text containing h groups of j wildcards in total. The authors of [4] proposed a copressed index where wildcards can only occur in the text with space usage nh y + o(n log σ) + O(h log n) bits, where H y is the y-th-order epirical entropy (y = o(log σ n)) of the text. The first non-trivial o(n log n) bit indexes were recently presented in []. In this paper we focus on the average-case coplexity of the proble. To the best of our knowledge no results are known. Preliinaries An alphabet Σ is a finite non-epty set, of size σ, whose eleents are called letters. A string on an alphabet Σ is a finite, possibly epty, sequence of eleents of Σ. The zero-letter sequence is called the epty string, and is denoted by ε. The length of a string x is defined as the length of the sequence associated with the string x, and is denoted by x. All strings of length q are denoted by Σ q and refer to any x Σ q as a q-gra. We denote by x[i], for all 0 i < x, the letter at index i of x. Each index i, for all 0 i < x, is a position in x when x ε. It follows that the i-th letter of x is the letter at position i in x, and that x = x[0.. x ]. A string x is a factor of a string y if there exist two strings u and v, such that y = uxv. Consider the strings x, y, u, and v, such that y = uxv. If u = ε, then x is a prefix of y. If v = ε, then x is a suffix of y. A wildcard character is a special character that does not belong to alphabet Σ, and atches with itself as well as with any character of Σ; it is denoted by φ. Two characters a

3 Carl Barton 3 and b of alphabet Σ {φ} are said to correspond (denoted by a φ b) if they are equal or at least one of the is the wildcard letter. Let x be a non-epty string and y be a string. We say that there exists an occurrence of x in y or, ore siply, that x occurs in y when x is a factor of y. Every occurrence of x can be characterised by a position in y. Thus we say that x occurs at the starting position i in y when y[i.. i + x ] = x. It is soeties ore suitable to consider the ending position i + x. To be consistent with previous works on pattern atching with wildcards we consider the word RAM odel of coputation with word size Ω(log n) In this paper the probles we consider are the following. Proble (Wildcards in the Text). Given a text t of length n drawn fro Σ {φ}, and a pattern x of length drawn fro Σ. Find all i such that t[i.. i + ] φ x[0.. ]. Proble (Wildcards in the Pattern). Given a text t of length n drawn fro Σ, and a pattern x of length drawn fro Σ {φ}. Find all i such that t[i.. i+ ] φ x[0.. ]. 3 Background on Average-case Analysis Here we give soe background inforation on the literature concerning average-case coplexity. The ter average-case has been used to refer to various different assuptions when discussing online pattern atching in strings, here we discuss these different assuptions and justify the odel we use. In the literature soeties it is assued that the pattern is randoly drawn fro the alphabet, whilst others consider that the pattern is arbitrary. An arbitrary pattern is assued in [8, 7, 9], however, in later work such as [0,,,,, 4, ] the assuption of a rando pattern is ade. Clearly the notion of average-case coplexity with arbitrary patterns is stronger than with rando patterns and that is what we consider here. Soething else we consider is the fixed or adaptive nature of the sequence of probing positions applied to the text. We refer to the sequence of probing positions as the inspection schee. In [7] it was shown that having a predeterined inspection schee negatively effects the runtie of exact pattern atching algoriths for < n < when copared with an adaptive inspection schee. In this paper we explore the effect this property can have on the average-case perforance of algoriths for pattern atching with wildcards and refer to algoriths which exaine the character inside a window in a fixed order as fixed algoriths. For the purpose of showing lower bounds in this paper we consider a siplified odel of coputation for fixed algoriths and define the as follows. Definition (Fixed Algorith). Consider t partitioned into non-overlapping blocks of size. Let (i, i,.., i ) be an arbitrary but fixed perutation of (0,,, 3,.., ). An algorith is fixed with respect to (i, i,.., i ) if, for every block, the sequence of probing positions is (i, i,.., i ). We wish to find all factors of t corresponding to x that are entirely within a block. We consider the following siplified definition of non-fixed algoriths. Definition (Non-fixed Algorith). Consider t partitioned into non-overlapping blocks of size. An algorith is non-fixed if characters of t can be inspected in any order. We wish to find all factors of t corresponding to x that are entirely within a block. The order characters are read in the text.

4 4 On the Average-case Coplexity of Pattern Matching with Wildcards We show a tight upper bound on the best perforance for fixed algoriths and for non-fixed algoriths we show upper and lower bounds which atch within a logarithic factor for all but the ost extree values of g. The upper bounds for fixed and non-fixed algoriths are quite different. It is iportant to point out that when discussing the average-case coplexity of online string atching probles it is custoary to ake a distinction between tie taking to preprocess the pattern and the search tie. average-case optial custoarily refers to achieving the optial search tie, not necessarily considering the preprocessing tie required to achieve it. Our Contribution: In this article, we present fast average-case algoriths for pattern atching with wildcards restricted to the text or pattern. We present algoriths with low preprocessing and average-case search tie O( n log σ ) and O( n(g+log σ ) g ) for Probles and which are optial in the non-fixed and fixed odels respectively. We show an algorith with optial average-case search tie for Proble in the non-fixed odel that has coplexity between O( n log σ g ) and O( n log σ log ). 4 Algoriths In the following section we present a nuber of filtering algoriths for pattern atching in the presence of wildcards. The presented algoriths consist of two distinct schees: the filtering schee, which deterines if the currently considered text window potentially has a valid occurrence; in case the window ay contain a valid occurrence, we are required to check the window for valid occurrences of the pattern; this is done through the verification schee. Intuitively, the algoriths consider a sliding window of length of the text, and reads q-gras backwards fro the centre of the window until it is likely to have found a difference in every possible occurrence, allowing us to skip the window. That is, we wish to ake the probability of a verification being triggered sufficiently unlikely whilst also ensuring we can shift the window a reasonable aount. The verification schee used in this algorith consists of naively checking all possible alignents of the pattern against the text. Clearly each check takes no ore than O() tie and there are possible start positions for a window of size so O( ) in total. For the rest of the article we refer to this verification schee as VER(i, x) where i is the start of the window and x is the pattern. For the rest of the article we assue that the text t is of length n and is rando and uniforly drawn fro Σ or Σ {φ}, depending on the proble, and that Σ is a finite but not necessarily constant alphabet. 4. Filtering Schee The filtering schee of the presented algorith requires the preprocessing and indexing of the pattern x. We first present the preprocessing required and then present the searching technique itself. Preprocessing. We outline two indexing schees that will be used in this paper. The first is a basic q-gra index adapted to allow wildcards in the query, which siply considers if a atching q-gra exists in the indexed text. To build this index we generate all strings of

5 Carl Barton 5 length q fro the alphabet Σ {φ}, and check each against the set of q length factors fro the pattern. Since there are (σ + ) q different generated strings, at ost factors of length q in the pattern and each check requires O(q) tie, the total tie spent is O(q(σ + ) q ). We can store the results of this processing in a binary array where each string is converted to a nuerical representation for efficient lookup. A dictionary built in this way has a search tie of O(q). We refer to this schee as q-basic. The second indexing schee is based on the siple q-gra index but allows wildcards in the text and for ultiple queries allows us to enforce a weak ordering on the q-gras. The intuition for this schee is that if we index a text with g wildcards, it ay be the case that all of these wildcards occur sequentially. In this situation if q < g then the basic q-gra index will report a atch for every single query. This akes it ipossible for us to use the basic q- gra index as part of a filtering schee unless q > g which gives an exponential dependency on g. To get around this issue we build a q-gra index by considering every length q factor of the pattern and generating all atching strings over Σ. Let M[0.. σ q, 0.. ] be a -d array and q be the q-gra starting at x[i], then for every s Σ q such that s = q we store M[ρ(s)][i] 0 where ρ(s) is the nuerical representation of s.. This procedure is done for every q-gra in the pattern. Clearly for each q-gra considered there are at ost σ q atching factors and no ore than q-gras will be considered giving O(σ q ) tie. Siilarly M is of size at ost σ q and each list is not longer than giving O(σ q ) space. We preprocess each array in M for next saller value queries. Next saller value queries take O() tie after linear preprocessing [3]. We refer to this schee as q-weak. Consider a sequence s, s,.., s l of l consecutive, non-overlapping q-gras read fro a text t, we ake use of the observation that any occurrence of the pattern containing those q-gras ust contain those q-gras in the sae order that they appear in the text. Whilst it is possible to check if all possible start locations have the q-gras occurring consecutively this search ay be costly. We weaken the restriction on q-gras and require only that for soe s i occurring at position j in the pattern that there exists an occurrence of s i+ at soe h > j + q. If there are ultiple occurrences then we take the iniu starting position greater than j + q. Given a q-weak index, l query q-gras such that s i occurs at position j, this corresponds to checking if s i+ exists and then perforing a next saller value query on M[ρ(s i+ )][j + q]. We denote this query by Weak-Order-Query(s, s,.., s l ) and with this we get the following. Lea 3. Let x be a pattern of length with g wildcards, t be a text of length n with no wildcards and let s, s,.., s l be a sequence of l consecutive, non-overlapping q-gras read fro t. Given a q-weak-fast index of x we can perfor Weak-Order-Query(s, s,.., s l ) in O(lq + l). 4. Wildcards in Text Only We begin by considering the proble where wildcards ay appear only in the text and refer to the algorith of this section as Algorith Wt. First we will establish the probability of a rando string of length x log σ+ containing wildcards atching soe factor of the pattern. This choice of length will becoe clear later. Lea 4. Let u be a rando string of length x log σ+ over Σ {φ} and let v be a string of length over Σ. The probability of u atching a factor of v is at ost x. Applying index q-basic we can see that for q = 3 log σ+ this is a polynoial preprocessing schee and the axiu exponent occurs when σ =. For σ = we have that

6 6 On the Average-case Coplexity of Pattern Matching with Wildcards 3 3 log and so the total preprocessing is O( 9.3 log ). Asyptotically on the alphabet size this becoes O( 3 log ). Consider a sliding window of length placed over the text and that for each window we will check if the suffix of length 3 log σ+ corresponds with any factor of the pattern. If the suffix corresponds with a factor of the pattern there ay be a atch and we are required to verify this window. To verify the window we run algorith VER(i, x) in tie O( ). For each window on the text we have a total possible running tie of O( + log σ+ ). The probability of the suffix corresponding with a factor of the pattern is / by Lea 4, so the expected tie spent at each window is O(log σ+ ) or O(log σ ). If the window is verified then it is possible to shift by characters, otherwise the window is not verified n and can we shift by q characters. This eans there there are at ost q windows, each has an expected runtie of O(log σ ), giving us an overall average-case running tie of O(n log σ /). Fro the above discussion, and noting that Yao s lower bound [7] for exact string atching is O(n log σ /), we get the following result: Theore 5. Algorith Wt has optial average-case search tie O(n log σ /) with 3 log σ+ 3 log σ+ O((σ + ) logσ ) preprocessing tie and O((σ + ) ) space. 4.3 Wildcards in the Pattern In this section we consider the case where wildcards ay appear in both the pattern and text. For a pattern x we denote the nuber of wildcards in the pattern by g. We refer to the algorith of this section as Algorith Wp. We could use dictionary q-basic but for this proble we would have that q > g. This eans that the space coplexity of the dictionary would becoe Ω((σ + ) g ) which is far too high. Instead we will use dictionary q-weak-fast. We now establish the probability of rando string atching a pattern with g wildcards. Lea 6. Let u be a rando string of size g + x log σ over Σ and let v be a string of size > g + log σ over Σ {φ} with g wildcards. The probability of u atching a factor v is no ore than x. For this algorith we create a sliding window on the text of length and build a q-weak-fast index over x with q = 3 log σ. For each window on the text we denote the start position by i and read g q + q-gras starting at t[i + g q ] and perfor a Weak-Order-Query on the. If the Weak-Order-Query returns true then we perfor VER(i, x) to verify all possible start positions in the current window and shift by positions. If the Weak-Order-Query returns false we shift by g q. The iniu shift the algorith akes is g 3 log σ, n so there will be at ost g 3 log σ windows on the text and at each window we ay do O( + g + log σ ) work in the worst case. The probability that we will need to verify a window is / by Lea 6, this gives us an expected tie of O(g + log σ ) per window. n For the algorith to achieve the claied runtie it ust be the case that g 3 log σ = O( n ). To satisfy this it follows that g + 3 log σ c for soe 0 < c <. This places the following condition on the wildcard ratio of our algorith: g < c 3 log σ We also have an additional restriction, we ust be able to guarantee that we are reading enough new rando characters after each shift that Lea 6 still holds. This places the

7 Carl Barton 7 additional restriction that ust be at least twice the length of the shortest shift. So it ust hold that > (g + 3 log σ ) to ensure that in all cases we read enough new q-gras fro a window for the above analysis to hold. After rearrangeent this places the following restriction on our algorith: g < 3 log σ Clearly the second condition places the strictest condition on the wildcard ratio. Fro the above discussion we achieve the following result: Theore 7. Algorith Wp has average-case search tie O( n(g+log σ ) ) with O( 4 ) preprocessing tie and O( 4 ) space, for g < 3 log σ. The wildcard ratio we specify is quite perissive as 3 log σ = o() so for any g/ < / it is possible to pick a sufficiently large value of such that the algorith can run in the claied running tie. In the following theore we show that for any fixed algorith it is ipossible to do any better than this. We show that for any integer g, there exists a lower bound of Ω( ng ) character inspections for any fixed algoriths. Theore 8. Algorith Wp has average-case search tie O( n(g+log σ ) ) and no fixed algorith can do better. 5 A General Lower Bound In the previous section we have considered the average-case coplexity of each algorith. Fixed algoriths consider the characters in each window of the text in a fixed order, an approach that is ubiquitous in string algoriths. We have shown that for any fixed inspection schee there exists patterns that perfor badly on average. In this section we consider nonfixed algoriths and derive an average-case lower bound for any algorith solving proble with an arbitrary pattern and any value for g. This bound gives a provable separation in coplexity between exact string atching and wildcard atching. Clearly a lower bound for this proble also lower bounds the proble where wildcards appear in both the pattern and the text. When considering this proble the order characters are inspected becoes very iportant. Soe inspection schees do not have uch effect on the expected nuber of candidates not yet ruled out. It is the case that inspections that, should wildcards not be present, would lead to a large reduction in the expected nuber of candidates ay give very little inforation when wildcards exist. Consider that we have a pattern of length with g < wildcard characters and a text of length n. Partition the text into non-overlapping blocks of size, and only consider that we ust report all atches within each block. This is an optiistic assuption as this excludes those atches which overlap two blocks. In the following section we will deterine a lower bound for the nuber of character inspections required for one block. The lower bound for the general proble can then be derived. For each block we call 0,.., possible starting positions candidates and when we inspect a character fro the block we call this a block access. The candidates affected by a block access are intersected by it. Given a block access to position z in block b we can only rule out candidate c if there exists soe y such that x[y] b[z] and c + y = z. For all non-wildcard positions intersected by a given block access i j, there is a probability of at ost /σ that the candidate will not be ruled out. For those candidates where this block

8 8 On the Average-case Coplexity of Pattern Matching with Wildcards access intersects a wildcard there is probability it will not be ruled out. We outline a few optiistic assuptions used in our analysis. Any block access intersects all candidates. Intersections are distributed uniforly across all candidates. The effect of this is that g candidates have a chance of being ruled out at every block access. After k block accesses in this odel we have ade ( g)k intersections and we assue that these are distributed uniforly across all candidates. This is optiistic as this eans that we always intersect candidates with the highest probability of occurrence, soething which ay not actually be possible. This eans that we ay only overestiate the nuber of candidates ruled out and our result is a lower bound. Clearly the expected nuber of candidate not ruled out is then be evaluated as follows: i=0 σ ( g)k = σ ( g)k For each candidate we need to either rule it out as a possible starting position or declare a atch. So the optial is to deterine when we would expect to have ruled out every candidate position or read characters. We iniise the following so that we expect to have at ost one candidate left or until we have read all positions. Rearranging this we get the following: σ ( g)k log σ log σ g ( g)k Now we know that the lower bound for each block is Ω( log σ g ) and there are n/ blocks. The result below then follows: Theore 9. The average-case lower bound for wildcard atching with wildcards only in the pattern is Ω( n log σ g ). Clearly this results suggests that pattern atching with wildcards is on average asyptotically harder than exact string atching when g = f(), where f() is soe function asyptotically saller than O(). Now we discuss a strategy for inspecting positions of a block and show that a greedy schee perfors an optial nuber of character coparisons. By greedy we ean that at each step the block access which would ost greatly reduce the expected nuber of reaining positions is chosen. For a candidate i let i be the nuber of ties it has been intersected at a non-wildcard position. Now for each position in a block 0 i < let B i be the set of candidates that the block access i intersects. The effect on the expected nuber of candidates not ruled out by inspecting soe position l is given by the following. Let U = {0,,.., }\B l and i denote the nuber of ties candidate i has been intersected before the block access to l. k i=0 σ i j U σ j h B l σ h+

9 Carl Barton 9 The greedy inspection axiises the last two ters of the above suation at each step. The intuition behind this schee is based on our proof of the lower bound presented above. In the proof we saw that evenly distributing accesses across all candidates iniises the expectation. The greedy schee attepts to siulate this behaviour by picking the access which ost iniises the probability at each step. We now show that this is in fact optial. Theore 0. The greedy inspection schee perfors an optial nuber of character coparisons. We have shown that a greedy inspection schee iniises the nuber of character inspections required for any g, but this does not give us any explicit upper bound. We now give an upper bound on the average-case search tie significantly better than that in algorith Wp for all but the ost extree values of g. 6 An Iproved Upper Bound As before consider that we have a pattern of length which contains g < wildcard characters and a text of length n. We partition the text into non-overlapping blocks of size and only consider that we have to report all atches within each block. Although this is an optiistic assuption, we will show how to convert this to a true upper bound later in this section. In the rest of this section we will deterine an upper bound for the nuber of character inspections required for one block. First we recall the definition of the ε-dense set cover proble along with a known approxiation results. Proble 3. Given ε > 0, a set of eleents U = {,, 3,.., r}, a faily S of l sets such that every eleent of U occurs in εr sets and the union of S equals U. Find the iniu nuber of sets fro S such that their union is U. The general set cover proble is known to be NP-hard, however the following approxiation result is known for the ε-dense set cover proble. Lea ([8]). There exists an approxiation algorith for the ε-dense set cover proble with output size log r where U = r. ε We define the following faily of sets for j < S j = { i {,,.., } : i < j and c > 0 s.t. x[c] φ and c + i = j} Our goal in this section is to deterine how any inspections it takes to ake the probability of any candidate not being rules out saller enough to ake verification tie negligible. One way to do this is to ensure that all candidates have been intersected at least 3 log σ, causing the probability one is not ruled out to be at ost /. The difficulty in deriving the upper bound is we ust explicitly consider how each candidate is affected by a block access. To do this we use the above result on ε-dense set covers. By the definition of S = S,.., S it is clear that each eleent occurs in g sets, so the proble can be seen as an g -dense set cover proble. We will apply the ε-dense set cover proble, reove the approxiate set cover and apply the ε-dense set cover proble again until we achieve 3 log σ intersections per candidate. We now define a faily of functions whose arguent is the pattern x which relate to repeated applications of the ε-dense set cover proble applied on S. F 0 (x) = log + g

10 0 On the Average-case Coplexity of Pattern Matching with Wildcards F i (x) = log + g + log F 0 (x) + log F (x) log F i (x) We are now in a position to define a density function which gives the density of the reaining sets after i applications of the ε-dense set cover proble. D 0 (x) = g D i (x) = g log F 0 (x) log F (x).. log F i (x) The nuber of characters inspected to guarantee i > 0 intersections per candidate is then given by G i (x) = log F 0 (x) + log F (x) log F i (x) We now show a nuber of properties of the above functions. We wish to show that for soe integer i 0, a constant 0 < ɛ < and g + log F 0 (x) + log F (x) log F i (x) < ɛ. log + ɛ li F i (x) log We proceed by induction on i. Let i = 0 and g < ɛ then clearly log + ɛ li + g log Let i = k, 0 < ɛ < and assue for all natural nubers less than k that the induction hypothesis holds. We focus on the lower bound as the upper bound is obvious. Let i = k + and g + log F 0 (x) + log F (x) log < ɛ. Applying the induction hypothesis this is at least F k (x) g + log + log log < ɛ as and therefore. log +ɛ log +ɛ log +ɛ }{{} k+ ties It then follows that li + g + log F 0 (x) + log F (x) log F k (x) ɛ log + ɛ li F k+ (x) log With this result we get that for sufficiently large and g+ log F 0 (x) + log F (x) +..+ log F i (x) < ɛ. G i (x) = O(i log ) Recall that we wish to intersect each candidate at least 3 log σ so for i = 3 log σ + we get the nuber of inspected characters is G 3 log σ + (x) = O(log σ log ) for sufficiently large. This upper bound is for a block of the text of size and assues all character inspections can be considered as independent tests. To convert this to a general bound for a text of size n consider the text partitioned into blocks of size that overlap by characters. After inspecting O(log σ log ) characters we expect that we can discard the entire block and can shift by characters. It ay be the case that we have already read soe of the characters we need to inspect, but as long as the window is twice the size of O(log σ log )

11 Carl Barton the analysis still holds. Coputing the greedy inspection schee can be trivially done in O( 3 ) and an index can be built by building a finite state achine eorising the candidates still valid for any possible character read in the order specified, this takes no ore than σ O(log σ log ) which is polynoial for σ = O() and as σ O(log σ log ) = O(log σ (log )) is quasi-polynoial for larger alphabets, therefore. Theore. For any constant 0 < ɛ < there exists an algorith for Proble with average-case search tie O( n log σ log ) when g + log F 0 (x) + log F (x) log F 3 log (x) < ɛ. Cobining all the above we know that there exists an algorith that has the following bounds on the nuber of character required to be inspected. Let OPT denote the nuber of characters inspected by the optial schee with the specified conditions on g. 7 Conclusions and Future Work ( n logσ ) ( n logσ log O OPT O ) g In this paper we have investigated the average-case coplexity of two wildcard atching probles. We have considered two odels of coputation and have shown a big jup in the coplexity of the two. The question of a tight bound on the search coplexity of pattern atching with wildcards reains open. We have shown an algorith which has optial average-case search tie and upper/lower bounds on the tie coplexity that are only different by a logarithic factor. This lower bound shows the first provable separation in tie coplexity between wildcard atching and exact atching in the word RAM odel. Although [3] showed that in the worst case the proble is equivalent to coputing boolean convolutions and any believe this cannot be solved in linear tie, no superlinear lower bound on this is known. We conjecture that the optial nuber of character inspections is in fact the lower bound stated in Theore 9. References Ricardo Baeza-Yates and Gonzalo Navarro. New and faster filters for ultiple approxiate string atching. In Rando Structures and Algoriths (RSA), page 00, 998. Ricardo. A. Baeza-Yates. Algoriths for string searching. SIGIR Foru, 3(3-4):34 58, April Jéréy Barbay, Johannes Fischer, and Gonzalo Navarro. Lr-trees: Copressed indices, adaptive sorting, and copressed perutations. Theo. Cop. Sci., 459:6 4, 0. 4 Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and Søren Vind. String indexing for patterns with wildcards. In Fedor V. Foin and Petteri Kaski, editors, Algorith Theory, volue 7357 of Lecture Notes in Coputer Science, pages Peter Clifford and Raphaël Clifford. Siple deterinistic wildcard atching. Inf. Process. Lett., 0():53 54, January Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary atching and indexing with errors and don t cares. In Proc. of the 36th annual ACM STOC, pages ACM, Richard Cole and Raesh Hariharan. Verifying candidate atches in sparse and wildcard atching. In Proc. of the 34th annual ACM STOC, pages 59 60, Maxie Crocheore, Artur Czuaj, Leszek Gasieniec, Stefan Jaroinek, Thierry Lecroq, Wojciech Plandowski, and Wojciech Rytter. Speeding up two string-atching algoriths. Algorithica, (4-5):47 67, 994.

12 On the Average-case Coplexity of Pattern Matching with Wildcards 9 Michael J. Fischer and Michael S. Paterson. String-atching and other products. Technical report, Cabridge, MA, USA, Kio Fredriksson and Gonzalo Navarro. Average-optial ultiple approxiate string atching. In Ricardo Baeza-Yates, Edgar Chávez, and Maxie Crocheore, editors, Cobinatorial Pattern Matching, volue 676 of LNCS, pages Kio Fredriksson and Gonzalo Navarro. Average-optial single and ultiple approxiate string atching. J. Exp. Algorithics, 9, Deceber 004. Kio Fredriksson and Gonzalo Navarro. Flexible usic retrieval in sublinear tie. In In Proc. 0th Prague Stringology Conference, pages 74 88, Paweł Gawrychowski, Moshe Lewenstein, and PatrickK. Nicholson. Weighted ancestors in suffix trees. In Andreas S. Schulz and Dorothea Wagner, editors, Algoriths - ESA 04, volue 8737 of LNCS, pages Wing-Kai Hon, Tsung-Han Ku, Rahul Shah, Shara V. Thankachan, and Jeffrey Scott Vitter. Copressed text indexing with wildcards. In Roberto Grossi, Fabrizio Sebastiani, and Fabrizio Silvestri, editors, SPIRE, volue 704 of LNCS, pages Costas S. Iliopoulos and M. Sohel Rahan. Pattern atching algoriths with don t cares. In 33rd International Conference on Current Trends in Theory and Practice of Coputer Science, SOFSEM (), pages 6 6, Piotr Indyk. Faster algoriths for string atching probles: Matching the convolution bound. In Proc. of the 39th Syposiu on Foundations of Coputer Science, pages 66 73, Ada Kalai. Efficient pattern-atching with don t cares. In Proc. of the 3th annual ACM-SIAM SODA, pages , Marek Karpinski and Alexander Zelikovsky. Approxiating dense cases of covering probles. Technical report, Donald E. Knuth, Jaes H. Morris Jr., and Vaughan R. Pratt. Fast pattern atching in strings. SIAM J. Coput., 6():33 350, Tak-Wah La, Wing-Kin Sung, Siu-Lung Ta, and Siu-Ming Yiu. Space efficient indexes for string atching with dont cares. In Takeshi Tokuyaa, editor, Algoriths and Coputation, volue 4835 of LNCS, pages Moshe Lewenstein, Yakov Nekrich, and Jeffrey Scott Vitter. Space-efficient string indexing for wildcard pattern atching. CoRR, abs/40.065, 04. Moritz G. Maaß. Average-case analysis of approxiate trie search. In SuleyanCenk Sahinalp, S. Muthukrishnan, and Ugur Dogrusoz, editors, Cobinatorial Pattern Matching, volue 309 of LNCS, pages S. Muthukrishnan and Krishna Pale. Non-standard stringology: Algoriths and coplexity. In Proceedings of the Twenty-sixth Annual ACM Syposiu on Theory of Coputing, STOC 94, pages , New York, NY, USA, 994. ACM. 4 Gonzalo Navarro and Mathieu Raffinot. Flexible Pattern Matching in Strings: Practical On-line Search Algoriths for Texts and Biological Sequences Ron Y. Pinter. Efficient string atching with don t-care patterns. In Alberto Apostolico and Zvi Galil, editors, Cobinatorial Algoriths on Words, volue of NATO ASI Series, pages Chris Thachuk. Succincter text indexing with wildcards. In Raffaele Giancarlo and Giovanni Manzini, editors, CPM, volue 666 of LNCS, pages Andrew Chi-Chih Yao. The coplexity of pattern atching for a rando string. SIAM J. Coput., 8(3): , 979.

13 Carl Barton 3 Appendix Proof of Lea 4 Proof. The probability of a character fro Σ atching a randoly picked character of Σ {φ} is σ+. The probability of r characters atching in this way is given by ( σ+ )r, by setting r = x log σ+ we get that ( σ+ )r =. There are at ost x log x σ+ + factors of this length in v and so the probability of occurrence is no ore than x. Proof of Lea 6 Proof. In the worst case all g wildcards of v appear in a single factor of size g + x log σ, this factor contains x log σ positions which are not wildcards and the probability that these atch the corresponding characters in u is no ore than. For any factor with less x than g wildcards the probability is. Pessiistically assue all factors of v of length x g + x log σ atch with probability, there are (g + x log x σ ) factors of this length so the probability is no ore than x. Proof of Theore Proof. Recall that the lower bound for exact string atching is Ω(n log σ /). We now show that any fixed algorith has a lower bound of Ω( ng ). Assue the text is partitioned into non-overlapping blocks of size and that we only want to find occurrences contained entirely within these blocks. Let π denote all the perutations of (0,,.., ) and assue that we exaine the characters of each block in the sae fixed but arbitrary order (i 0, i,.., i ) π. We can construct patterns, for any g <, such that we ust exaine at least g + characters before all start positions can be ruled out in the following way. If for 0 j < g all i j occur within a range of positions then we place the wildcards in positions i 0, i,.., i g of the pattern. Otherwise for 0 j < g and i j < place a wildcard at position i j. Any reaining wildcards ay be placed anywhere in the pattern; the reaining positions of the pattern are rando characters fro Σ. After inspecting characters i 0, i,.., i g of the block at least the first position can neither be ruled out nor declared as a atch. Cobining the lower bound of Yao and this we see that any fixed algorith has a lower bound of Ω(n(g + log σ )/) for this proble. Algorith Wp runs in average-case tie O(n(g + log σ )/), atching the lower bound; therefore the algorith is optial in the faily of fixed algoriths. Proof of Theore 5 Proof. The optial nuber of character coparisons is achieved by a schee iniising the nuber of block accesses needed to expect that less than candidate reains. An inspection schee L k is siply a k subset of the probing positions {0,,,.., }. For soe inspection schee L k we denote by E[L k ] the expected nuber of reaining candidates for inspection schee L k. We proceed by induction on the nuber of block access and clai that the greedy schee is an optial schee. Let G i = {ð, ð,.., ð i } be the first i block accesses ade by the greedy inspection schee. G i+ is then defined as G i ð i+ where ð i+ is the block access with the biggest effect on the expected nuber of candidates not ruled out. The base case is siple, we pick the block access which iniises the expected nuber of candidates not ruled out, by definition this is the iniu.

14 4 On the Average-case Coplexity of Pattern Matching with Wildcards Assue that for soe i it is true that for the greedy schee the nuber of expected candidates is the sallest possible after i accesses. Let K i = {ℸ, ℸ,.., ℸ i } be an arbitrary inspection schee with i block accesses. If G i = K i then E[G i+ ] E[K i+ ] and we are done, otherwise we have a few cases to consider. Consider G i+ and K i+, if ℸ i+ = ð i+ = l. Let B l be the set of candidates that the block access l intersects. If the following holds: Then so does the following: Additionally, if the following is true: Then so is the following: h B l σ ℸ h h B l σ ð h σ ℸ h+ σ ð h+ h B l h B l h B l σ ℸ h h B l σ ð h σ ℸ h+ σ ð h+ h B l h B l Therefore, should ℸ i+ = ð i+ then E[G i+ ] E[K i+ ]. Let C = {{0,,.. }\G i }\K i and ℸ i+ ð i+ if ð i+, ℸ i+ C then E[G i+ ] E[K i+ ], otherwise either ð i+ C or ℸ i+ C. If ℸ i+ ð i+ then either ℸ i+ G i or ð i+ K i. It is sufficient to show that there exists at least one possible choice for ð i+ that would cause E[G i+ ] E[K i+ ] in each case. Case : ℸ i+ G i Let ℸ u K i \G i, ℸ u ust exist as K i G i, and consider the inspection schee K i = {K i \ℸ u } ℸ i+. We clai that ð i+ = ℸ u ensures optiality. By the induction hypothesis E[K i ] E[G i]. Now let ð i+ = ℸ i+ = ℸ u, by the above analysis for when ð i+ = ℸ i+, E[G i+ ] E[K i+ ] and therefore E[G i+] E[K i+ ]. Case : ð i+ K i If ℸ i+ C, the greedy schee could also pick the block access ℸ i+ and reain optial so it ust be the case that E[G i+ ] E[K i+ ]. If ℸ i+ C then ℸ i+ G and by case the greedy schee is optial. Therefore E[G i+ ] E[K i+ ] and the stateent is proved.

Constant-Space String-Matching. in Sublinear Average Time. (Extended Abstract) Wojciech Rytter z. Warsaw University. and. University of Liverpool

Constant-Space String-Matching. in Sublinear Average Time. (Extended Abstract) Wojciech Rytter z. Warsaw University. and. University of Liverpool Constant-Space String-Matching in Sublinear Average Tie (Extended Abstract) Maxie Crocheore Universite de Marne-la-Vallee Leszek Gasieniec y Max-Planck Institut fur Inforatik Wojciech Rytter z Warsaw University

More information

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices CS71 Randoness & Coputation Spring 018 Instructor: Alistair Sinclair Lecture 13: February 7 Disclaier: These notes have not been subjected to the usual scrutiny accorded to foral publications. They ay

More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness A Note on Scheduling Tall/Sall Multiprocessor Tasks with Unit Processing Tie to Miniize Maxiu Tardiness Philippe Baptiste and Baruch Schieber IBM T.J. Watson Research Center P.O. Box 218, Yorktown Heights,

More information

Homework 3 Solutions CSE 101 Summer 2017

Homework 3 Solutions CSE 101 Summer 2017 Hoework 3 Solutions CSE 0 Suer 207. Scheduling algoriths The following n = 2 jobs with given processing ties have to be scheduled on = 3 parallel and identical processors with the objective of iniizing

More information

Approximation in Stochastic Scheduling: The Power of LP-Based Priority Policies

Approximation in Stochastic Scheduling: The Power of LP-Based Priority Policies Approxiation in Stochastic Scheduling: The Power of -Based Priority Policies Rolf Möhring, Andreas Schulz, Marc Uetz Setting (A P p stoch, r E( w and (B P p stoch E( w We will assue that the processing

More information

List Scheduling and LPT Oliver Braun (09/05/2017)

List Scheduling and LPT Oliver Braun (09/05/2017) List Scheduling and LPT Oliver Braun (09/05/207) We investigate the classical scheduling proble P ax where a set of n independent jobs has to be processed on 2 parallel and identical processors (achines)

More information

Lecture 21. Interior Point Methods Setup and Algorithm

Lecture 21. Interior Point Methods Setup and Algorithm Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t.

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t. CS 493: Algoriths for Massive Data Sets Feb 2, 2002 Local Models, Bloo Filter Scribe: Qin Lv Local Models In global odels, every inverted file entry is copressed with the sae odel. This work wells when

More information

Handout 7. and Pr [M(x) = χ L (x) M(x) =? ] = 1.

Handout 7. and Pr [M(x) = χ L (x) M(x) =? ] = 1. Notes on Coplexity Theory Last updated: October, 2005 Jonathan Katz Handout 7 1 More on Randoized Coplexity Classes Reinder: so far we have seen RP,coRP, and BPP. We introduce two ore tie-bounded randoized

More information

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical IEEE TRANSACTIONS ON INFORMATION THEORY Large Alphabet Source Coding using Independent Coponent Analysis Aichai Painsky, Meber, IEEE, Saharon Rosset and Meir Feder, Fellow, IEEE arxiv:67.7v [cs.it] Jul

More information

1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential

More information

time time δ jobs jobs

time time δ jobs jobs Approxiating Total Flow Tie on Parallel Machines Stefano Leonardi Danny Raz y Abstract We consider the proble of optiizing the total ow tie of a strea of jobs that are released over tie in a ultiprocessor

More information

arxiv: v1 [cs.ds] 29 Jan 2012

arxiv: v1 [cs.ds] 29 Jan 2012 A parallel approxiation algorith for ixed packing covering seidefinite progras arxiv:1201.6090v1 [cs.ds] 29 Jan 2012 Rahul Jain National U. Singapore January 28, 2012 Abstract Penghui Yao National U. Singapore

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs On the Inapproxiability of Vertex Cover on k-partite k-unifor Hypergraphs Venkatesan Guruswai and Rishi Saket Coputer Science Departent Carnegie Mellon University Pittsburgh, PA 1513. Abstract. Coputing

More information

E0 370 Statistical Learning Theory Lecture 6 (Aug 30, 2011) Margin Analysis

E0 370 Statistical Learning Theory Lecture 6 (Aug 30, 2011) Margin Analysis E0 370 tatistical Learning Theory Lecture 6 (Aug 30, 20) Margin Analysis Lecturer: hivani Agarwal cribe: Narasihan R Introduction In the last few lectures we have seen how to obtain high confidence bounds

More information

Midterm 1 Sample Solution

Midterm 1 Sample Solution Midter 1 Saple Solution NOTE: Throughout the exa a siple graph is an undirected, unweighted graph with no ultiple edges (i.e., no exact repeats of the sae edge) and no self-loops (i.e., no edges fro a

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung

More information

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest

More information

1 Generalization bounds based on Rademacher complexity

1 Generalization bounds based on Rademacher complexity COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #0 Scribe: Suqi Liu March 07, 08 Last tie we started proving this very general result about how quickly the epirical average converges

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory Proble sets 5 and 6 Due: Noveber th Please send your solutions to learning-subissions@ttic.edu Notations/Definitions Recall the definition of saple based Radeacher

More information

Upper bound on false alarm rate for landmine detection and classification using syntactic pattern recognition

Upper bound on false alarm rate for landmine detection and classification using syntactic pattern recognition Upper bound on false alar rate for landine detection and classification using syntactic pattern recognition Ahed O. Nasif, Brian L. Mark, Kenneth J. Hintz, and Nathalia Peixoto Dept. of Electrical and

More information

Randomized Recovery for Boolean Compressed Sensing

Randomized Recovery for Boolean Compressed Sensing Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch

More information

The Frequent Paucity of Trivial Strings

The Frequent Paucity of Trivial Strings The Frequent Paucity of Trivial Strings Jack H. Lutz Departent of Coputer Science Iowa State University Aes, IA 50011, USA lutz@cs.iastate.edu Abstract A 1976 theore of Chaitin can be used to show that

More information

On the Maximum Number of Codewords of X-Codes of Constant Weight Three

On the Maximum Number of Codewords of X-Codes of Constant Weight Three On the Maxiu Nuber of Codewords of X-Codes of Constant Weight Three arxiv:190.097v1 [cs.it] 2 Mar 2019 Yu Tsunoda Graduate School of Science and Engineering Chiba University 1- Yayoi-Cho Inage-Ku, Chiba

More information

Lecture 9 November 23, 2015

Lecture 9 November 23, 2015 CSC244: Discrepancy Theory in Coputer Science Fall 25 Aleksandar Nikolov Lecture 9 Noveber 23, 25 Scribe: Nick Spooner Properties of γ 2 Recall that γ 2 (A) is defined for A R n as follows: γ 2 (A) = in{r(u)

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE227C (Spring 2018): Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee227c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee227c@berkeley.edu October

More information

Convex Programming for Scheduling Unrelated Parallel Machines

Convex Programming for Scheduling Unrelated Parallel Machines Convex Prograing for Scheduling Unrelated Parallel Machines Yossi Azar Air Epstein Abstract We consider the classical proble of scheduling parallel unrelated achines. Each job is to be processed by exactly

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

Efficient Filter Banks And Interpolators

Efficient Filter Banks And Interpolators Efficient Filter Banks And Interpolators A. G. DEMPSTER AND N. P. MURPHY Departent of Electronic Systes University of Westinster 115 New Cavendish St, London W1M 8JS United Kingdo Abstract: - Graphical

More information

Optimal Parallel Sux Tree Construction. Ramesh Hariharan y. April 1, Abstract

Optimal Parallel Sux Tree Construction. Ramesh Hariharan y. April 1, Abstract Optial Parallel Sux Tree Construction Raesh Hariharan y April 1, 1997 Abstract An O()-work, O()-space, O(log 4 )-tie CREW-PRAM algorith for constructing the sux tree of a string s of length drawn fro any

More information

arxiv: v1 [cs.ds] 15 Feb 2012

arxiv: v1 [cs.ds] 15 Feb 2012 Linear-Space Substring Range Counting over Polylogarithmic Alphabets Travis Gagie 1 and Pawe l Gawrychowski 2 1 Aalto University, Finland travis.gagie@aalto.fi 2 Max Planck Institute, Germany gawry@cs.uni.wroc.pl

More information

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words)

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words) 1 A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine (1900 words) Contact: Jerry Farlow Dept of Matheatics Univeristy of Maine Orono, ME 04469 Tel (07) 866-3540 Eail: farlow@ath.uaine.edu

More information

Fast Montgomery-like Square Root Computation over GF(2 m ) for All Trinomials

Fast Montgomery-like Square Root Computation over GF(2 m ) for All Trinomials Fast Montgoery-like Square Root Coputation over GF( ) for All Trinoials Yin Li a, Yu Zhang a, a Departent of Coputer Science and Technology, Xinyang Noral University, Henan, P.R.China Abstract This letter

More information

Note on generating all subsets of a finite set with disjoint unions

Note on generating all subsets of a finite set with disjoint unions Note on generating all subsets of a finite set with disjoint unions David Ellis e-ail: dce27@ca.ac.uk Subitted: Dec 2, 2008; Accepted: May 12, 2009; Published: May 20, 2009 Matheatics Subject Classification:

More information

Fixed-to-Variable Length Distribution Matching

Fixed-to-Variable Length Distribution Matching Fixed-to-Variable Length Distribution Matching Rana Ali Ajad and Georg Böcherer Institute for Counications Engineering Technische Universität München, Gerany Eail: raa2463@gail.co,georg.boecherer@tu.de

More information

On Poset Merging. 1 Introduction. Peter Chen Guoli Ding Steve Seiden. Keywords: Merging, Partial Order, Lower Bounds. AMS Classification: 68W40

On Poset Merging. 1 Introduction. Peter Chen Guoli Ding Steve Seiden. Keywords: Merging, Partial Order, Lower Bounds. AMS Classification: 68W40 On Poset Merging Peter Chen Guoli Ding Steve Seiden Abstract We consider the follow poset erging proble: Let X and Y be two subsets of a partially ordered set S. Given coplete inforation about the ordering

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago

More information

Estimating Entropy and Entropy Norm on Data Streams

Estimating Entropy and Entropy Norm on Data Streams Estiating Entropy and Entropy Nor on Data Streas Ait Chakrabarti 1, Khanh Do Ba 1, and S. Muthukrishnan 2 1 Departent of Coputer Science, Dartouth College, Hanover, NH 03755, USA 2 Departent of Coputer

More information

Lecture 18 April 26, 2012

Lecture 18 April 26, 2012 6.851: Advanced Data Structures Spring 2012 Prof. Erik Demaine Lecture 18 April 26, 2012 1 Overview In the last lecture we introduced the concept of implicit, succinct, and compact data structures, and

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 2: PAC Learning and VC Theory I Fro Adversarial Online to Statistical Three reasons to ove fro worst-case deterinistic

More information

Bipartite subgraphs and the smallest eigenvalue

Bipartite subgraphs and the smallest eigenvalue Bipartite subgraphs and the sallest eigenvalue Noga Alon Benny Sudaov Abstract Two results dealing with the relation between the sallest eigenvalue of a graph and its bipartite subgraphs are obtained.

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

e-companion ONLY AVAILABLE IN ELECTRONIC FORM

e-companion ONLY AVAILABLE IN ELECTRONIC FORM OPERATIONS RESEARCH doi 10.1287/opre.1070.0427ec pp. ec1 ec5 e-copanion ONLY AVAILABLE IN ELECTRONIC FORM infors 07 INFORMS Electronic Copanion A Learning Approach for Interactive Marketing to a Custoer

More information

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions Tight Inforation-Theoretic Lower Bounds for Welfare Maxiization in Cobinatorial Auctions Vahab Mirrokni Jan Vondrák Theory Group, Microsoft Dept of Matheatics Research Princeton University Redond, WA 9805

More information

Combining Classifiers

Combining Classifiers Cobining Classifiers Generic ethods of generating and cobining ultiple classifiers Bagging Boosting References: Duda, Hart & Stork, pg 475-480. Hastie, Tibsharini, Friedan, pg 246-256 and Chapter 10. http://www.boosting.org/

More information

On Constant Power Water-filling

On Constant Power Water-filling On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives

More information

Improved Guarantees for Agnostic Learning of Disjunctions

Improved Guarantees for Agnostic Learning of Disjunctions Iproved Guarantees for Agnostic Learning of Disjunctions Pranjal Awasthi Carnegie Mellon University pawasthi@cs.cu.edu Avri Blu Carnegie Mellon University avri@cs.cu.edu Or Sheffet Carnegie Mellon University

More information

Solutions of some selected problems of Homework 4

Solutions of some selected problems of Homework 4 Solutions of soe selected probles of Hoework 4 Sangchul Lee May 7, 2018 Proble 1 Let there be light A professor has two light bulbs in his garage. When both are burned out, they are replaced, and the next

More information

The degree of a typical vertex in generalized random intersection graph models

The degree of a typical vertex in generalized random intersection graph models Discrete Matheatics 306 006 15 165 www.elsevier.co/locate/disc The degree of a typical vertex in generalized rando intersection graph odels Jerzy Jaworski a, Michał Karoński a, Dudley Stark b a Departent

More information

Bloom Filters. filters: A survey, Internet Mathematics, vol. 1 no. 4, pp , 2004.

Bloom Filters. filters: A survey, Internet Mathematics, vol. 1 no. 4, pp , 2004. Bloo Filters References A. Broder and M. Mitzenacher, Network applications of Bloo filters: A survey, Internet Matheatics, vol. 1 no. 4, pp. 485-509, 2004. Li Fan, Pei Cao, Jussara Aleida, Andrei Broder,

More information

Lean Walsh Transform

Lean Walsh Transform Lean Walsh Transfor Edo Liberty 5th March 007 inforal intro We show an orthogonal atrix A of size d log 4 3 d (α = log 4 3) which is applicable in tie O(d). By applying a rando sign change atrix S to the

More information

On Process Complexity

On Process Complexity On Process Coplexity Ada R. Day School of Matheatics, Statistics and Coputer Science, Victoria University of Wellington, PO Box 600, Wellington 6140, New Zealand, Eail: ada.day@cs.vuw.ac.nz Abstract Process

More information

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization Recent Researches in Coputer Science Support Vector Machine Classification of Uncertain and Ibalanced data using Robust Optiization RAGHAV PAT, THEODORE B. TRAFALIS, KASH BARKER School of Industrial Engineering

More information

Lecture 21 Principle of Inclusion and Exclusion

Lecture 21 Principle of Inclusion and Exclusion Lecture 21 Principle of Inclusion and Exclusion Holden Lee and Yoni Miller 5/6/11 1 Introduction and first exaples We start off with an exaple Exaple 11: At Sunnydale High School there are 28 students

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

Compressed Index for Dynamic Text

Compressed Index for Dynamic Text Compressed Index for Dynamic Text Wing-Kai Hon Tak-Wah Lam Kunihiko Sadakane Wing-Kin Sung Siu-Ming Yiu Abstract This paper investigates how to index a text which is subject to updates. The best solution

More information

Randomized Accuracy-Aware Program Transformations For Efficient Approximate Computations

Randomized Accuracy-Aware Program Transformations For Efficient Approximate Computations Randoized Accuracy-Aware Progra Transforations For Efficient Approxiate Coputations Zeyuan Allen Zhu Sasa Misailovic Jonathan A. Kelner Martin Rinard MIT CSAIL zeyuan@csail.it.edu isailo@it.edu kelner@it.edu

More information

Research Article Rapidly-Converging Series Representations of a Mutual-Information Integral

Research Article Rapidly-Converging Series Representations of a Mutual-Information Integral International Scholarly Research Network ISRN Counications and Networking Volue 11, Article ID 5465, 6 pages doi:1.54/11/5465 Research Article Rapidly-Converging Series Representations of a Mutual-Inforation

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

DSPACE(n)? = NSPACE(n): A Degree Theoretic Characterization

DSPACE(n)? = NSPACE(n): A Degree Theoretic Characterization DSPACE(n)? = NSPACE(n): A Degree Theoretic Characterization Manindra Agrawal Departent of Coputer Science and Engineering Indian Institute of Technology, Kanpur 208016, INDIA eail: anindra@iitk.ernet.in

More information

Stochastic Machine Scheduling with Precedence Constraints

Stochastic Machine Scheduling with Precedence Constraints Stochastic Machine Scheduling with Precedence Constraints Martin Skutella Fakultät II, Institut für Matheatik, Sekr. MA 6-, Technische Universität Berlin, 0623 Berlin, Gerany skutella@ath.tu-berlin.de

More information

Ştefan ŞTEFĂNESCU * is the minimum global value for the function h (x)

Ştefan ŞTEFĂNESCU * is the minimum global value for the function h (x) 7Applying Nelder Mead s Optiization Algorith APPLYING NELDER MEAD S OPTIMIZATION ALGORITHM FOR MULTIPLE GLOBAL MINIMA Abstract Ştefan ŞTEFĂNESCU * The iterative deterinistic optiization ethod could not

More information

Improved multiprocessor global schedulability analysis

Improved multiprocessor global schedulability analysis Iproved ultiprocessor global schedulability analysis Sanjoy Baruah The University of North Carolina at Chapel Hill Vincenzo Bonifaci Max-Planck Institut für Inforatik Sebastian Stiller Technische Universität

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

MULTIPLAYER ROCK-PAPER-SCISSORS

MULTIPLAYER ROCK-PAPER-SCISSORS MULTIPLAYER ROCK-PAPER-SCISSORS CHARLOTTE ATEN Contents 1. Introduction 1 2. RPS Magas 3 3. Ites as a Function of Players and Vice Versa 5 4. Algebraic Properties of RPS Magas 6 References 6 1. Introduction

More information

Feature Extraction Techniques

Feature Extraction Techniques Feature Extraction Techniques Unsupervised Learning II Feature Extraction Unsupervised ethods can also be used to find features which can be useful for categorization. There are unsupervised ethods that

More information

4 = (0.02) 3 13, = 0.25 because = 25. Simi-

4 = (0.02) 3 13, = 0.25 because = 25. Simi- Theore. Let b and be integers greater than. If = (. a a 2 a i ) b,then for any t N, in base (b + t), the fraction has the digital representation = (. a a 2 a i ) b+t, where a i = a i + tk i with k i =

More information

Boosting with log-loss

Boosting with log-loss Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson) February 5,

Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson) February 5, Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson) February 5, 2015 31 11 Motif Finding Sources for this section: Rouchka, 1997, A Brief Overview of Gibbs Sapling. J. Buhler, M. Topa:

More information

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes Graphical Models in Local, Asyetric Multi-Agent Markov Decision Processes Ditri Dolgov and Edund Durfee Departent of Electrical Engineering and Coputer Science University of Michigan Ann Arbor, MI 48109

More information

Sharp Time Data Tradeoffs for Linear Inverse Problems

Sharp Time Data Tradeoffs for Linear Inverse Problems Sharp Tie Data Tradeoffs for Linear Inverse Probles Saet Oyak Benjain Recht Mahdi Soltanolkotabi January 016 Abstract In this paper we characterize sharp tie-data tradeoffs for optiization probles used

More information

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS A Thesis Presented to The Faculty of the Departent of Matheatics San Jose State University In Partial Fulfillent of the Requireents

More information

Forbidden Patterns. {vmakinen leena.salmela

Forbidden Patterns. {vmakinen leena.salmela Forbidden Patterns Johannes Fischer 1,, Travis Gagie 2,, Tsvi Kopelowitz 3, Moshe Lewenstein 4, Veli Mäkinen 5,, Leena Salmela 5,, and Niko Välimäki 5, 1 KIT, Karlsruhe, Germany, johannes.fischer@kit.edu

More information

Testing Properties of Collections of Distributions

Testing Properties of Collections of Distributions Testing Properties of Collections of Distributions Reut Levi Dana Ron Ronitt Rubinfeld April 9, 0 Abstract We propose a fraework for studying property testing of collections of distributions, where the

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

Necessity of low effective dimension

Necessity of low effective dimension Necessity of low effective diension Art B. Owen Stanford University October 2002, Orig: July 2002 Abstract Practitioners have long noticed that quasi-monte Carlo ethods work very well on functions that

More information

arxiv: v2 [math.co] 3 Dec 2008

arxiv: v2 [math.co] 3 Dec 2008 arxiv:0805.2814v2 [ath.co] 3 Dec 2008 Connectivity of the Unifor Rando Intersection Graph Sion R. Blacburn and Stefanie Gere Departent of Matheatics Royal Holloway, University of London Egha, Surrey TW20

More information

L S is not p m -hard for NP. Moreover, we prove for every L NP P, that there exists a sparse S EXP such that L S is not p m -hard for NP.

L S is not p m -hard for NP. Moreover, we prove for every L NP P, that there exists a sparse S EXP such that L S is not p m -hard for NP. Properties of NP-Coplete Sets Christian Glaßer, A. Pavan, Alan L. Selan, Saik Sengupta January 15, 2004 Abstract We study several properties of sets that are coplete for NP. We prove that if L is an NP-coplete

More information

a a a a a a a m a b a b

a a a a a a a m a b a b Algebra / Trig Final Exa Study Guide (Fall Seester) Moncada/Dunphy Inforation About the Final Exa The final exa is cuulative, covering Appendix A (A.1-A.5) and Chapter 1. All probles will be ultiple choice

More information

Analyzing Simulation Results

Analyzing Simulation Results Analyzing Siulation Results Dr. John Mellor-Cruey Departent of Coputer Science Rice University johnc@cs.rice.edu COMP 528 Lecture 20 31 March 2005 Topics for Today Model verification Model validation Transient

More information

Uniform Approximation and Bernstein Polynomials with Coefficients in the Unit Interval

Uniform Approximation and Bernstein Polynomials with Coefficients in the Unit Interval Unifor Approxiation and Bernstein Polynoials with Coefficients in the Unit Interval Weiang Qian and Marc D. Riedel Electrical and Coputer Engineering, University of Minnesota 200 Union St. S.E. Minneapolis,

More information

CSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13

CSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13 CSE55: Randoied Algoriths and obabilistic Analysis May 6, Lecture Lecturer: Anna Karlin Scribe: Noah Siegel, Jonathan Shi Rando walks and Markov chains This lecture discusses Markov chains, which capture

More information

A Note on the Applied Use of MDL Approximations

A Note on the Applied Use of MDL Approximations A Note on the Applied Use of MDL Approxiations Daniel J. Navarro Departent of Psychology Ohio State University Abstract An applied proble is discussed in which two nested psychological odels of retention

More information

lecture 36: Linear Multistep Mehods: Zero Stability

lecture 36: Linear Multistep Mehods: Zero Stability 95 lecture 36: Linear Multistep Mehods: Zero Stability 5.6 Linear ultistep ethods: zero stability Does consistency iply convergence for linear ultistep ethods? This is always the case for one-step ethods,

More information

Understanding Machine Learning Solution Manual

Understanding Machine Learning Solution Manual Understanding Machine Learning Solution Manual Written by Alon Gonen Edited by Dana Rubinstein Noveber 17, 2014 2 Gentle Start 1. Given S = ((x i, y i )), define the ultivariate polynoial p S (x) = i []:y

More information

Probability and Stochastic Processes: A Friendly Introduction for Electrical and Computer Engineers Roy D. Yates and David J.

Probability and Stochastic Processes: A Friendly Introduction for Electrical and Computer Engineers Roy D. Yates and David J. Probability and Stochastic Processes: A Friendly Introduction for Electrical and oputer Engineers Roy D. Yates and David J. Goodan Proble Solutions : Yates and Goodan,1..3 1.3.1 1.4.6 1.4.7 1.4.8 1..6

More information

#A52 INTEGERS 10 (2010), COMBINATORIAL INTERPRETATIONS OF BINOMIAL COEFFICIENT ANALOGUES RELATED TO LUCAS SEQUENCES

#A52 INTEGERS 10 (2010), COMBINATORIAL INTERPRETATIONS OF BINOMIAL COEFFICIENT ANALOGUES RELATED TO LUCAS SEQUENCES #A5 INTEGERS 10 (010), 697-703 COMBINATORIAL INTERPRETATIONS OF BINOMIAL COEFFICIENT ANALOGUES RELATED TO LUCAS SEQUENCES Bruce E Sagan 1 Departent of Matheatics, Michigan State University, East Lansing,

More information

THE AVERAGE NORM OF POLYNOMIALS OF FIXED HEIGHT

THE AVERAGE NORM OF POLYNOMIALS OF FIXED HEIGHT THE AVERAGE NORM OF POLYNOMIALS OF FIXED HEIGHT PETER BORWEIN AND KWOK-KWONG STEPHEN CHOI Abstract. Let n be any integer and ( n ) X F n : a i z i : a i, ± i be the set of all polynoials of height and

More information

A Simple Regression Problem

A Simple Regression Problem A Siple Regression Proble R. M. Castro March 23, 2 In this brief note a siple regression proble will be introduced, illustrating clearly the bias-variance tradeoff. Let Y i f(x i ) + W i, i,..., n, where

More information

Energy-Efficient Threshold Circuits Computing Mod Functions

Energy-Efficient Threshold Circuits Computing Mod Functions Energy-Efficient Threshold Circuits Coputing Mod Functions Akira Suzuki Kei Uchizawa Xiao Zhou Graduate School of Inforation Sciences, Tohoku University Araaki-aza Aoba 6-6-05, Aoba-ku, Sendai, 980-8579,

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information

Physically Based Modeling CS Notes Spring 1997 Particle Collision and Contact

Physically Based Modeling CS Notes Spring 1997 Particle Collision and Contact Physically Based Modeling CS 15-863 Notes Spring 1997 Particle Collision and Contact 1 Collisions with Springs Suppose we wanted to ipleent a particle siulator with a floor : a solid horizontal plane which

More information

Compression and Predictive Distributions for Large Alphabet i.i.d and Markov models

Compression and Predictive Distributions for Large Alphabet i.i.d and Markov models 2014 IEEE International Syposiu on Inforation Theory Copression and Predictive Distributions for Large Alphabet i.i.d and Markov odels Xiao Yang Departent of Statistics Yale University New Haven, CT, 06511

More information