Structured Motifs Search


Technical Report UDIMI/15/2003/RR

Michele Morgante (1), Alberto Policriti (2), Nicola Vitacolonna (2) and Andrea Zuccolo (1)

University of Udine
(1) Department of Crop Sciences and Agricultural Engineering
(2) Department of Mathematics and Computer Science
via delle Scienze 206, Udine (Italy)

19th September 2003

Abstract

In this paper we describe an algorithm for the localization of structured models, i.e. sequences of (simple) motifs and distance constraints. It basically combines standard pattern matching procedures with a constraint satisfaction solver, and it has the ability, not present in similar tools, to search for partial matches. A significant feature of our approach, especially in terms of efficiency for the application context, is that the (potentially) exponentially many solutions to the considered problem are represented in compact form as a graph. Moreover, the time and space necessary to build the graph are linear in the number of occurrences of the component patterns.

1 Introduction

A relevant task of computational biology consists in helping identify conserved features in a set of DNA or protein sequences. This definition actually covers a range of problems, from isolating previously unknown domains to finding occurrences of some kind of consensus model that has already been determined. In this paper, we assume that domains, signals, consensus models and so on can always be represented in a particular form which we will call, following the terminology used by other authors, structured motifs (or structured models), which can be thought of as compound patterns made of a list of motifs, or patterns, and a list of intervals that specify at what distances adjacent motifs should occur (see [4]).
To help distinguish the two main classes of problems mentioned above, we will say that a de novo identification of signals in sequences is a problem of structured motif extraction, while finding the positions in a sequence at which a given model occurs is a problem of structured motif localization. The former has been widely investigated and there exist elaborate proposals in the literature ([1, 2, 4, 13, 14, 20] is an incomplete list).

We will focus our attention on the problem of localization. Since, from a syntactic point of view, structured motifs are a special kind of regular expressions [footnote 1], the naïve approach is to reduce the problem to a regular pattern matching problem ([4, 21]). However, given the simplified form structured motifs exhibit as regular expressions, this is not the best way to go. There are different approaches in the literature, notably [19] and [16, 17], which we will discuss in Section 3.

In this paper we describe an algorithm for the localization of structured models, which can be thought of as compound patterns made of a list of simple motifs and a list of intervals that specify at what distances adjacent motifs should occur (see [4]). Our algorithm combines standard pattern matching procedures with a constraint satisfaction solver whose peculiar form allows us to represent the (potentially) exponentially many solutions as a graph in an efficient way. The features of our proposal include flexibility (any exact or approximate pattern/regular expression matching algorithm or alignment procedure can be used to search for simple motifs), the ability to search for partial occurrences of a structured motif (i.e., the ones for which a limited number of component patterns is missing) and support for negative gaps (which imply potential overlapping of component patterns). As a byproduct, we obtain a data structure which can be used to store all relevant information about the occurrences of a structured model in a sequence.

The possibility of producing the solution set in compact form is based on a view of the occurrences of a given motif as an interval of positions in the searched string. A number of simple results follow from this view and allow the design of fast manipulation procedures. Moreover, thanks to this interval representation, both the size of the data structure and the size of the output can be kept linear in the number of occurrences of the component patterns.
The paper is organized as follows: in Section 2 we describe the context in which our algorithm was developed and the purposes it serves; Section 3 describes related approaches to the problem; Section 4 defines the basic notions needed later; in Section 5 the algorithm is described in full detail; Section 6 reports some experimental results; finally, Section 7 summarizes what we have done and gives some hints for future work.

2 The application context

In many biological contexts, already available experimental data allow biologists to determine a more or less precise model of what they are searching for, for instance, by a multiple alignment of a set of homologous sequences. Such models are usually built on the basis of the analysis of a relatively small data set, and they are used to perform further searches in other sequences. Informally, such models can often be translated into structured motifs.

The biological problem we initially faced was to automatically find a particular kind of transposable elements called LTR retrotransposons ([11], [12]) in the rice genome. LTR retrotransposons are characterized by an approximate direct repeat (the so-called Long Terminal Repeats, LTRs for short) that varies in size from 100 base pairs to several kilobases. Therefore, one can try to locate retrotransposons by identifying such repeats. This can be done by programs like REPuter ([13]) or by more specific tools like LTR_STRUC ([15]).

Footnote 1: When distances between component patterns are nonnegative.

Sometimes, though, one or both LTRs are missing, so this approach cannot be pursued and it is necessary to search for conserved domains inside the retroelement. In one year of work, we have collected experimental data from about 10% of the rice genome, allowing us to localize several LTR retrotransposons and to establish, by multiple alignment, a number of conserved features ([3, 22]). For example, many retrotransposons belonging to the Ty1-copia group contain, in correspondence to the gene encoding the reverse transcriptase, a match of the following structured motif:

MT[115, 136]MTNTAYGG[121, 151]GTNGAYGAY,

which consists of three patterns (MT, MTNTAYGG and GTNGAYGAY) written in the IUPAC alphabet, and two intervals imposing constraints on the relative distances between adjacent patterns. Typically, the component patterns of such structured motifs are from a few to a few hundred bases long, and gaps can span several dozen bases. Although patterns are usually specified using the IUPAC alphabet, an effective search requires allowing a positive (although in general very small) number of errors.

But that is not enough yet: when searching for complex structured motifs it is likely that not all of their component patterns will be found in the searched sequence. Nonetheless, if enough patterns are present, or if the more important ones are found, we may say that an approximate occurrence of the structured motif has been found. This leads us to the possibility of missing patterns and to the introduction of weights modelling the different significance of patterns, together with some threshold check. In this paper, we limit ourselves to the case without weights.

3 Related work

Structured motifs are called Classes of Characters and Bounded Gaps (CBG) expressions in [19].
Although their definition is equivalent to the one used in this paper, the use of these expressions is quite different: in [19], the underlying motivation is to give an efficient algorithm for finding amino acid sequences in a database like PROSITE. A sequence of this kind is, for example, [RK]-x(2, 3)-[DE]-x(2, 3)-Y, where the brackets match any of the letters inside and x(2, 3) is a gap of length between 2 and 3. In practice, the considered CBG expressions are usually not very long: the parts between gaps typically consist of a few characters and distances also span a few letters. In their experiments, Navarro et al. used patterns for which the maximum length of a match did not exceed the number of bits of a computer word (typically, 32 or 64). The algorithm given in [19] is very efficient in solving the problem of finding such patterns, by making use of bit parallelism to implement a non-deterministic ε-automaton. But their algorithm becomes less efficient as gaps get longer, because the size of the automaton increases accordingly. Also, bit operations are more costly when they have to be done on several computer words instead of one.

Myers et al. [16, 17] give an algorithm for finding network expressions with spacers, where a network expression is a regular expression without Kleene star and spacers are distance ranges between adjacent network expressions. Two algorithms are described in [17]: the first is an approximate regular expression matching procedure based on the construction of an ε-automaton called an alignment graph; the second is a backtracking procedure with an optimization technique to determine the best (according to some statistical criterion) backtrack order. The algorithm allows negative gaps, e.g. $M_1[-5, 9]M_2$ means that the left end of an occurrence of motif $M_2$ must be found between 5 symbols to the left and 9 symbols to the right of the right end of an occurrence of motif $M_1$.

Like Myers et al., we use a separate subroutine to search for component motifs, but our approach differs from the one presented in [17] in that we do not use a backtracking strategy to find structured motifs; instead, we build and solve a constraint satisfaction problem. Constraint solving techniques have already been applied to the search for patterns in biosequences, e.g. in [7]. However, as we only consider distance constraints, the CSP we obtain is simpler than most other proposals in the literature, and the method we use to solve it is more specialized.

4 Preliminary notions

Biologists and computer scientists use different terms to denote the same concepts. We will consider character, nucleotide, base and symbol to be equivalent, and we will use the term motif as a synonym for pattern. When we want to stress that we are talking of a motif/pattern which is part of a structured motif, we will talk of simple motifs or component patterns. For the sake of simplicity, we restrict ourselves to DNA, but what we say can be applied to RNA and proteins as well.

Let $\Sigma$ be a finite set of symbols, called the alphabet. A string (or sequence) $S = s_1 \cdots s_n$ over $\Sigma$ is a finite sequence of symbols in $\Sigma$. The length of $S$ is $|S| = n$. With $\Sigma^*$ we denote the set of all strings over $\Sigma$, including the empty string $\varepsilon$ (whose length is $|\varepsilon| = 0$). A language is a subset of $\Sigma^*$.
A string $F$ is a factor of another string $S = s_1 \cdots s_n$ if $F = s_i \cdots s_j$ for some $1 \le i \le j \le n$. The empty string $\varepsilon$ is a factor of any string. Throughout this paper, we fix $\Sigma_{DNA} = \{a, c, g, t\}$, i.e. the alphabet of DNA, and $\Sigma_{IUPAC} = \{A, C, G, T, R, Y, W, S, M, K, H, B, V, D, N\}$, i.e. the standard IUPAC alphabet used to denote nucleic acids. The interpretation of the different symbols in $\Sigma_{IUPAC}$ is shown in Table 1. We will assume that the text to be searched is a string over $\Sigma_{DNA}$.

Patterns can be viewed as specifications of languages. In the simplest case a pattern defines only one sequence of characters, i.e. the singleton set containing the string itself. Finite sets of strings can be specified in the standard IUPAC alphabet, e.g. AWTR defines the set {aata, aatg, atta, attg}, and by network expressions ([17]), which are obtained by concatenation and union of symbols, e.g. (ag | tt) c corresponds to {agc, ttc}. Another possibility consists in specifying an amino acid pattern to be searched for in a DNA sequence: for example, the polypeptide QF (a glutamine followed by a phenylalanine) defines the set of nucleotide sequences {caattt, caattc, cagttt, cagttc}.

Let $L(P)$ be the language generated by $P$. Given a sequence $S = s_1 \cdots s_n \in \Sigma_{DNA}^*$ and a pattern $P$ such that $L(P) \subseteq \Sigma_{DNA}^*$, we say that a factor $F$ of $S$ exactly matches $P$ if $F \in L(P)$. For instance, aata exactly matches the different patterns aata, AWTR and (aa | ag) t (a | c).
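The notion of the language of an IUPAC pattern can be illustrated with a short Python sketch. The names `IUPAC`, `language` and `exactly_matches` are hypothetical helpers introduced only for this illustration; the symbol interpretation follows Table 1.

```python
from itertools import product

# Interpretation of the IUPAC symbols as sets of DNA bases (see Table 1).
IUPAC = {
    "A": "a", "C": "c", "G": "g", "T": "t",
    "R": "ag", "Y": "tc", "W": "at", "S": "gc", "M": "ac", "K": "gt",
    "H": "atc", "B": "gct", "V": "gac", "D": "gat", "N": "gatc",
}


def language(pattern):
    """Return L(P): every DNA string denoted by an IUPAC pattern."""
    choices = (IUPAC[symbol] for symbol in pattern)
    return {"".join(bases) for bases in product(*choices)}


def exactly_matches(factor, pattern):
    """Check F in L(P) position by position, without enumerating L(P)."""
    return len(factor) == len(pattern) and all(
        base in IUPAC[symbol] for base, symbol in zip(factor, pattern)
    )
```

Enumerating `language(pattern)` grows exponentially with the number of ambiguous symbols, so the positional check in `exactly_matches` is the more practical of the two for long patterns.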

Symbol  Bases         Meaning
A       {a}           adenine
C       {c}           cytosine
G       {g}           guanine
T       {t}           thymine
R       {a, g}        purine
Y       {t, c}        pyrimidine
W       {a, t}        adenine or thymine
S       {g, c}        guanine or cytosine
M       {a, c}        adenine or cytosine
K       {g, t}        guanine or thymine
H       {a, t, c}     adenine, thymine or cytosine
B       {g, c, t}     guanine, cytosine or thymine
V       {g, a, c}     guanine, adenine or cytosine
D       {g, a, t}     guanine, adenine or thymine
N       {g, a, t, c}  guanine, adenine, thymine or cytosine

Table 1: The IUPAC alphabet for nucleic acids.

To define inexact matching, suppose that a scoring function $\delta : \Sigma_{DNA}^* \times \Sigma_{DNA}^* \to \mathbb{R}$, expressing the degree of similarity between two sequences, is given. Then, we say that a factor $F$ of $S$ approximately matches a pattern $P$ within a fixed threshold $t \in \mathbb{R}$ if there exists $F' \in L(P)$ such that $\delta(F, F') \le t$. A common choice is to define $\delta(F, F')$ as the minimum number of insertions, deletions and substitutions to be performed in order to obtain $F'$ from $F$, that is, as the edit (or Levenshtein) distance between $F$ and $F'$.

A structured motif, or model, is a pair $(P, I)$, where $P = (P_1, \dots, P_k)$ is a sequence of patterns and $I = ([l_1, u_1], \dots, [l_{k-1}, u_{k-1}])$, with $l_i, u_i \in \mathbb{Z}$ and $l_i \le u_i$ for $1 \le i < k$, is a sequence of (closed) intervals (see [4]). A structured motif is usually written as an expression of the form

$P_1 [l_1, u_1] P_2 [l_2, u_2] P_3 \cdots P_{k-1} [l_{k-1}, u_{k-1}] P_k$.

An exact match of $(P, I)$ in a sequence $S$ is a $k$-tuple $(F_1, \dots, F_k)$ of factors of $S$ such that (i) $F_i$ matches $P_i$, for $1 \le i \le k$, possibly according to different matching criteria; (ii) the distance $d_i$ between the end position of $F_i$ and the start position of $F_{i+1}$ is such that $l_i \le d_i \le u_i$, for $1 \le i < k$. Note that, although the structured motif is said to occur exactly, some or even all the component patterns may occur approximately.

If the structured motif has many component patterns, it is realistic to assume that some of them do not occur in the scanned sequence, at least at the right positions. It is important, then, to be able to search for partial occurrences.
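Condition (ii) of the definition of an exact match can be sketched as follows. This is only an illustration under stated assumptions: the function name `is_exact_match` is hypothetical, occurrences are given as (start, end) pairs with the end position one character beyond the last matched character (the convention adopted later in Section 5.1), and the sample positions are taken from the example of Table 2.

```python
def is_exact_match(occurrences, intervals):
    """Check the distance constraints l_i <= d_i <= u_i between adjacent factors.

    occurrences: one (start, end) pair per component pattern, in model order,
                 with `end` one position past the last matched character.
    intervals:   one (l_i, u_i) pair per pair of adjacent patterns.
    """
    if len(occurrences) != len(intervals) + 1:
        return False
    adjacent = zip(occurrences, occurrences[1:], intervals)
    for (_, end), (next_start, _), (low, high) in adjacent:
        distance = next_start - end  # d_i under the end-exclusive convention
        if not low <= distance <= high:
            return False
    return True


# Intervals of the model tgt[4, 20]ct[2, 5]gg[0, 1]ta[2, 15]aca.
MODEL_INTERVALS = [(4, 20), (2, 5), (0, 1), (2, 15)]
```

The check assumes that each factor has already been verified against its own pattern, i.e. it covers the distance condition only.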
To do that, we must impose constraints on non-adjacent patterns. We discuss how such constraints can be reasonably derived from the definition of a structured motif in Section 5.2. For now, we assume that for each pair of simple motifs $P_i$ and $P_j$ a constraint $C(i, j)$ on their relative distance can be defined. So, given an integer $q$ such that $1 \le q \le k$, an approximate $q$-occurrence of $(P, I)$ in a sequence $S$ is a $q$-tuple $(F_1, \dots, F_q)$ of factors of $S$ for which there is a sequence of indices $1 \le j_1 < \cdots < j_q \le k$ such that (i) $F_i$ matches $P_{j_i}$, for $1 \le i \le q$, possibly according to different matching criteria; (ii) the distance $d_i$ between the end position of $F_i$ and the start position of $F_{i+1}$ is such that the constraint $C(j_i, j_{i+1})$ is satisfied, for $1 \le i < q$.

The problem discussed in Section 2 is formalized as follows: given a structured model $(P, I)$ with $k$ component patterns, a sequence $S \in \Sigma_{DNA}^*$ and an integer $q$ such that $1 \le q \le k$, find all $q$-occurrences of $(P, I)$ in $S$.

In what follows, and in the implementation of our algorithm, we consider patterns over $\Sigma_{IUPAC}$, matched either exactly or within a given edit distance. This is no loss of generality, since our algorithm can be easily generalized. A partial justification of our choice is given in Section 6.

5 The algorithm

The algorithm we propose is a two-step procedure:

1. find the occurrences of all the component patterns of the structured model;
2. combine the occurrences that satisfy the distance constraints into a structured motif.

The first step reduces to using some standard pattern matching algorithm. The second step amounts to the definition of a Constraint Satisfaction Problem (CSP) over finite domains. The CSP is solved by building and manipulating a directed acyclic graph of the occurrences.

5.1 Searching for simple motifs

Since efficient algorithms exist for many of the pattern matching problems discussed in Section 4, we can start by searching for all the component patterns of $(P, I)$ independently, and then combine those occurrences which fit the model. Given a sequence $S = s_1 \cdots s_n$ and a structured motif $(P, I)$, the first step of our algorithm is: find all the occurrences of $P_1, \dots, P_k$ in $S$ with edit distances $e_1, \dots, e_k$, respectively.
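For the simplest case of this first step (a plain string searched exactly, i.e. with edit distance 0), a minimal Python sketch is given below. It is a naïve scan written for illustration only, not the suffix-tree or Aho-Corasick search discussed in this section; the function name `occurrences` is hypothetical, and the 1-based, end-exclusive (start, end) output format is assumed from the convention used for the examples in this section.

```python
def occurrences(text, pattern):
    """Return all exact occurrences of `pattern` in `text`, sorted by start.

    Each occurrence is a pair (s, f): s is the 1-based start position and
    f is one position beyond the last matched character.
    """
    found = []
    start = text.find(pattern)
    while start != -1:
        found.append((start + 1, start + 1 + len(pattern)))
        # Resume one character later so that overlapping matches are kept.
        start = text.find(pattern, start + 1)
    return found
```

Because the scan proceeds from left to right, the output is already sorted by start position, which is the property the second step relies on.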
We assume that the output of the search is the list, sorted by start position, of the occurrences of the $P_i$'s, where each occurrence is a pair $(s, f)$ representing its start and its end position. For convenience, we define the end position $f$ to be one character beyond the last character of a match. The length of an occurrence of $P_i$ can range from $|P_i| - e_i$ to $|P_i| + e_i$. Table 2 shows the result of searching for the component patterns of a structured motif in the sample sequence of Figure 1.

TGTaaaaCTggTGTCTcaGGaGGCTcgCTccGGaTAGGaACAaACAaaACAaaaaa

Figure 1: A sample DNA sequence. The parts of the sequence matching some component pattern of the structured model in Table 2 are shown with upper-case letters.

Pattern  Occurrences
tgt      $r_{11} = (1, 4)$, $r_{12} = (12, 15)$
ct       $r_{21} = (8, 10)$, $r_{22} = (15, 17)$, $r_{23} = (24, 26)$, $r_{24} = (28, 30)$
gg       $r_{31} = (19, 21)$, $r_{32} = (22, 24)$, $r_{33} = (32, 34)$, $r_{34} = (37, 39)$
ta       $r_{41} = (35, 37)$
aca      $r_{51} = (40, 43)$, $r_{52} = (44, 47)$, $r_{53} = (49, 52)$

Table 2: The result of the exact search for the component patterns of the structured model tgt[4, 20]ct[2, 5]gg[0, 1]ta[2, 15]aca in the sequence shown in Figure 1.

We say that a match of a component pattern $P_i$ is contained in another if there are two occurrences $(s, f)$ and $(s', f')$ of $P_i$ such that $s' \le s < f < f'$. A match of $P_i$ can be contained in another only if $P_i$ is searched for with $e_i > 0$. Henceforth, we assume that the search for simple motifs never produces an occurrence that is contained in another. In particular, no two matches of the same pattern can have the same start position. The reason for this technical condition is explained in Section 5.3.

The complexity of this step depends on which kind of search is required. When $e_i = 0$ for all $i$, we may apply the Aho-Corasick algorithm, which runs in linear time. The same complexity can be achieved by preprocessing the sequence to build a suffix tree, which is what we have done in our implementation (see [10] for a thorough discussion of these topics). If $e_i > 0$ for some $i$, some approximate pattern matching procedure is needed. Multiple approximate string matching algorithms exist that achieve optimal average performance (see [8]). Finally, the restriction that the output be sorted usually does not require further computations, because most algorithms already produce a sorted output.

5.2 Building the constraint graph

The more interesting part of our procedure concerns the way in which we can combine different occurrences of simple motifs into a valid occurrence of the structured model.
In order to do this, we must build and solve a constraint satisfaction problem (CSP), which, in general, consists of a finite sequence of variables $X = x_1, \dots, x_k$ with respective domains $D_1, \dots, D_k$, together with a finite set $C$ of constraints, each on a subsequence of $X$. We will need the following notion: given a subsequence $X'$ of $X$, the subset of constraints in $C$ in which only variables of $X'$ occur is called the projection of $C$ on $X'$, denoted $C[X']$.

Let $D_i = \{(s_{i1}, f_{i1}), \dots, (s_{i m_i}, f_{i m_i})\}$, for $1 \le i \le k$, be the set of the start and end positions of the found occurrences of $P_i$. A constraint system with $k$ variables $x_1, \dots, x_k$, such that $x_i \in D_i$, can be defined as follows. Adjacent patterns are constrained as specified by the structured model: for $1 \le i < k$ we must have

$$ l_i \le (x_{i+1})_1 - (x_i)_2 \le u_i, \tag{1} $$

where $(\cdot)_1$ and $(\cdot)_2$ are the first and the second component of a pair, respectively. We denote the set of these constraints, for $1 \le i < k$, by $C_{Adj}$.

As we admit the possibility of missing at most $k - q$ patterns, we must derive constraints on non-adjacent patterns. How to do this is a matter of biological considerations rather than algorithmic ones. The question is: why is a pattern not found? One answer may be that the evolutionary process caused that pattern to be deleted from the sequence, thus shortening it. Another possibility is that the pattern degenerated by mutations or our search criteria are too stringent: so, the pattern exists, but we are not able to detect it. To take into account both possibilities, we choose to generalize the interval condition to possibly non-adjacent patterns $P_i$ and $P_{i+h}$, for $1 \le i < k$ and $1 \le h \le \min\{k - i, k - q + 1\}$, as follows:

$$ \sum_{r=i}^{i+h-1} l_r \le (x_{i+h})_1 - (x_i)_2 \le u_i + \sum_{r=i+1}^{i+h-1} (|P_r| + e_r + u_r), \tag{2} $$

where $|P_r| + e_r$ is the maximum length of a match of $P_r$ according to the model. We denote the set of these constraints by $C$. When $h = 1$ constraint (2) reduces to (1), so $C_{Adj} \subseteq C$. Besides, given a sequence of variables $x_i, x_{i+1}, \dots, x_{i+j-1}, x_{i+j}$, for some $i$ and $j$, if $C_{Adj}[x_i, \dots, x_{i+j}]$ is satisfied by some tuple $(r_i, \dots, r_{i+j}) \in D_i \times \cdots \times D_{i+j}$, then each $C[x_{i_1}, x_{i_2}]$, with $i \le i_1 < i_2 \le i + j$, is automatically satisfied [footnote 2] by the values $r_{i_1}$ and $r_{i_2}$. Note that, for any distinct pair of variables, there is at most one constraint involving those variables.

The problem of finding $q$-occurrences of a given structured model is equivalent to finding the solutions of all the $C[X']$'s such that $X'$ is a subsequence of variables $x_{i_1}, \dots, x_{i_l}$, with $l \ge q$. CSPs similar to this one, but over continuous intervals, have been investigated in [5], where they are called simple temporal problems (STPs). The approach in [5], however, does not work when finite domains are considered in place of intervals.
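As a worked instance of constraint (2), consider the model tgt[4, 20]ct[2, 5]gg[0, 1]ta[2, 15]aca of Table 2, and assume it is searched exactly, so that every $e_r = 0$. For the non-adjacent pair $P_1$ = tgt and $P_3$ = gg (that is, $i = 1$ and $h = 2$), the bounds work out as follows:

```latex
\begin{align*}
L &= \sum_{r=1}^{2} l_r = l_1 + l_2 = 4 + 2 = 6, \\
U &= u_1 + \sum_{r=2}^{2} \bigl(|P_r| + e_r + u_r\bigr)
   = 20 + (2 + 0 + 5) = 27,
\end{align*}
so the constraint between $x_1$ and $x_3$ reads
$6 \le (x_3)_1 - (x_1)_2 \le 27$.
```

For example, the occurrences $r_{11} = (1, 4)$ and $r_{31} = (19, 21)$ of Table 2 give $(x_3)_1 - (x_1)_2 = 19 - 4 = 15$, which lies within these bounds.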
The special form of our constraints over finite domains allows for a more efficient solution. Let $M = \sum_{i=1}^{k} m_i$ be the total number of occurrences of all the component patterns and let $r_{ij} \in D_i$ denote the $j$th occurrence of the $i$th pattern, for $1 \le i \le k$ and $1 \le j \le m_i$. The constraint graph for $C$ is a directed acyclic graph with $M$ nodes labelled by the $r_{ij}$'s such that there is an edge from $r_{ij}$ to $r_{(i+h)l}$, with $1 \le h \le \min\{k - i, k - q + 1\}$ and $1 \le l \le m_{i+h}$, if and only if the constraint $C[x_i, x_{i+h}]$ is satisfied when instantiated with the values $r_{ij}$ and $r_{(i+h)l}$.

The set of nodes of the graph can be partitioned into $k$ layers $D_1, \dots, D_k$, each one corresponding to the (ordered) set of occurrences of a component pattern. To emphasize that, we denote the graph with $((D_1, \dots, D_k), E)$, where $\bigcup_{i=1}^{k} D_i$ is the set of vertices and $E$ is the set of edges. The edges connect only vertices in layers $D_i$ and $D_j$ where $1 \le j - i \le k - q + 1$, and they are always directed towards a layer with greater index. We denote an edge from node $r$ to node $r'$ with $r \to r'$. Figure 2 shows an example of such a graph.

Footnote 2: We assume that the empty set of constraints is trivially satisfied.

The nodes can be totally ordered according to the start positions of the labelling occurrences, but for our purposes it is more useful to restrict such ordering to nodes in the same layer. So, we define a partial ordering $\prec$ over $\bigcup_{i=1}^{k} D_i \cup \{\bot\}$ as follows: given two nodes $r_1 = (s_1, f_1)$ and $r_2 = (s_2, f_2)$, we say that $r_1$ precedes $r_2$, written $r_1 \prec r_2$, if $r_1$ and $r_2$ belong to the same layer and $s_1 < s_2$. Besides, $r \prec \bot$ for every $r$. We write $r_1 \preceq r_2$ if either $r_1 \prec r_2$ or $r_1 = r_2$. The immediate successor of a node $r$ with respect to $\prec$ is written $succ(r)$. We define $succ(\bot) = \bot$.

To build the graph it is not necessary to store all the edges explicitly. If $r \to r_1, r \to r_2 \in E$ are two edges of the constraint graph and $r_1 \preceq r_2$, then for all nodes $r_3$ such that $r_1 \preceq r_3 \preceq r_2$ we have $r \to r_3 \in E$ (see Lemma 5.1). This suggests that every node needs to have at most $2(k - q + 1)$ outgoing edges to the successive $k - q + 1$ layers. For each layer, the first edge goes into the occurrence with the lowest start position that satisfies the corresponding constraint and the second edge goes into the occurrence with the highest start position that satisfies the constraint. We call these two edges the low edge and the high edge, respectively. The remaining edges are implicitly determined.

Given a low edge $r \to r'$, where $r \in D_i$ and $r' \in D_{i+j}$, we denote $r'$ with $r.low_j$ and we say that $r \to r'$ is the low $j$-edge of $r$. Similarly, if $r \to r'$ is a high edge, we denote $r'$ with $r.high_j$, and we say that $r \to r'$ is the high $j$-edge of $r$. If a node $r$ has no outgoing $j$-edges for some $j$, we define $r.low_j = r.high_j = \bot$. We call the representation of the graph by means of low and high edges the implicit representation of the (explicit) constraint graph (see Figure 3). We denote the set of low and high edges with $\tilde{E}$.

The construction of the implicit graph can be done in $O(M)$ time and space if the occurrences of the $P_i$'s are output in increasing order by the pattern matching algorithms (which is usually the case).
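The way the implicit representation determines the remaining edges can be sketched in a few lines of Python. This is an illustration under assumptions that are not in the text: a layer is stored as a list of (start, end) pairs sorted by start position, low and high edges are stored as 0-based indices into the target layer, and `None` plays the role of $\bot$. The name `explicit_targets` is hypothetical.

```python
def explicit_targets(target_layer, low_index, high_index):
    """Expand a (low, high) pair of edges into all explicit edge targets.

    Every node of the target layer lying between the low and the high
    target (inclusive) is also the target of an edge, so a slice suffices.
    """
    if low_index is None or high_index is None:
        return []  # the node has no outgoing edges towards this layer
    return target_layer[low_index:high_index + 1]
```

With the layer of the ct occurrences of Table 2, a node whose low edge points to $r_{21}$ and whose high edge points to $r_{23}$ is therefore also connected to $r_{22}$.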
Under the assumption that the nodes in each layer are sorted, low and high edges between layers $D_i$ and $D_{i+h}$ can be determined in $O(m_i + m_{i+h})$ time by making two linear searches: the first search determines all the low edges and the second determines all the high edges. Algorithm 1 gives a description of the procedure. The two innermost for loops (lines 9-17 and 20-28) take $O(m_i + m_{i+h})$ time for each pair of layers $D_i$ and $D_{i+h}$ to add $O(m_i)$ edges. The search must be done for $\sum_{h=1}^{k-q+1} (k - h)$ pairs of layers. The overall time complexity in the worst case is therefore

$$ O\Bigl( \sum_{h=1}^{k-q+1} \sum_{i=1}^{k-h} (m_i + m_{i+h}) \Bigr) = O\Bigl( 2 \sum_{h=1}^{k-q+1} \sum_{i=1}^{k} m_i \Bigr) = O(2(k - q + 1)M) = O(M). $$

The total number of edges is bounded by $2(k - q + 1)M$, so the space requirement is also $O(M)$.

5.3 Some properties of the constraint graph

Given two nodes $r_1$ and $r_2$, the interval $[r_1, r_2]$ is the set of nodes $[r_1, r_2] = \{ r \mid r_1 \preceq r \preceq r_2 \}$. The nodes $r_1$ and $r_2$ are the endpoints of the interval. Given a node $r \in D_i$ and an integer $j$, the interval $int_j(r) = [r.low_j, r.high_j]$ is said to be induced by $r$. If $r$ has no outgoing $j$-edges, then $int_j(r) = [\bot, \bot] = \emptyset$. Of course, each node induces at most $k - q + 1$ non-empty intervals. The basic property of induced intervals is expressed by the following simple lemma.

Lemma 5.1 Let $r \in D_i$. For each $r' \in int_j(r)$, the constraint $C[x_i, x_{i+j}]$ is satisfied when instantiated with $(r, r')$.

Proof. By construction, the pairs $(r, r.low_j)$ and $(r, r.high_j)$ both satisfy $C[x_i, x_{i+j}]$, i.e. they satisfy (2). By hypothesis, $r.low_j \preceq r' \preceq r.high_j$. So we have $(r.low_j)_1 - (r)_2 \le (r')_1 - (r)_2 \le (r.high_j)_1 - (r)_2$. Therefore, the pair $(r, r')$ satisfies $C[x_i, x_{i+j}]$, too.

We define two relations over intervals: given $I_1 = [r_1, r_2]$ and $I_2 = [r_3, r_4]$, we say that $I_1$ is before $I_2$, written $I_1 \,B\, I_2$, if $r_2 \prec r_3$; we say that $I_1$ overlaps $I_2$, written $I_1 \,O\, I_2$, if $r_1 \preceq r_3 \preceq r_2 \preceq r_4$. According to this definition, the overlapping relation is neither symmetric nor transitive, but it is reflexive. The following lemma expresses an important property of the graph, which we make extensive use of in the development of our algorithms.

Algorithm 1 Constraint Graph Construction
Require: occurrences in each $D_i$ must be sorted by increasing start position.
 1: Build-Constraint-Graph($\{|P_i|, e_i, D_i\}_{1 \le i \le k}$, $\{[l_i, u_i]\}_{1 \le i \le k-1}$, $q$)
 2: Let $(D_1, \dots, D_k)$ be the (layered) set of nodes of the graph
 3: $\tilde{E} \leftarrow \emptyset$
 4: for $i \leftarrow 1$ to $k - 1$ do {For each layer but the last}
 5:   for $h \leftarrow 1$ to $\min\{k - i, k - q + 1\}$ do {and for each reachable layer}
 6:     Let $L$ and $U$ be the lower and upper bound, respectively, on the relative distance between two occurrences of $P_i$ and $P_{i+h}$
 7:     {Determine low edge target}
 8:     $l \leftarrow 1$ {Low edge pointer}
 9:     for $j \leftarrow 1$ to $m_i$ do {For each node in layer $D_i$}
10:       while $l \le m_{i+h}$ and $s_{(i+h)l} - f_{ij} < L$ do
11:         $l \leftarrow l + 1$
12:       end while
13:       if $l \le m_{i+h}$ and $s_{(i+h)l} - f_{ij} \le U$ then
14:         {Add low edge to $(i + h)$th layer}
15:         $\tilde{E} \leftarrow \tilde{E} \cup \{(s_{ij}, f_{ij}) \to (s_{(i+h)l}, f_{(i+h)l})\}$
16:       end if
17:     end for
18:     {Determine high edge target}
19:     $u \leftarrow m_{i+h}$ {High edge pointer}
20:     for $j \leftarrow m_i$ downto 1 do
21:       while $u > 0$ and $s_{(i+h)u} - f_{ij} > U$ do
22:         $u \leftarrow u - 1$
23:       end while
24:       if $u > 0$ and $s_{(i+h)u} - f_{ij} \ge L$ then
25:         {Add high edge to $(i + h)$th layer}
26:         $\tilde{E} \leftarrow \tilde{E} \cup \{(s_{ij}, f_{ij}) \to (s_{(i+h)u}, f_{(i+h)u})\}$
27:       end if
28:     end for
29:   end for
30: end for
31: return the implicit constraint graph $G = ((D_1, \dots, D_k), \tilde{E})$
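A Python transcription of Algorithm 1 may help in following the two linear scans. It is a sketch under assumptions that go beyond the pseudocode: indices are 0-based, the bounds $L$ and $U$ are computed from constraint (2), and the low and high edges are stored in two dictionaries keyed by (layer index, node index, h), with a missing key standing for $\bot$. All function and variable names are hypothetical; the sample data are the occurrences of Table 2 with $q = 4$.

```python
def bounds(i, h, intervals, lengths, errors):
    """Lower and upper bound of constraint (2) for patterns i and i+h (0-based)."""
    lower = sum(intervals[r][0] for r in range(i, i + h))
    upper = intervals[i][1] + sum(
        lengths[r] + errors[r] + intervals[r][1] for r in range(i + 1, i + h)
    )
    return lower, upper


def build_constraint_graph(layers, intervals, lengths, errors, q):
    """Return (low, high): the implicit edges, as indices into the target layer."""
    k = len(layers)
    low, high = {}, {}
    for i in range(k - 1):                                  # each layer but the last
        for h in range(1, min(k - 1 - i, k - q + 1) + 1):   # each reachable layer
            lower, upper = bounds(i, h, intervals, lengths, errors)
            src, dst = layers[i], layers[i + h]
            ptr = 0                                         # low edge pointer
            for j, (_, f) in enumerate(src):
                while ptr < len(dst) and dst[ptr][0] - f < lower:
                    ptr += 1
                if ptr < len(dst) and dst[ptr][0] - f <= upper:
                    low[(i, j, h)] = ptr
            ptr = len(dst) - 1                              # high edge pointer
            for j in range(len(src) - 1, -1, -1):
                f = src[j][1]
                while ptr >= 0 and dst[ptr][0] - f > upper:
                    ptr -= 1
                if ptr >= 0 and dst[ptr][0] - f >= lower:
                    high[(i, j, h)] = ptr
    return low, high


# Occurrences of Table 2 for tgt[4, 20]ct[2, 5]gg[0, 1]ta[2, 15]aca.
LAYERS = [
    [(1, 4), (12, 15)],
    [(8, 10), (15, 17), (24, 26), (28, 30)],
    [(19, 21), (22, 24), (32, 34), (37, 39)],
    [(35, 37)],
    [(40, 43), (44, 47), (49, 52)],
]
INTERVALS = [(4, 20), (2, 5), (0, 1), (2, 15)]
LOW, HIGH = build_constraint_graph(LAYERS, INTERVALS, [3, 2, 2, 2, 3], [0] * 5, q=4)
```

Neither pointer ever moves backwards within a pair of layers, which is what keeps each scan linear; as in the pseudocode, this relies on the assumption that no occurrence is contained in another.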

[Figure 2: The explicit constraint graph for $q = 4$, derived from the example in Table 2. Layers: $D_1 = \{r_{11}, r_{12}\}$, $D_2 = \{r_{21}, r_{22}, r_{23}, r_{24}\}$, $D_3 = \{r_{31}, r_{32}, r_{33}, r_{34}\}$, $D_4 = \{r_{41}\}$, $D_5 = \{r_{51}, r_{52}, r_{53}\}$.]

[Figure 3: The implicit constraint graph, where non-feasible nodes have been grayed. When a low and a high edge of a node coincide, only one edge has been drawn. The layers are the same as in Figure 2.]

Lemma 5.2 Suppose that no match of a component pattern is contained in another. Let $r_1$ and $r_2$ be two nodes of the constraint graph such that $r_1 \preceq r_2$. If, for a given $j$, $int_j(r_1) \ne \emptyset$, then either $int_j(r_1) \,O\, int_j(r_2)$ or $int_j(r_1) \,B\, int_j(r_2)$.

Proof. If $r_1 = r_2$ then $int_j(r_1) \,O\, int_j(r_2)$ for any $j$. So, suppose that $r_1 \prec r_2$. If $int_j(r_2) = \emptyset$, then $int_j(r_1) \,B\, int_j(r_2)$. Otherwise, by hypothesis, we have

$L \le (r_1.low_j)_1 - (r_1)_2 \le (r_1.high_j)_1 - (r_1)_2 \le U$ and
$L \le (r_2.low_j)_1 - (r_2)_2 \le (r_2.high_j)_1 - (r_2)_2 \le U$,

where $L$ and $U$ are computed as specified by (2). Let $r_1 = (s_1, f_1)$, $r_2 = (s_2, f_2)$, $r_1.low_j = (s_3, f_3)$, $r_1.high_j = (s_4, f_4)$, $r_2.low_j = (s_5, f_5)$ and $r_2.high_j = (s_6, f_6)$. The above inequalities can be rewritten as

$$ L \le s_3 - f_1 \le s_4 - f_1 \le U \quad \text{and} \tag{3} $$
$$ L \le s_5 - f_2 \le s_6 - f_2 \le U. \tag{4} $$

Since no match is contained in another, we also have $f_1 \le f_2$. Therefore, $L \le s_5 - f_1$. Since $(s_3, f_3)$ is, by hypothesis, the least occurrence satisfying the lower bound in (3), we must conclude that $s_3 \le s_5$, so $r_1.low_j \preceq r_2.low_j$. A similar reasoning applies to $s_4$ and $s_6$: from (3) and by using the hypothesis, we get $s_4 - f_2 \le U$. Since $(s_6, f_6)$ is the maximum occurrence satisfying the upper bound in (4), we must conclude that $s_4 \le s_6$, and so $r_1.high_j \preceq r_2.high_j$. Therefore, if $r_2.low_j \preceq r_1.high_j$, we have $int_j(r_1) \,O\, int_j(r_2)$, otherwise $int_j(r_1) \,B\, int_j(r_2)$.

Corollary 5.3 Suppose that no match of a component pattern is contained in another. Let $r_1$ and $r_2$ be two nodes of the constraint graph such that $r_1 \preceq r_2$. If $r_1$ has outgoing $j$-edges, then $r_1.low_j \preceq r_2.low_j$ and $r_1.high_j \preceq r_2.high_j$.

Correctness of Algorithm 1 is based on Corollary 5.3. It means that, in order to draw low (resp., high) $j$-edges, it is sufficient to scan the nodes located $j$ layers below from left to right (resp., from right to left), without ever going back, i.e. a linear search will do.

It is less obvious that the nodes whose edges go into a given node $r$ also form an interval. This is what the following lemma states.

Lemma 5.4 Suppose that no match of a component pattern is contained in another. Let $r_1$, $r_2$ and $r_3$ be three nodes such that $r_1 \preceq r_2 \preceq r_3$. If $int_j(r_1) \ne \emptyset$ and $int_j(r_1) \,O\, int_j(r_3)$, then $int_j(r_2) \ne \emptyset$, $int_j(r_1) \,O\, int_j(r_2)$ and $int_j(r_2) \,O\, int_j(r_3)$.

Proof. By Corollary 5.3, it is sufficient to prove that $int_j(r_2) \ne \emptyset$. Let $r_1 = (s_1, f_1)$, $r_2 = (s_2, f_2)$, $r_3 = (s_3, f_3)$ and $r_3.low_j = (s_4, f_4)$. By hypothesis, we have $L \le s_4 - f_3$, where $L$ has been computed according to the leftmost sum in (2). As the intervals induced by $r_1$ and $r_3$ overlap, $r_3.low_j \in int_j(r_1)$, so, by Lemma 5.1, $L \le s_4 - f_1$. Moreover, $f_1 \le f_2 \le f_3$, because no match is contained in another. Thus, $L \le s_4 - f_2$.
A similar reasoning allows us to prove that the upper bound is also satisfied. Then, r 2 must have a j-edge to r 3.low j, that is int j (r 2 ). Given a node r D i, let parent j (r) = { r r D i j r r E } be the set of nodes in layer D i j having an outgoing edge entering r (in the explicit graph). We define parent j (r) = if j i. We call parent j (r) the parent (j-)interval of r. The name is justified by the following corollary. Corollary 5.5 For any node r and integer j, parent j (r) is an interval. In a way similar to what we have done for low and high edges, one could also determine the endpoints of such intervals and store the corresponding edges delimiting parent intervals. It is not difficult to see that the properties we have proved for induced intervals hold for parent intervals, too. 5.4 How to output all the solutions Once the graph has been built, in order to get the solutions, i.e. all q-occurrences of the structured motif, it is sufficient to output all the paths (of the explicit 12

13 graph) whose length is at least q 1. Every such path corresponds to a valid match of the structured model. The problem of what to report when searching for complex patterns such as regular expressions has been faced by Myers et al. in [18]. In principle, the algorithms they propose can be applied also to the case of structured motifs (which are but a restricted form of regular expressions). Using the terminology of Myers et al., component patterns can be seen as the tagged subexpressions of the translation of the structured motif into a regular expression. The problem with the above mentioned approaches is that, in the worst case, the number of possible matches is exponential in the number k of component patterns. Although the constraint graph is, in practical cases, sparse and so computing all the paths is not impractical we can definitely do better than that. The idea is that we can give a suitable transformation of the graph as a convenient and compact output, which can be computed in time proportional to the size of the graph. Let us fix some terminology and notation. A source is a node that, in the explicit version of the graph, has no incoming edges. A leaf is a node that has no outgoing edges. We denote a path from r to r with r r. In the following, we always refer to paths in the explicit representation of the graph. Given a node r, let L r be the length of the longest path from a source to r, and let L r be the length of the longest path from r to a leaf (the length of a path being the number of its edges). A source is any node for which L r = 0. Similarly, a leaf is any node for which L r = 0. We say that a node is feasible if L r + L r q 1. A feasible node represents an occurrence of a component pattern that certainly belongs to an approximate or exact match to the structured motif. Conversely, a node that is not feasible corresponds to an occurrence that cannot belong to any solution. 
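The feasibility test can be illustrated on a small explicit graph. The following is a minimal sketch, not the linear-time procedure discussed in the rest of this section: it assumes that the explicit graph is available as a dictionary `succ` mapping every node to the list of its successors (a hypothetical representation chosen for illustration), computes the two longest-path lengths (from a source to a node, and from a node to a leaf) by memoized recursion, and keeps the nodes whose two lengths add up to at least q - 1.

```python
from functools import lru_cache


def feasible_nodes(succ, q):
    """Return the set of feasible nodes of an explicit layered DAG.

    succ maps every node (including leaves) to the list of its successors.
    A node is feasible if the longest path reaching it from a source plus
    the longest path leaving it towards a leaf has at least q - 1 edges.
    """
    # Build the predecessor lists from the successor lists.
    pred = {r: [] for r in succ}
    for r, targets in succ.items():
        for t in targets:
            pred[t].append(r)

    @lru_cache(maxsize=None)
    def longest_out(r):
        # Longest path (in edges) from r to a leaf; 0 for a leaf.
        return max((1 + longest_out(t) for t in succ[r]), default=0)

    @lru_cache(maxsize=None)
    def longest_in(r):
        # Longest path (in edges) from a source to r; 0 for a source.
        return max((1 + longest_in(p) for p in pred[r]), default=0)

    return {r for r in succ if longest_in(r) + longest_out(r) >= q - 1}


# Three layers: a1 -> b1 -> c1 spans all layers, a2 -> b2 stops early.
example = {"a1": ["b1"], "a2": ["b2"], "b1": ["c1"], "b2": [], "c1": []}
print(sorted(feasible_nodes(example, 3)))
print(sorted(feasible_nodes(example, 2)))
```

With q = 3 only the nodes on the full path are kept, while with q = 2 the shorter path a2 -> b2 becomes acceptable as well. The recursion depth is bounded by the number of layers, so plain recursion is adequate for a sketch of this kind.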
If we are able to modify the low and high edges so that they always point to feasible nodes, that is, if we are able to shrink each induced interval $[r_1, r_2]$ to a maximal subinterval $[r'_1, r'_2]$ such that $r'_1$ and $r'_2$ are feasible nodes, then the solution can be implicitly characterized by the subgraph of the modified graph restricted to the set of feasible nodes. Strictly speaking, this is true only if $q = k$: if $q < k$, not every path from a source to a leaf necessarily spells a valid match, so some further manipulation of the graph is required.

The first problem to solve is how to compute $L^{\mathrm{in}}_r$ and $L^{\mathrm{out}}_r$ (the lengths of the longest paths from a source to $r$ and from $r$ to a leaf, respectively). A naïve approach consists in propagating the values from one layer to another following induced intervals or parent intervals, respectively, i.e. in making a breadth first visit. But this is quadratic in the number of nodes. However, we can exploit the fact that valid paths have a limited range of lengths to avoid unnecessary updates. For a feasible $r \in D_i$, it must be $\max\{0,\ i - 1 - (k - q)\} \le L^{\mathrm{in}}_r \le i - 1$ and $\max\{0,\ q - i\} \le L^{\mathrm{out}}_r \le k - i$. We call such values the feasible values for nodes in layer $D_i$. Now, suppose that all the lengths are initialized to zero. The values associated with a node need to be updated only if they can receive a greater and feasible value. If a node $r$ is assigned only feasible values, $L^{\mathrm{in}}_r$ and $L^{\mathrm{out}}_r$ must be updated at most a constant number of times, namely $k - q + 1$ times, because this is the number of different feasible values for each length.

Algorithm 2 loops through all pairs of connected layers to compute the values $L^{\mathrm{out}}_r$ bottom up. For each layer $D_i$, starting from the last layer but one and going upwards, the adjacency sets of nodes $r \in D_i$ are examined and the maximum length value is determined. The adjacency sets are scanned by going through layers $D_{i+1}, \dots, D_{i+k-q+1}$ sequentially. For each pair $D_i$ and $D_{i+j}$ the procedure updates $L^{\mathrm{out}}_r$ for each $r \in D_i$ if necessary, that is, if the maximum of the length values in $\mathrm{int}_j(r)$ is feasible and leads to a value greater than the current $L^{\mathrm{out}}_r$ (lines 15-16). Nodes in layer $D_{i+j}$ are scanned from left to right, and two pointers are kept: rightend is the first node in $D_{i+j}$ not already scanned, and rmax is the rightmost node (i.e., the maximum node with respect to $\preceq$) in the interval $\mathrm{int}_j(r)$ having the maximum length value if such value is feasible, otherwise it coincides with rightend. Of course, it is always rmax $\preceq$ rightend.

The inner if clause (lines 10-14) avoids redundant computation by skipping the interval of already scanned nodes whose maximum length value has already been computed. By doing so, nodes less than rmax are no longer taken into consideration. Let $I_1$ and $I_2$ be two induced intervals such that $I_1\ O\ I_2$, and suppose that max and rmax have been computed for $I_1$. If rmax $\in I_2$, then the maximum length value in $I_1 \cap I_2$ is max, so we only need to check the nodes in $I_2 \setminus I_1$. If rmax $\notin I_2$, we must scan the nodes of $I_2$ from succ(rmax) onwards again, but we know that the maximum value in $I_1 \cap I_2$ must be strictly less than max. So, if rmax for $I_2$ happens to be in $I_1 \cap I_2$, then the corresponding max must be less than the previous value. This guarantees that a node is never scanned more than $k - q + 1$ times. Thus, for each pair of layers $D_i$ and $D_{i+j}$, the time needed to update the values in $D_i$ is $O(m_i + (k - q + 1)\, m_{i+j})$. This work must be done for $\sum_{j=1}^{k-q+1} (k - j)$ pairs of layers. Hence, the total time required by Compute-$L^{\mathrm{out}}$ is $O\bigl(\sum_{j=1}^{k-q+1} \sum_{i=1}^{k-j} (m_i + (k - q + 1)\, m_{i+j})\bigr) = O(M)$.

The values $L^{\mathrm{in}}_r$ can be computed by a symmetric procedure using parent intervals instead of induced intervals. However, since storing parent intervals doubles the required space for the graph, one can also devise a procedure propagating $L^{\mathrm{in}}_r$ from each $r$ to its induced intervals.
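The $O(M)$ bound on the total running time of Algorithm 2 can be checked directly. The derivation below is a sketch under two assumptions: that $m_i$ denotes the number of nodes in layer $D_i$ with $M = \sum_{i=1}^{k} m_i$, and that $k - q + 1$ is treated as a constant. For a fixed $j$, both $\sum_{i=1}^{k-j} m_i$ and $\sum_{i=1}^{k-j} m_{i+j}$ are at most $M$, so

```latex
\[
\sum_{j=1}^{k-q+1} \sum_{i=1}^{k-j} \bigl( m_i + (k-q+1)\, m_{i+j} \bigr)
\;\le\; \sum_{j=1}^{k-q+1} \bigl( M + (k-q+1)\, M \bigr)
\;=\; (k-q+1)(k-q+2)\, M .
\]
```

The factor $(k-q+1)(k-q+2)$ depends only on the number of component patterns that may be missing, which is why the bound is linear in $M$ once that number is fixed.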
Algorithm 3 shows how this can be done in linear time when $q = k$. The extension to the general case $q \le k$ requires the use of $k - q + 1$ pointers to nodes instead of only one, but the principle is the same. At the $i$th iteration, the variable ptr contains the first node in $D_{i+1}$ that has not been scanned yet. Such a pointer is used to avoid assignments to already examined nodes (lines 9-10). Thus, the time complexity of Compute-$L^{\mathrm{in}}$ is $O(M)$. As before, allowing $q < k$ only adds a constant factor to such a bound.

Algorithm 2 Compute $L^{\mathrm{out}}_r$ (the length of the longest path from $r$ to a leaf) for all nodes $r$
 1: Compute-$L^{\mathrm{out}}$($G$)
 2: for all nodes $r \in \bigcup_{i=1}^{k} D_i$ do
 3:     $L^{\mathrm{out}}_r \leftarrow 0$
 4: end for
 5: for $i \leftarrow k - 1$ to $1$ do {For each layer but the last, in decreasing order}
 6:     for $j \leftarrow 1$ to $\min\{k - i,\ k - q + 1\}$ do {and for each next feasible layer}
 7:         rmax $\leftarrow$ rightend $\leftarrow$ the first node in $D_{i+j}$
 8:         for all nodes $r \in D_i$, in increasing order, do
 9:             if $\mathrm{int}_j(r) \ne \emptyset$ then {If $r$ has $j$-edges}
10:                 if rmax $\in \mathrm{int}_j(r)$ then
11:                     let max be the maximum among $L^{\mathrm{out}}_{\mathrm{rmax}}$ and the values of the nodes of $\mathrm{int}_j(r)$ from rightend onwards
12:                 else
13:                     let max be the maximum among the values in $\mathrm{int}_j(r)$
14:                 end if
15:                 if max is a feasible value then
16:                     $L^{\mathrm{out}}_r \leftarrow \max\{L^{\mathrm{out}}_r,\ \mathrm{max} + 1\}$
17:                     rmax $\leftarrow$ the rightmost node $r' \in \mathrm{int}_j(r)$ such that $L^{\mathrm{out}}_{r'} = \mathrm{max}$
18:                 else {Skip non feasible nodes}
19:                     rmax $\leftarrow \mathrm{succ}(r.\mathrm{high}_j)$
20:                 end if
21:                 rightend $\leftarrow \mathrm{succ}(r.\mathrm{high}_j)$
22:             end if
23:         end for
24:     end for
25: end for

Algorithm 3 Compute $L^{\mathrm{in}}_r$ (the length of the longest path from a source to $r$) for all nodes $r$ when $q = k$
Require: $q = k$
 1: Compute-$L^{\mathrm{in}}$($G$)
 2: for all nodes $r \in \bigcup_{i=1}^{k} D_i$ do
 3:     $L^{\mathrm{in}}_r \leftarrow 0$
 4: end for
 5: for $i \leftarrow 1$ to $k - 1$ do {For all layers, but the last, from top to bottom}
 6:     Let ptr be the first node in layer $D_{i+1}$
 7:     for all nodes $r \in D_i$, in increasing order, do
 8:         if $L^{\mathrm{in}}_r = i - 1$ then {If $r$ has a feasible length value}
 9:             for all nodes $r' \in \mathrm{int}_1(r)$ not preceding ptr do
10:                 $L^{\mathrm{in}}_{r'} \leftarrow i$
11:             end for
12:             ptr $\leftarrow \mathrm{succ}(r.\mathrm{high}_1)$
13:         end if
14:     end for
15: end for

After having computed path lengths, the graph can be restricted to feasible nodes. If a low edge enters a node that is not feasible, that low edge must be moved to the right, if possible. The same holds for a high edge, the only difference being that it must be moved to the left. In other words, we have to restrict the interval induced by a node to its maximal subinterval having feasible endpoints. Moving a low (resp., high) edge from a node to its successor (resp., predecessor) corresponds to deleting one edge in the explicit constraint graph. Such an edge connects two occurrences of component patterns that locally satisfy their distance constraint, but one of them does not belong to any match of the structured motif.

Algorithm 4 shows how this operation can be done for low edges. When a layer is processed and a feasible node $r$ is encountered, all low edges pointing to non-feasible nodes before $r$ are moved to target $r$. Such low edges can be easily retrieved if $k - q + 1$ pointers are maintained, each one performing a linear scan of one of the $k - q + 1$ layers above the current layer. The overall time complexity is therefore $O\bigl(\sum_{i=2}^{k} (m_i + \sum_{j=1}^{k-q+1} m_{i-j})\bigr) = O(M)$. High edges can be processed by a symmetric procedure, examining nodes in decreasing order.

Algorithm 4 Shrink induced intervals to have feasible left endpoints
 1: Adjust-Low-Edges($G$)
 2: for $i \leftarrow 2$ to $k$ do {For all layers but the first, in increasing order}
 3:     Let ptr be the first node in $D_i$
 4:     for all nodes $r \in D_i$, in increasing order, do
 5:         if $r$ is feasible then
 6:             for all induced intervals $[r_1, r_2]$ with ptr $\preceq r_1 \preceq r \preceq r_2$ do
 7:                 replace $[r_1, r_2]$ with $[r, r_2]$
 8:             end for
 9:             ptr $\leftarrow \mathrm{succ}(r)$
10:         end if
11:     end for
12:     change all remaining induced intervals to the empty interval
13: end for

[Figure 4: The graph of Figure 3 restricted to feasible nodes. Layers: $D_1 = \{r_{11}, r_{12}\}$, $D_2 = \{r_{22}, r_{24}\}$, $D_3 = \{r_{32}, r_{33}\}$, $D_4 = \{r_{41}\}$, $D_5 = \{r_{51}, r_{52}, r_{53}\}$.]

Let $\tilde{G} = ((\tilde{D}_1, \dots, \tilde{D}_k), \tilde{E})$ be the implicit constraint graph after the execution of the previous algorithms, where each $\tilde{D}_i \subseteq D_i$ is the subset of feasible nodes of $D_i$. If $q = k$, then every node in each layer induces a non-empty interval in the next layer. Therefore, all (explicit) paths starting from a node in the first layer reach a node in the last layer traversing all layers, i.e. each path represents a valid match of the structured motif. So, $\tilde{G}$ is a compact representation of all the solutions, which can be given as a suitable output.

If $q < k$, two situations must be considered. First, an edge can join two feasible nodes even though the corresponding occurrences do not necessarily belong to the same match of the structured motif (as for $r_{12} \to r_{32}$ in Figure 4). Such an edge must be deleted. Second, although a valid path always exists that passes through a feasible node, not all the paths are necessarily long enough (see Figure 4). To detect these situations a visit of the graph is done. The kind of visit we need is a slightly modified depth first search, described in Algorithm 5. When a node belongs to a path that is not long enough (e.g., in Figure 4, $r_{33}$ belongs to $r_{12} \to r_{33} \to r_{51}$) but its path edges should not be deleted because they are part of feasible paths (e.g., $r_{12} \to r_{33}$ belongs to $r_{12} \to r_{33} \to r_{41} \to r_{51}$ and $r_{33} \to r_{51}$ belongs to $r_{12} \to r_{24} \to r_{33} \to r_{51}$), that node must be cloned in order to eliminate the unfeasibility. Since the path lengths are in a limited range, a node must be duplicated at most a constant number of times.

Let $G' = (V', E')$ be the graph produced by Constraint-Graph-Visit. The nodes in $G'$ are pairs $(r, l)$ where $r$ is a node of $\tilde{G}$ and $l$ is the length of a path reaching $r$ from a source. When a node $r$ is visited through a path of length $l$, a node $r'$ adjacent to $r$ is visited only if the current path can be extended to a feasible path by going through $r'$ (line 4 in procedure Visit) and if $r'$ has not yet been reached by a path of length $l + 1$ (lines 5-6). So, a node can be visited (and then duplicated) as many times as the number of paths with different lengths that enter that node, that is at most $k - q + 1$ times. Therefore, the number of nodes in the output graph is still $O\bigl(\sum_{i=1}^{k} |\tilde{D}_i|\bigr)$ (which is $O(M)$, although, in typical practical situations, $\sum_{i=1}^{k} |\tilde{D}_i| \ll M$), and the time complexity of the visit is asymptotically the same as a standard depth first search, that is, in the worst case, $O\bigl((\sum_{i=1}^{k} |\tilde{D}_i|)^2\bigr)$. This is the only quadratic step in our method. Finally, Algorithm 5 reduces to a standard depth first search of $\tilde{G}$ when $q = k$.

The conditional statement in line 4 of procedure Visit avoids following edges that lead to paths that are necessarily too short, so it allows all non-feasible paths to be eliminated. It can be easily verified that, for every path $r_1 \to r_2 \to \dots \to r_n$ from a source to a leaf of (the explicit version of) $\tilde{G}$, there are two possibilities: either $n < q$, in which case there is no corresponding path in the output graph, or $n \ge q$, in which case there is a path $(r_1, 0) \to (r_2, 1) \to \dots \to (r_n, n - 1)$ in the output graph. Figure 5 shows the output of our running example.
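Algorithm 5 Visit of the modified graph restricted to feasible nodes
 1: Constraint-Graph-Visit($\tilde{G}$)
 2: $V' \leftarrow E' \leftarrow \emptyset$
 3: for all sources $s$ do
 4:     Visit($s$, 0)
 5: end for
 6: return $(V', E')$

 1: Visit($r$, $l$)
 2: $V' \leftarrow V' \cup \{(r, l)\}$
 3: for all nodes $r' \in \bigcup_{j=1}^{k-q+1} \mathrm{int}_j(r)$ do {Explore $r$'s adjacency set}
 4:     if $l + 1 + L^{\mathrm{out}}_{r'} \ge q - 1$ then {If there is a feasible path through $r'$; $L^{\mathrm{out}}_{r'}$ is the length of the longest path from $r'$ to a leaf}
 5:         if $(r', l + 1) \notin V'$ then
 6:             Visit($r'$, $l + 1$)
 7:         end if
 8:         $E' \leftarrow E' \cup \{(r, l) \to (r', l + 1)\}$
 9:     end if
10: end for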
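The visit can be sketched in Python as follows. This is a minimal sketch under assumptions that are not part of the original text: the explicit adjacency sets are given as a dictionary `adj`, the lengths of the longest paths to a leaf as a dictionary `l_out`, and the test in line 4 of Visit is read as "the path through the adjacent node can still reach length q - 1", i.e. `l + 1 + l_out[t] >= q - 1`. All names are hypothetical.

```python
def constraint_graph_visit(adj, l_out, sources, q):
    """Return (nodes, edges) of the output graph.

    Output nodes are pairs (r, l), where l is the length of a path that
    reaches r from a source; output edges are pairs of such nodes.
    """
    nodes, edges = set(), set()

    def visit(r, l):
        nodes.add((r, l))
        for t in adj[r]:
            # Follow r -> t only if the extended path can still be long enough.
            if l + 1 + l_out[t] >= q - 1:
                if (t, l + 1) not in nodes:
                    visit(t, l + 1)
                edges.add(((r, l), (t, l + 1)))

    for s in sources:
        visit(s, 0)
    return nodes, edges


# Four layers, q = 3: a is in layer 1, b and s in layer 2, c in 3, d in 4.
adj = {"a": ["b"], "b": ["c"], "s": ["c", "d"], "c": ["d"], "d": []}
l_out = {"a": 3, "b": 2, "s": 2, "c": 1, "d": 0}
nodes, edges = constraint_graph_visit(adj, l_out, ["a", "s"], q=3)
print(sorted(nodes))
print(sorted(edges))
```

In this toy graph the edge from s to d is not followed, because the path s, d has only one edge, while c and d are each duplicated once because they are reached by paths of two different lengths.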

[Figure 5: The output graph derived from the visit of the graph in Figure 4.]

6 Experimental results

We have written a program, called SMaRTFinder³. We have implemented an exact pattern matching algorithm using Kurtz's implementation of suffix trees [9], and Sellers' algorithm for approximate pattern matching ([10]). Both algorithms can work with the IUPAC alphabet⁴. The program is written in standard C++ (apart from Kurtz's code, which is in C). Since Sellers' algorithm is not optimal (its running time is proportional to the product of the pattern length and the text length), we did not include it in our tests. We plan to use more efficient algorithms in the future.

The following test was executed on a PowerPC G4 400 MHz machine with 384 MB RAM running Mac OS X. The program was compiled with g++ v3.3 with the option -O3. A set of 1000 structured models over $\Sigma_{\mathrm{IUPAC}}$ was randomly generated by randomly choosing, for each model, the number $k \in [3, 8]$ of component patterns, the length $l \in [5, 10]$ of each component pattern and $k - 1$ subintervals of $[0, 100]$ as gaps. We measured the performance of SMaRTFinder over such a sample by processing a 5 Mb DNA sequence belonging to chromosome I of Arabidopsis thaliana to search for $k$-occurrences of the structured motifs with no errors. Since Kurtz's suffix trees can be built either lazily or eagerly, we tested both cases. We then ran Anrep [16], compiled with gcc v3.3 with flag -O3, over the same set of patterns and the same sequence.

The results of this experiment are shown in Figures 6 and 7. Two considerations can be made. First of all, the time needed to build the suffix tree lazily is quickly amortized, so when many searches are performed on the same sequence there is no big difference between having a pre-constructed suffix tree and building it incrementally. Second, SMaRTFinder outperforms Anrep in most cases, and it has a much more stable linear behaviour.
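The random sample of structured models described above can be reproduced along the following lines. This is a minimal sketch under assumptions that the text does not fix: all values are drawn uniformly, each component pattern gets its own length, a gap is obtained by sorting two random endpoints, and the alphabet consists of the fifteen IUPAC nucleotide codes. The function names are hypothetical; the textual form produced by `format_model` follows the notation of Figure 8.

```python
import random

# The fifteen IUPAC nucleotide codes (four bases plus ambiguity codes).
IUPAC = "ACGTRYSWKMBDHVN"


def random_structured_model(rng):
    """Draw one structured model as (patterns, gaps)."""
    k = rng.randint(3, 8)  # number of component patterns
    patterns = [
        "".join(rng.choice(IUPAC) for _ in range(rng.randint(5, 10)))
        for _ in range(k)
    ]
    # k - 1 gaps, each a subinterval [lo, hi] of [0, 100].
    gaps = [
        tuple(sorted((rng.randint(0, 100), rng.randint(0, 100))))
        for _ in range(k - 1)
    ]
    return patterns, gaps


def format_model(patterns, gaps):
    """Write a model as pattern[lo,hi]pattern[lo,hi]...pattern."""
    parts = [patterns[0]]
    for (lo, hi), pattern in zip(gaps, patterns[1:]):
        parts.append(f"[{lo},{hi}]{pattern}")
    return "".join(parts)


rng = random.Random(2003)  # fixed seed so that the sample is repeatable
sample = [random_structured_model(rng) for _ in range(1000)]
print(format_model(*sample[0]))
```

Passing an explicit `random.Random` instance keeps the sample repeatable, which matters when the same set of models has to be fed to two different programs.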
The running time of Anrep very strongly depends on the statistical technique used to determine the best backtrack order of the search (see [17]). In cases where such a strategy is effective (for instance, when the structured motif has one long rare component pattern and the remaining patterns are very short) Anrep produces better results, but Figures 6 and 7 show that it is usually much slower than our program.

³ where SM stands for structured motif, and RT is a reminder of our motivating problem, i.e. finding retrotransposons.
⁴ Actually, they can work with any predefined scheme for comparing characters.

[Figure 6: Eager SMaRTFinder compared with Anrep (axes: time in seconds, number of matches).]

[Figure 7: Lazy SMaRTFinder compared with Anrep (axes: time in seconds, number of matches).]

As a second experiment, we processed the whole genome of Arabidopsis thaliana (see [6]) searching for 4-, 5- and 6-occurrences of the structured model shown in Figure 8, which we had obtained by the multiple alignment of several BAC clones of the rice genome (Oryza sativa). Arabidopsis thaliana was chosen because it is a well studied and annotated genome. This test was done on a Pentium IV 1.6 GHz machine with 512 MB RAM running Linux. Table 3 shows the results, averaged over 10 trials. It can be noticed that allowing missing patterns does not affect the running time too much. The whole genome can be processed in less than 100 seconds, and this time reduces by a factor of 10 if the suffix trees of the sequences are already available.

TNGA[12,14]TWNYTNNA[19,21]TNTMYRT[4,6]WNCCNNNNRG[72,95]TGNNA[100,125]TNTANRTNRAYGA

Figure 8: A very well conserved feature of a Copia retrotransposon.

The best qualitative results were obtained allowing one missing pattern, the number of found occurrences being of the same order of magnitude as other results in the literature (e.g., [6]). Besides, in that case we found no false positives, i.e. all found elements we checked are annotated as retroelements, or at least as genes, in current databases. The reason for that lies, of course, in the good quality of the structured model we used. We can draw the following conclusion: when patterns are specified in the IUPAC alphabet and we can combine a certain number of them into a structured motif, the search can be effective even if we do not allow errors inside each pattern (or we allow a small number of errors, e.g. one error). Effectiveness arises from the combination of the patterns: in this perspective, allowing missing patterns is a valuable feature.

Sequence   Size       Eager: 6/6   5/6   4/6    Lazy: 6/6   5/6   4/6
chr1at     29 Mb
chr2at     19 Mb
chr3at     22.7 Mb
chr4at     16.9 Mb
chr5at     25.7 Mb

Table 3: Running time of SMaRTFinder for different kinds of search for the structured motif of Figure 8. Time is expressed in seconds. Times for chr1at in the eager case are missing, because the program went out of memory. Nonetheless, the sequence could be processed lazily.

7 Conclusions

Our aim was to develop an algorithm to search whole genomes for structured motifs, with the ability to identify partial matches. We believe that the constraint graph provides the ground for a non-trivial user interface for an evaluation of the results of the search. We are currently developing a graphical interface for presenting the output to the user in an attractive and meaningful way.


More information

Lexical Analysis. Reinhard Wilhelm, Sebastian Hack, Mooly Sagiv Saarland University, Tel Aviv University.

Lexical Analysis. Reinhard Wilhelm, Sebastian Hack, Mooly Sagiv Saarland University, Tel Aviv University. Lexical Analysis Reinhard Wilhelm, Sebastian Hack, Mooly Sagiv Saarland University, Tel Aviv University http://compilers.cs.uni-saarland.de Compiler Construction Core Course 2017 Saarland University Today

More information
