Structured Motifs Search


Technical Report UDIMI/15/2003/RR

Michele Morgante (1), Alberto Policriti (2), Nicola Vitacolonna (2) and Andrea Zuccolo (1)

University of Udine
(1) Department of Crop Sciences and Agricultural Engineering
(2) Department of Mathematics and Computer Science
via delle Scienze 206, Udine (Italy)

19th September 2003

Abstract

In this paper we describe an algorithm for the localization of structured models, i.e. sequences of (simple) motifs and distance constraints. It combines standard pattern matching procedures with a constraint satisfaction solver, and it has the ability, not present in similar tools, to search for partial matches. A significant feature of our approach, especially in terms of efficiency for the application context, is that the (potentially) exponentially many solutions to the considered problem are represented in compact form as a graph. Moreover, the time and space necessary to build the graph are linear in the number of occurrences of the component patterns.

1 Introduction

A relevant task of computational biology consists in helping identify conserved features in a set of DNA or protein sequences. This definition actually covers a range of problems, from isolating previously unknown domains to finding occurrences of some kind of consensus model that has already been determined. In this paper, we assume that domains, signals, consensus models and so on can always be represented in a particular form which we will call, following the terminology used by other authors, structured motifs (or structured models). These can be thought of as compound patterns made of a list of motifs, or patterns, and a list of intervals that specify at what distances adjacent motifs should occur (see [4]).
To help distinguish the two main classes of problems mentioned above, we will say that a de novo identification of signals in sequences is a problem of structured motif extraction, while the problem of finding the positions at which a given model occurs in a sequence is a problem of structured motif localization. The former has been widely investigated and there exist elaborate proposals in the literature ([1, 2, 4, 13, 14, 20] is an incomplete list).

We will focus our attention on the problem of localization. Since, from a syntactic point of view, structured motifs are a special kind of regular expressions (see footnote 1), the naïve approach is to reduce the problem to a regular pattern matching problem ([4, 21]). However, given the simplified form structured motifs exhibit as regular expressions, this is not the best way to go. There are different approaches in the literature, notably [19] and [16, 17], which we will discuss in Section 3.

In this paper we describe an algorithm for the localization of structured models, which can be thought of as compound patterns made of a list of simple motifs and a list of intervals that specify at what distances adjacent motifs should occur (see [4]). Our algorithm combines standard pattern matching procedures with a constraint satisfaction solver whose peculiar form allows us to represent the (potentially) exponentially many solutions as a graph in an efficient way. The features of our proposal include flexibility (any exact or approximate pattern/regular expression matching algorithm or alignment procedure can be used to search for simple motifs), the ability to search for partial occurrences of a structured motif (i.e., the ones for which a limited number of component patterns are missing) and support for negative gaps (which imply potential overlapping of component patterns). As a byproduct, we obtain a data structure which can be used to store all relevant information about the occurrences of a structured model in a sequence.

The possibility to produce the solution set in compact form is based on a view of the occurrences of a given motif as an interval of positions in the searched string. A number of simple results follow from this view and allow the design of fast manipulation procedures. Moreover, thanks to this interval representation, the size of both the data structure and the output can be kept linear.
The paper is organized as follows: in Section 2 we describe the context in which our algorithm was developed and the purposes it serves; Section 3 describes related approaches to the problem; Section 4 defines the basic notions needed later; in Section 5 the algorithm is described in full detail; Section 6 reports some experimental results; finally, Section 7 summarizes what we have done and gives some hints for future work.

2 The application context

In many biological contexts, already available experimental data allow biologists to determine a more or less precise model of what they are searching for, for instance, by a multiple alignment of a set of homologous sequences. Such models are usually built from the analysis of a relatively small data set, and they are used to perform further searches in other sequences. Informally, such models can often be translated into structured motifs.

The biological problem we initially faced was to automatically find a particular kind of transposable elements called LTR retrotransposons ([11], [12]) in the rice genome. LTR retrotransposons are characterized by an approximate direct repeat (the so-called Long Terminal Repeats, LTRs for short) that varies in size from 100 base pairs to several kilobases. Therefore, one can try to locate retrotransposons by identifying such repeats. This can be done by programs like REPuter ([13]) or by more specific tools like LTR_STRUC ([15]).

Footnote 1: when distances between component patterns are nonnegative.

Sometimes, though, one or both LTRs are missing, so this approach cannot be pursued and it is necessary to search for conserved domains inside the retroelement. In one year of work, we have collected experimental data from about 10% of the rice genome, allowing us to localize several LTR retrotransposons and to establish, by multiple alignment, a number of conserved features ([3, 22]). For example, many retrotransposons belonging to the Ty1-copia group contain, in correspondence to the gene encoding the reverse transcriptase, a match of the following structured motif:

    MT[115, 136]MTNTAYGG[121, 151]GTNGAYGAY

which consists of three patterns (MT, MTNTAYGG and GTNGAYGAY) written in the IUPAC alphabet, and two intervals imposing constraints on the relative distances between adjacent patterns.

Typically, the component patterns of such structured motifs are from a few to a few hundred bases long, and gaps can span several dozen bases. Although patterns are usually specified using the IUPAC alphabet, an effective search requires allowing a positive (although in general very small) number of errors. But that is not enough: when searching for complex structured motifs it is likely that not all of their component patterns will be found in the searched sequence. Nonetheless, if enough patterns are present, or if the more important ones are found, we may say that an approximate occurrence of the structured motif has been found. This leads us to the possibility of missing patterns and to the introduction of weights modelling the different significance of patterns, together with some threshold check. In this paper, we limit ourselves to the case without weights.

3 Related work

Structured motifs are called Classes of Characters and Bounded Gaps (CBG) expressions in [19].
Although their definition is equivalent to the one used in this paper, the use of these expressions is quite different: in [19], the underlying motivation is giving an efficient algorithm for finding amino acid sequences in a database like PROSITE. A sequence of this kind is, for example, [RK]-x(2,3)-[DE]-x(2,3)-Y, where the brackets match any of the letters inside and x(2,3) is a gap of length between 2 and 3. In practice, the considered CBG expressions are usually not very long: the parts between gaps typically consist of a few characters and distances also span a few letters. In their experiments, Navarro et al. used patterns for which the maximum length of a match did not exceed the number of bits of a computer word (typically, 32 or 64). The algorithm given in [19] is very efficient in solving the problem of finding such patterns, by making use of bit parallelism to implement a non-deterministic ε-automaton. But their algorithm becomes less efficient as gaps get longer, because the size of the automaton increases accordingly. Also, bit operations are more costly when they have to be done on several computer words instead of one.

Myers et al. [16, 17] give an algorithm for finding network expressions with spacers, where a network expression is a regular expression without Kleene star and spacers are distance ranges between adjacent network expressions. Two

algorithms are described in [17]: the first is an approximate regular expression matching procedure based on the construction of an ε-automaton called an alignment graph; the second is a backtracking procedure with an optimization technique to determine the best (according to some statistical criterion) backtrack order. The algorithm allows negative gaps, e.g. $M_1[-5, 9]M_2$ means that the left end of an occurrence of motif $M_2$ must be found between 5 symbols to the left and 9 symbols to the right of the right end of an occurrence of motif $M_1$.

Like Myers et al., we use a separate subroutine to search for component motifs, but our approach differs from the one presented in [17] in that we do not use a backtracking strategy to find structured motifs; instead, we build and solve a constraint satisfaction problem. Constraint solving techniques have already been applied to the search of patterns in biosequences, e.g. [7]. However, as we only consider distance constraints, the CSP we obtain is simpler than most other proposals in the literature, and the method we use to solve it is more specialized.

4 Preliminary notions

Biologists and computer scientists use different terms to denote the same concepts. We will consider character, nucleotide, base and symbol to be equivalent, and we will use the term motif as a synonym for pattern. When we want to stress that we are talking of a motif/pattern which is part of a structured motif, we will talk of simple motifs or component patterns. For the sake of simplicity, we restrict ourselves to DNA, but what we say can be applied to RNA and proteins as well.

Let $\Sigma$ be a finite set of symbols, called the alphabet. A string (or sequence) $S = s_1 \cdots s_n$ over $\Sigma$ is a finite sequence of symbols in $\Sigma$. The length of $S$ is $|S| = n$. With $\Sigma^*$ we denote the set of all strings over $\Sigma$, including the empty string $\varepsilon$ (whose length is $|\varepsilon| = 0$). A language is a subset of $\Sigma^*$.
A string $F$ is a factor of another string $S = s_1 \cdots s_n$ if $F = s_i \cdots s_j$ for some $1 \le i \le j \le n$. The empty string $\varepsilon$ is a factor of any string. Throughout this paper, we fix $\Sigma_{DNA} = \{a, c, g, t\}$, i.e. the alphabet of DNA, and $\Sigma_{IUPAC} = \{A, C, G, T, R, Y, W, S, M, K, H, B, V, D, N\}$, i.e. the standard IUPAC alphabet used to denote nucleic acids. The interpretation of the different symbols in $\Sigma_{IUPAC}$ is shown in Table 1. We will assume that the text to be searched is a string over $\Sigma_{DNA}$.

Patterns can be viewed as specifications of languages. In the simplest case a pattern defines only one sequence of characters, i.e. the singleton set containing the string itself. Finite sets of strings can be specified in the standard IUPAC alphabet, e.g. AWTR defines the set {aata, aatg, atta, attg}, and by network expressions ([17]), which are obtained by concatenation and union of symbols, e.g. (ag|tt)c corresponds to {agc, ttc}. Another possibility consists in specifying an amino acid pattern to be searched for in a DNA sequence: for example, the polypeptide QF (a glutamine followed by a phenylalanine) defines the set of nucleotide sequences {caattt, caattc, cagttt, cagttc}.

Let $L(P)$ be the language generated by $P$. Given a sequence $S = s_1 \cdots s_n \in \Sigma_{DNA}^*$ and a pattern $P$ such that $L(P) \subseteq \Sigma_{DNA}^*$, we say that a factor $F$ of $S$ exactly matches $P$ if $F \in L(P)$. For instance, aata exactly matches the different patterns aata, AWTR and (aa|ag)t(a|c).
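As an illustration of the notion of exact match for IUPAC patterns, the following minimal Python sketch expands a pattern into the language it defines and tests a factor against it. The names `IUPAC`, `language` and `exactly_matches` are hypothetical helpers introduced only for this example; the symbol interpretation is the one given in Table 1.

```python
from itertools import product

# Interpretation of the IUPAC symbols over the DNA alphabet {a, c, g, t} (Table 1).
IUPAC = {
    "A": "a", "C": "c", "G": "g", "T": "t",
    "R": "ag", "Y": "tc", "W": "at", "S": "gc", "M": "ac", "K": "gt",
    "H": "atc", "B": "gct", "V": "gac", "D": "gat", "N": "gatc",
}


def language(pattern):
    """Return the finite set of DNA strings defined by an IUPAC pattern."""
    choices = (IUPAC[symbol] for symbol in pattern)
    return {"".join(bases) for bases in product(*choices)}


def exactly_matches(factor, pattern):
    """Test whether a factor belongs to L(pattern) without enumerating it."""
    return len(factor) == len(pattern) and all(
        base in IUPAC[symbol] for base, symbol in zip(factor, pattern)
    )


print(sorted(language("AWTR")))          # ['aata', 'aatg', 'atta', 'attg']
print(exactly_matches("aata", "AWTR"))   # True
```

Enumerating the language is exponential in the number of ambiguous symbols, so the position-by-position test is the one a search procedure would more plausibly use; the enumeration is shown only because it mirrors the set-based definition.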

Table 1: The IUPAC alphabet for nucleic acids.

    Symbol  Bases         Meaning
    A       {a}           adenine
    C       {c}           cytosine
    G       {g}           guanine
    T       {t}           thymine
    R       {a, g}        purine
    Y       {t, c}        pyrimidine
    W       {a, t}        adenine or thymine
    S       {g, c}        guanine or cytosine
    M       {a, c}        adenine or cytosine
    K       {g, t}        guanine or thymine
    H       {a, t, c}     adenine, thymine or cytosine
    B       {g, c, t}     guanine, cytosine or thymine
    V       {g, a, c}     guanine, adenine or cytosine
    D       {g, a, t}     guanine, adenine or thymine
    N       {g, a, t, c}  guanine, adenine, thymine or cytosine

To define inexact matching, suppose that a scoring function $\delta : \Sigma_{DNA}^* \times \Sigma_{DNA}^* \to \mathbb{R}$, expressing the degree of similarity between two sequences, is given. Then, we say that a factor $F$ of $S$ approximately matches a pattern $P$ within a fixed threshold $t \in \mathbb{R}$ if there exists $F' \in L(P)$ such that $\delta(F, F') \le t$. A common choice is to define $\delta(F, F')$ as the minimum number of insertions, deletions and substitutions to be performed in order to obtain $F'$ from $F$, that is, as the edit (or Levenshtein) distance between $F$ and $F'$.

A structured motif, or model, is a pair $(\mathcal{P}, \mathcal{I})$, where $\mathcal{P} = (P_1, \dots, P_k)$ is a sequence of patterns and $\mathcal{I} = ([l_1, u_1], \dots, [l_{k-1}, u_{k-1}])$, with $l_i, u_i \in \mathbb{Z}$ and $l_i \le u_i$ for $1 \le i < k$, is a sequence of (closed) intervals (see [4]). A structured motif is usually written as an expression of the form

$$P_1[l_1, u_1]P_2[l_2, u_2]P_3 \cdots P_{k-1}[l_{k-1}, u_{k-1}]P_k.$$

An exact match of $(\mathcal{P}, \mathcal{I})$ in a sequence $S$ is a $k$-tuple $(F_1, \dots, F_k)$ of factors of $S$ such that: $F_i$ matches $P_i$, for $1 \le i \le k$, possibly according to different matching criteria; and the distance $d_i$ between the end position of $F_i$ and the start position of $F_{i+1}$ is such that $l_i \le d_i \le u_i$, for $1 \le i < k$. Note that, although the structured motif is said to occur exactly, some or even all the component patterns may occur approximately.

If the structured motif has many component patterns, it is realistic to assume that some of them do not occur in the scanned sequence, at least at the right positions. It is important, then, to be able to search for partial occurrences.
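The edit-distance criterion for approximate matching defined above can be sketched as follows. This is the textbook dynamic-programming computation of the Levenshtein distance, not necessarily the procedure used by the authors; `edit_distance` and `approximately_matches` are hypothetical names, and the language of the pattern is assumed to be given as a finite set of strings.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete ca
                curr[j - 1] + 1,          # insert cb
                prev[j - 1] + (ca != cb)  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def approximately_matches(factor, language, t):
    """True if some string of the language is within distance t of factor."""
    return any(edit_distance(factor, word) <= t for word in language)


awtr = {"aata", "aatg", "atta", "attg"}        # L(AWTR)
print(edit_distance("aata", "aatg"))           # 1
print(approximately_matches("acta", awtr, 1))  # True
print(approximately_matches("acta", awtr, 0))  # False
```

Comparing a factor against every string of the language is only meant to mirror the definition; practical approximate matchers work on the pattern directly.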
To do that, we must impose constraints on non-adjacent patterns. We discuss how such constraints can be reasonably derived from the definition of a structured motif in Section 5.2. For now, we assume that for each pair of simple motifs $P_i$ and $P_j$ a constraint $C(i, j)$ on their relative distance can be defined. So, given

an integer $q$ such that $1 \le q \le k$, an approximate $q$-occurrence of $(\mathcal{P}, \mathcal{I})$ in a sequence $S$ is a $q$-tuple $(F_1, \dots, F_q)$ of factors of $S$ such that there is a sequence of indices $1 \le j_1 < \cdots < j_q \le k$ for which: $F_i$ matches $P_{j_i}$, for $1 \le i \le q$, possibly according to different matching criteria; and the distance $d_i$ between the end position of $F_i$ and the start position of $F_{i+1}$ is such that the constraint $C(j_i, j_{i+1})$ is satisfied, for $1 \le i < q$.

The problem discussed in Section 2 is formalized as follows: given a structured model $(\mathcal{P}, \mathcal{I})$ with $k$ component patterns, a sequence $S \in \Sigma_{DNA}^*$ and an integer $q$ such that $1 \le q \le k$, find all $q$-occurrences of $(\mathcal{P}, \mathcal{I})$ in $S$. In what follows, and in the implementation of our algorithm, we consider patterns over $\Sigma_{IUPAC}$, matched either exactly or within a given edit distance. This is no loss of generality, since our algorithm can be easily generalized. A partial justification of our choice is given in Section 6.

5 The algorithm

The algorithm we propose is a two-step procedure:

1. find the occurrences of all the component patterns of the structured model;
2. combine the occurrences that satisfy the distance constraints into a structured motif.

The first step reduces to using some standard pattern matching algorithm. The second step amounts to the definition of a Constraint Satisfaction Problem (CSP) over finite domains. The CSP is solved by building and manipulating a directed acyclic graph of the occurrences.

5.1 Searching for simple motifs

Since efficient algorithms exist for many of the pattern matching problems discussed in Section 4, we can start searching for all the component patterns of $(\mathcal{P}, \mathcal{I})$ independently, and then combine those occurrences which fit the model. Given a sequence $S = s_1 \cdots s_n$ and a structured motif $(\mathcal{P}, \mathcal{I})$, the first step of our algorithm is: find all the occurrences of $P_1, \dots, P_k$ in $S$ with edit distances $e_1, \dots, e_k$, respectively.
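For the exact case (all edit distances equal to zero), this first step can be sketched with a naive scan. This is only an illustration and not the suffix-tree search of the authors' implementation; the function name `find_occurrences` is hypothetical. Occurrences are reported as (start, end) pairs with a 1-based start and an end one position past the last matched character, and the sequence is the sample of Figure 1 transcribed in lower case.

```python
def find_occurrences(sequence, pattern):
    """Return all exact occurrences of pattern, sorted by start position.

    Each occurrence is a pair (start, end): start is 1-based and end is
    one position beyond the last matched character.
    """
    occurrences = []
    pos = sequence.find(pattern)
    while pos != -1:
        occurrences.append((pos + 1, pos + 1 + len(pattern)))
        pos = sequence.find(pattern, pos + 1)  # allow overlapping matches
    return occurrences


# Sample sequence of Figure 1, written over the DNA alphabet {a, c, g, t}.
SEQUENCE = "tgtaaaactggtgtctcaggaggctcgctccggataggaacaaacaaaacaaaaaa"

print(find_occurrences(SEQUENCE, "tgt"))  # [(1, 4), (12, 15)]
print(find_occurrences(SEQUENCE, "aca"))  # [(40, 43), (44, 47), (49, 52)]
```

Because the scan proceeds left to right, the output is already sorted by start position, which is the property the later graph construction relies on.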
We assume that the output of the search is the list, sorted by start position, of the occurrences of the $P_i$'s, where each occurrence is a pair $(s, f)$ representing its start and its end position. For convenience, we define the end position $f$ to be one character beyond the last character of a match. The length of an occurrence of $P_i$ can range from $|P_i| - e_i$ to $|P_i| + e_i$. Table 2 shows the result of searching for the component patterns of a structured motif in the sample sequence of Figure 1.

We say that a match of a component pattern $P_i$ is contained in another if there are two occurrences $(s, f)$ and $(s', f')$ of $P_i$ such that $s' \le s < f < f'$. A match to $P_i$ can be contained in another only if $P_i$ is searched for with $e_i > 0$. From now on, we assume that the search for simple motifs never produces an

occurrence that is contained in another. In particular, no two matches of the same pattern can have the same start position. The reason for this technical condition is explained in Section 5.3.

The complexity of this step depends on which kind of search is required. When $e_i = 0$ for all $i$, we may apply the Aho-Corasick algorithm, which runs in linear time. The same complexity can be achieved by preprocessing the sequence to build a suffix tree, which is what we have done in our implementation (see [10] for a thorough discussion of these topics). If $e_i > 0$ for some $i$, some approximate pattern matching procedure is needed. Multiple approximate string matching algorithms exist that achieve optimal average performance (see [8]). Finally, the restriction that the output be sorted usually does not require further computations, because most algorithms already produce a sorted output.

Figure 1: A sample DNA sequence. The parts of the sequence matching some component pattern of the structured model in Table 2 are shown with upper-case letters.

    TGTaaaaCTggTGTCTcaGGaGGCTcgCTccGGaTAGGaACAaACAaaACAaaaaa

Table 2: The result of the exact search for the component patterns of the structured model tgt[4, 20]ct[2, 5]gg[0, 1]ta[2, 15]aca in the sequence shown in Figure 1.

    Pattern  Occurrences
    tgt      r_11 = (1, 4)    r_12 = (12, 15)
    ct       r_21 = (8, 10)   r_22 = (15, 17)  r_23 = (24, 26)  r_24 = (28, 30)
    gg       r_31 = (19, 21)  r_32 = (22, 24)  r_33 = (32, 34)  r_34 = (37, 39)
    ta       r_41 = (35, 37)
    aca      r_51 = (40, 43)  r_52 = (44, 47)  r_53 = (49, 52)

5.2 Building the constraint graph

The more interesting part of our procedure concerns the way in which we can combine different occurrences of simple motifs into a valid occurrence of the structured model.
In order to do this, we must build and solve a constraint satisfaction problem (CSP), which, in general, consists of a finite sequence of variables $X = (x_1, \dots, x_k)$ with respective domains $D_1, \dots, D_k$, together with a finite set $C$ of constraints, each on a subsequence of $X$. We will need the following notion: given a subsequence $X'$ of $X$, the subset of constraints in $C$ in which only variables of $X'$ occur is called the projection of $C$ on $X'$, denoted $C[X']$.

Let $D_i = \{(s_{i1}, f_{i1}), \dots, (s_{im_i}, f_{im_i})\}$, for $1 \le i \le k$, be the set of the start and end positions of the found occurrences of $P_i$. A constraint system with $k$ variables $x_1, \dots, x_k$, such that $x_i \in D_i$, can be defined as follows. Adjacent patterns are constrained as specified by the structured model: for $1 \le i < k$ we

must have

$$l_i \le (x_{i+1})_1 - (x_i)_2 \le u_i, \qquad (1)$$

where $(\cdot)_1$ and $(\cdot)_2$ are the first and the second component of a pair, respectively. We denote the set of these constraints, for $1 \le i < k$, by $C_{Adj}$.

As we admit the possibility of missing at most $k - q$ patterns, we must derive constraints on non-adjacent patterns. How to do this is a matter of biological considerations rather than algorithmic ones. The question is: why is a pattern not found? One answer may be that the evolutionary process caused that pattern to be deleted from the sequence, thus shortening it. Another possibility is that the pattern degenerated by mutations, or that our search criteria are too stringent: the pattern exists, but we are not able to detect it. To take into account both possibilities, we choose to generalize the interval condition to possibly non-adjacent patterns $P_i$ and $P_{i+h}$, for $1 \le i < k$ and $1 \le h \le \min\{k - i, k - q + 1\}$, as follows:

$$\sum_{r=i}^{i+h-1} l_r \;\le\; (x_{i+h})_1 - (x_i)_2 \;\le\; u_i + \sum_{r=i+1}^{i+h-1} \bigl(|P_r| + e_r + u_r\bigr), \qquad (2)$$

where $|P_r| + e_r$ is the maximum length of a match of $P_r$ according to the model. We denote the set of these constraints by $C$. When $h = 1$ constraint (2) reduces to (1), so $C_{Adj} \subseteq C$. Besides, given a sequence of variables $x_i, x_{i+1}, \dots, x_{i+j-1}, x_{i+j}$, for some $i$ and $j$, if $C_{Adj}[x_i, \dots, x_{i+j}]$ is satisfied by some tuple $(r_i, \dots, r_{i+j}) \in D_i \times \cdots \times D_{i+j}$, then each $C[x_{i_1}, x_{i_2}]$, with $i \le i_1 < i_2 \le i + j$, is automatically satisfied (see footnote 2) by the values $r_{i_1}$ and $r_{i_2}$. Note that, for any distinct pair of variables, there is at most one constraint involving those variables.

The problem of finding $q$-occurrences of a given structured model is equivalent to finding the solutions of all the $C[X']$'s such that $X'$ is a subsequence of variables $x_{i_1}, \dots, x_{i_l}$, with $l \ge q$. CSPs similar to this one, but over continuous intervals, have been investigated in [5], where they are called simple temporal problems (STPs). The approach in [5], however, does not work when finite domains are considered in place of intervals.
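The bounds of constraint (2) are simple sums over the skipped patterns and gaps. The following sketch computes them; it assumes 1-based pattern indices as in the text, with the function name `pair_bounds` and its argument layout being hypothetical choices made for this example. The values used are those of the model of Table 2 with exact search (all edit distances zero).

```python
def pair_bounds(i, h, lengths, errors, intervals):
    """Lower and upper bound of constraint (2) for patterns P_i and P_{i+h}.

    i is 1-based; lengths[r-1] = |P_r|, errors[r-1] = e_r and
    intervals[r-1] = (l_r, u_r) is the gap between P_r and P_{r+1}.
    """
    # Lower bound: skipped patterns may have been deleted, so only gaps count.
    lower = sum(intervals[r - 1][0] for r in range(i, i + h))
    # Upper bound: skipped patterns may be present at their maximum length.
    upper = intervals[i - 1][1] + sum(
        lengths[r - 1] + errors[r - 1] + intervals[r - 1][1]
        for r in range(i + 1, i + h)
    )
    return lower, upper


# Model tgt[4, 20]ct[2, 5]gg[0, 1]ta[2, 15]aca, searched exactly.
lengths = [3, 2, 2, 2, 3]
errors = [0, 0, 0, 0, 0]
intervals = [(4, 20), (2, 5), (0, 1), (2, 15)]

print(pair_bounds(1, 1, lengths, errors, intervals))  # (4, 20)
print(pair_bounds(1, 2, lengths, errors, intervals))  # (6, 27)
```

For h = 1 the second sum is empty, so the result is exactly the interval of constraint (1).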
The special form of our constraints over finite domains allows for a more efficient solution. Let $M = \sum_{i=1}^{k} m_i$ be the total number of occurrences of all the component patterns and let $r_{ij} \in D_i$ denote the $j$th occurrence of the $i$th pattern, for $1 \le i \le k$ and $1 \le j \le m_i$. The constraint graph for $C$ is a directed acyclic graph with $M$ nodes labelled by the $r_{ij}$'s such that there is an edge from $r_{ij}$ to $r_{(i+h)l}$, with $1 \le h \le \min\{k - i, k - q + 1\}$ and $1 \le l \le m_{i+h}$, if and only if the constraint $C[x_i, x_{i+h}]$ is satisfied when instantiated with the values $r_{ij}$ and $r_{(i+h)l}$.

The set of nodes of the graph can be partitioned into $k$ layers $D_1, \dots, D_k$, each one corresponding to the (ordered) set of occurrences of a component pattern. To emphasize that, we denote the graph with $((D_1, \dots, D_k), E)$, where $\bigcup_{i=1}^{k} D_i$ is the set of vertices and $E$ is the set of edges. The edges connect only vertices in layers $D_i$ and $D_j$ where $1 \le j - i \le k - q + 1$, and they are always directed towards a layer with greater index. We denote an edge from node $r$ to node $r'$ with $r \to r'$. Figure 2 shows an example of such a graph.

Footnote 2: We assume that the empty set of constraints is trivially satisfied.

The nodes can be totally ordered according to the start positions of the labelling occurrences, but for our purposes it is more useful to restrict such ordering to nodes in the same layer. So, we define a partial ordering over $\bigcup_{i=1}^{k} D_i \cup \{\bot\}$, where $\bot$ is a special element denoting the absence of a node, as follows: given two nodes $r_1 = (s_1, f_1)$ and $r_2 = (s_2, f_2)$, we say that $r_1$ precedes $r_2$, written $r_1 \prec r_2$, if $r_1$ and $r_2$ belong to the same layer and $s_1 < s_2$. Besides, $r \prec \bot$ for every $r$. We write $r_1 \preceq r_2$ if either $r_1 \prec r_2$ or $r_1 = r_2$. The immediate successor of a node $r$ with respect to $\prec$ is written $succ(r)$. We define $succ(\bot) = \bot$.

To build the graph it is not necessary to store all the edges explicitly. If $r \to r_1, r \to r_2 \in E$ are two edges of the constraint graph and $r_1 \preceq r_2$, then for all nodes $r_3$ such that $r_1 \preceq r_3 \preceq r_2$ we have $r \to r_3 \in E$ (see Lemma 5.1). This suggests that every node needs to have at most $2(k - q + 1)$ outgoing edges to the successive $k - q + 1$ layers. For each layer, the first edge goes into the occurrence with the lowest start position that satisfies the corresponding constraint and the second edge goes into the occurrence with the highest start position that satisfies the constraint. We call these two edges the low edge and the high edge, respectively. The remaining edges are implicitly determined.

Given a low edge $r \to r'$, where $r \in D_i$ and $r' \in D_{i+j}$, we denote $r'$ with $r.low_j$ and we say that $r \to r'$ is the low $j$-edge of $r$. Similarly, if $r \to r'$ is a high edge, we denote $r'$ with $r.high_j$, and we say that $r \to r'$ is the high $j$-edge of $r$. If a node $r$ has no outgoing $j$-edges for some $j$, we define $r.low_j = r.high_j = \bot$. We call the representation of the graph by means of low and high edges the implicit representation of the (explicit) constraint graph (see Figure 3). We denote the set of low and high edges with $\tilde{E}$.

The construction of the implicit graph can be done in $O(M)$ time and space if the occurrences of the $P_i$'s are output in increasing order by the pattern matching algorithms (which is usually the case).
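The low and high edges between one pair of layers can be found with two linear scans, in the spirit of Algorithm 1. The sketch below is a hedged Python rendering for a single pair of layers: the function name `low_high_edges`, the 0-based indices and the use of `None` for a missing edge are assumptions of this example, and the bounds are passed in directly. It relies on both layers being sorted by start position and on no occurrence being contained in another.

```python
def low_high_edges(source, target, lower, upper):
    """Low and high edge targets from each node of one layer to another.

    source and target are lists of (start, end) occurrences sorted by start.
    Returns two lists indexed like source: positions in target of the low
    and high edge of each node, or None if the node has no such edge.
    """
    low = [None] * len(source)
    high = [None] * len(source)

    l = 0  # low edge pointer, only ever moves right
    for j, (_, f) in enumerate(source):
        while l < len(target) and target[l][0] - f < lower:
            l += 1
        if l < len(target) and target[l][0] - f <= upper:
            low[j] = l

    u = len(target) - 1  # high edge pointer, only ever moves left
    for j in range(len(source) - 1, -1, -1):
        f = source[j][1]
        while u >= 0 and target[u][0] - f > upper:
            u -= 1
        if u >= 0 and target[u][0] - f >= lower:
            high[j] = u

    return low, high


# Layers D_1 (tgt) and D_2 (ct) of Table 2, constrained by the gap [4, 20].
d1 = [(1, 4), (12, 15)]
d2 = [(8, 10), (15, 17), (24, 26), (28, 30)]
print(low_high_edges(d1, d2, 4, 20))  # ([0, 2], [2, 3])
```

In this example the first tgt occurrence induces the interval from the first to the third ct occurrence, and the second tgt occurrence induces the interval from the third to the fourth; every target strictly inside an interval is reachable without being stored.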
Under the assumption that the nodes in each layer are sorted, low and high edges between layers $D_i$ and $D_{i+h}$ can be determined in $O(m_i + m_{i+h})$ time by making two linear searches: the first search determines all the low edges and the second determines all the high edges. Algorithm 1 gives a description of the procedure. The two innermost for loops (lines 9–17 and 20–28) take $O(m_i + m_{i+h})$ time for each pair of layers $D_i$ and $D_{i+h}$ to add $O(m_i)$ edges. The search must be done for $\sum_{h=1}^{k-q+1} (k - h)$ pairs of layers. The overall time complexity in the worst case is therefore

$$O\Bigl(\sum_{h=1}^{k-q+1} \sum_{i=1}^{k-h} (m_i + m_{i+h})\Bigr) = O\Bigl(2 \sum_{h=1}^{k-q+1} \sum_{i=1}^{k} m_i\Bigr) = O(2(k - q + 1)M) = O(M).$$

The total number of edges is bounded by $2(k - q + 1)M$, so the space requirement is also $O(M)$.

5.3 Some properties of the constraint graph

Given two nodes $r_1$ and $r_2$, the interval $[r_1, r_2]$ is the set of nodes $[r_1, r_2] = \{\, r \mid r_1 \preceq r \preceq r_2 \,\}$. The nodes $r_1$ and $r_2$ are the endpoints of the interval. Given a node $r \in D_i$ and an integer $j$, the interval $int_j(r) = [r.low_j, r.high_j]$ is said to be induced by $r$. If $r$ has no outgoing $j$-edges, then $int_j(r) = [\bot, \bot] = \emptyset$. Of course, each node induces at most $k - q + 1$ non-empty intervals. The basic property of induced intervals is expressed by the following simple lemma.

Lemma 5.1 Let $r \in D_i$. For each $r' \in int_j(r)$, the constraint $C[x_i, x_{i+j}]$ is satisfied when instantiated with $(r, r')$.

Algorithm 1 Constraint Graph Construction
Require: occurrences in each D_i must be sorted by increasing start position.
     1: Build-Constraint-Graph({P_i, e_i, D_i} for 1 ≤ i ≤ k, {[l_i, u_i]} for 1 ≤ i ≤ k − 1, q)
     2: Let (D_1, ..., D_k) be the (layered) set of nodes of the graph
     3: Ẽ ← ∅
     4: for i ← 1 to k − 1 do   {For each layer but the last}
     5:   for h ← 1 to min{k − i, k − q + 1} do   {and for each reachable layer}
     6:     Let L and U be the lower and upper bound, respectively, on the relative distance between two occurrences of P_i and P_{i+h}
     7:     {Determine low edge target}
     8:     l ← 1   {Low edge pointer}
     9:     for j ← 1 to m_i do   {For each node in layer D_i}
    10:       while l ≤ m_{i+h} and s_{(i+h)l} − f_{ij} < L do
    11:         l ← l + 1
    12:       end while
    13:       if l ≤ m_{i+h} and s_{(i+h)l} − f_{ij} ≤ U then
    14:         {Add low edge to (i + h)th layer}
    15:         Ẽ ← Ẽ ∪ {(s_{ij}, f_{ij}) → (s_{(i+h)l}, f_{(i+h)l})}
    16:       end if
    17:     end for
    18:     {Determine high edge target}
    19:     u ← m_{i+h}   {High edge pointer}
    20:     for j ← m_i down to 1 do
    21:       while u > 0 and s_{(i+h)u} − f_{ij} > U do
    22:         u ← u − 1
    23:       end while
    24:       if u > 0 and s_{(i+h)u} − f_{ij} ≥ L then
    25:         {Add high edge to (i + h)th layer}
    26:         Ẽ ← Ẽ ∪ {(s_{ij}, f_{ij}) → (s_{(i+h)u}, f_{(i+h)u})}
    27:       end if
    28:     end for
    29:   end for
    30: end for
    31: return the implicit constraint graph G = ((D_1, ..., D_k), Ẽ)

Proof. By construction, the pairs $r, r.low_j$ and $r, r.high_j$ both satisfy $C[x_i, x_{i+j}]$, i.e. they satisfy (2). By hypothesis, $r.low_j \preceq r' \preceq r.high_j$. So we have $(r.low_j)_1 - (r)_2 \le (r')_1 - (r)_2 \le (r.high_j)_1 - (r)_2$. Therefore, the pair $(r, r')$ satisfies $C[x_i, x_{i+j}]$, too.

We define two relations over intervals: given $I_1 = [r_1, r_2]$ and $I_2 = [r_3, r_4]$, we say that $I_1$ is before $I_2$, written $I_1 \mathrel{B} I_2$, if $r_2 \prec r_3$; we say that $I_1$ overlaps $I_2$, written $I_1 \mathrel{O} I_2$, if $r_1 \preceq r_3 \preceq r_2 \preceq r_4$. According to this definition, the overlapping relation is neither symmetric nor transitive, but it is reflexive. The following lemma expresses an important property of the graph, which we make extensive use of in the development of our algorithms.

[Figure 2: The explicit constraint graph for $q = 4$, derived from the example in Table 2. Nodes are arranged in layers $D_1 = \{r_{11}, r_{12}\}$, $D_2 = \{r_{21}, r_{22}, r_{23}, r_{24}\}$, $D_3 = \{r_{31}, r_{32}, r_{33}, r_{34}\}$, $D_4 = \{r_{41}\}$, $D_5 = \{r_{51}, r_{52}, r_{53}\}$.]

[Figure 3: The implicit constraint graph, where non-feasible nodes have been grayed. When a low and a high edge of a node coincide, only one edge has been drawn. Layers as in Figure 2.]

Lemma 5.2 Suppose that no match of a component pattern is contained in another. Let $r_1$ and $r_2$ be two nodes of the constraint graph such that $r_1 \preceq r_2$. If, for a given $j$, $int_j(r_1) \ne \emptyset$, then either $int_j(r_1) \mathrel{O} int_j(r_2)$ or $int_j(r_1) \mathrel{B} int_j(r_2)$.

Proof. If $r_1 = r_2$ then $int_j(r_1) \mathrel{O} int_j(r_2)$ for any $j$. So, suppose that $r_1 \prec r_2$. If $int_j(r_2) = \emptyset$, then $int_j(r_1) \mathrel{B} int_j(r_2)$. Otherwise, by hypothesis, we have

$$L \le (r_1.low_j)_1 - (r_1)_2 \le (r_1.high_j)_1 - (r_1)_2 \le U$$

and

$$L \le (r_2.low_j)_1 - (r_2)_2 \le (r_2.high_j)_1 - (r_2)_2 \le U,$$

where $L$ and $U$ are computed as specified by (2). Let $r_1 = (s_1, f_1)$, $r_2 = (s_2, f_2)$, $r_1.low_j = (s_3, f_3)$, $r_1.high_j = (s_4, f_4)$, $r_2.low_j = (s_5, f_5)$ and $r_2.high_j = (s_6, f_6)$. The above inequalities can be rewritten as

$$L \le s_3 - f_1 \le s_4 - f_1 \le U \qquad (3)$$

and

$$L \le s_5 - f_2 \le s_6 - f_2 \le U. \qquad (4)$$

Since no match is contained in another, we also have $f_1 \le f_2$. Therefore, $L \le s_5 - f_1$. Since $(s_3, f_3)$ is, by hypothesis, the least occurrence satisfying the lower bound in (3), we must conclude that $s_3 \le s_5$, so $r_1.low_j \preceq r_2.low_j$. A similar reasoning applies to $s_4$ and $s_6$: from (3) and by using the hypothesis, we get $s_4 - f_2 \le U$. Since $(s_6, f_6)$ is the maximum occurrence satisfying the upper bound in (4), we must conclude that $s_4 \le s_6$, and so $r_1.high_j \preceq r_2.high_j$. Therefore, if $r_2.low_j \preceq r_1.high_j$, we have $int_j(r_1) \mathrel{O} int_j(r_2)$, otherwise $int_j(r_1) \mathrel{B} int_j(r_2)$.

Corollary 5.3 Suppose that no match of a component pattern is contained in another. Let $r_1$ and $r_2$ be two nodes of the constraint graph such that $r_1 \preceq r_2$. If $r_1$ has outgoing $j$-edges, then $r_1.low_j \preceq r_2.low_j$ and $r_1.high_j \preceq r_2.high_j$.

Correctness of Algorithm 1 is based on Corollary 5.3. It means that, in order to draw low (resp., high) $j$-edges, it is sufficient to scan the nodes located $j$ layers below from left to right (resp., from right to left), without ever going back, i.e. a linear search will do.

It is less obvious that the nodes whose edges go into a given node $r$ also form an interval. This is what the following lemma states.

Lemma 5.4 Suppose that no match of a component pattern is contained in another. Let $r_1$, $r_2$ and $r_3$ be three nodes such that $r_1 \preceq r_2 \preceq r_3$. If $int_j(r_1) \ne \emptyset$ and $int_j(r_1) \mathrel{O} int_j(r_3)$, then $int_j(r_2) \ne \emptyset$, $int_j(r_1) \mathrel{O} int_j(r_2)$ and $int_j(r_2) \mathrel{O} int_j(r_3)$.

Proof. By Corollary 5.3, it is sufficient to prove that $int_j(r_2) \ne \emptyset$. Let $r_1 = (s_1, f_1)$, $r_2 = (s_2, f_2)$, $r_3 = (s_3, f_3)$ and $r_3.low_j = (s_4, f_4)$. By hypothesis, we have $L \le s_4 - f_3$, where $L$ has been computed according to the leftmost sum in (2). As the intervals induced by $r_1$ and $r_3$ overlap, $r_3.low_j \in int_j(r_1)$, so, by Lemma 5.1, $L \le s_4 - f_1$. Moreover, $f_1 \le f_2 \le f_3$, because no match is contained in another. Thus, $L \le s_4 - f_2$.
A similar reasoning allows us to prove that the upper bound is also satisfied. Then, $r_2$ must have a $j$-edge to $r_3.low_j$, that is, $int_j(r_2) \ne \emptyset$.

Given a node $r \in D_i$, let $parent_j(r) = \{\, r' \mid r' \in D_{i-j} \wedge r' \to r \in E \,\}$ be the set of nodes in layer $D_{i-j}$ having an outgoing edge entering $r$ (in the explicit graph). We define $parent_j(r) = \emptyset$ if $j \ge i$. We call $parent_j(r)$ the parent ($j$-)interval of $r$. The name is justified by the following corollary.

Corollary 5.5 For any node $r$ and integer $j$, $parent_j(r)$ is an interval.

In a way similar to what we have done for low and high edges, one could also determine the endpoints of such intervals and store the corresponding edges delimiting parent intervals. It is not difficult to see that the properties we have proved for induced intervals hold for parent intervals, too.

5.4 How to output all the solutions

Once the graph has been built, in order to get the solutions, i.e. all $q$-occurrences of the structured motif, it is sufficient to output all the paths (of the explicit

graph) whose length is at least $q - 1$. Every such path corresponds to a valid match of the structured model.

The problem of what to report when searching for complex patterns such as regular expressions has been addressed by Myers et al. in [18]. In principle, the algorithms they propose can also be applied to structured motifs (which are but a restricted form of regular expressions). Using the terminology of Myers et al., component patterns can be seen as the tagged subexpressions of the translation of the structured motif into a regular expression. The problem with the above-mentioned approaches is that, in the worst case, the number of possible matches is exponential in the number $k$ of component patterns. Although the constraint graph is sparse in practical cases, so that computing all the paths is not impractical, we can definitely do better than that. The idea is to give a suitable transformation of the graph as a convenient and compact output, which can be computed in time proportional to the size of the graph.

Let us fix some terminology and notation. A source is a node that, in the explicit version of the graph, has no incoming edges. A leaf is a node that has no outgoing edges. We denote a path from $r$ to $r'$ with $r \leadsto r'$. In the following, we always refer to paths in the explicit representation of the graph. Given a node $r$, let $\overleftarrow{L}_r$ be the length of the longest path from a source to $r$, and let $\overrightarrow{L}_r$ be the length of the longest path from $r$ to a leaf (the length of a path being the number of its edges). Thus, a source is any node for which $\overleftarrow{L}_r = 0$ and, similarly, a leaf is any node for which $\overrightarrow{L}_r = 0$. We say that a node is feasible if $\overleftarrow{L}_r + \overrightarrow{L}_r \ge q - 1$. A feasible node represents an occurrence of a component pattern that certainly belongs to an approximate or exact match to the structured motif. Conversely, a node that is not feasible corresponds to an occurrence that cannot belong to any solution.
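The feasibility test can be illustrated with a minimal Python sketch. It is not the linear-time procedure developed in the following pages: it assumes the explicit graph is given as a plain adjacency dictionary (rather than as low and high edges), and it computes the two longest-path lengths by memoized recursion. The function name `feasible_nodes` and the toy graph are hypothetical and serve only as an illustration.

```python
from functools import lru_cache


def feasible_nodes(edges, q):
    """Return the nodes lying on some path with at least q - 1 edges.

    `edges` maps each node to the list of its successors in the explicit
    (acyclic) constraint graph; this explicit encoding is an assumption of
    the sketch, not the paper's implicit interval representation.
    """
    nodes = set(edges) | {v for succ in edges.values() for v in succ}
    preds = {v: [] for v in nodes}
    for u, succ in edges.items():
        for v in succ:
            preds[v].append(u)

    @lru_cache(maxsize=None)
    def to_leaf(u):
        # Longest path (in edges) from u to a leaf.
        return max((1 + to_leaf(v) for v in edges.get(u, ())), default=0)

    @lru_cache(maxsize=None)
    def from_source(u):
        # Longest path (in edges) from a source to u.
        return max((1 + from_source(p) for p in preds[u]), default=0)

    return {u for u in nodes if from_source(u) + to_leaf(u) >= q - 1}


# Toy graph with three layers: a1 -> b1 -> c1, plus a2 -> c1 (a2 skips a layer).
g = {"a1": ["b1"], "b1": ["c1"], "a2": ["c1"]}
print(sorted(feasible_nodes(g, 3)))  # ['a1', 'b1', 'c1']
print(sorted(feasible_nodes(g, 2)))  # ['a1', 'a2', 'b1', 'c1']
```

In the toy graph, the node `a2` lies only on a path with one edge, so it is feasible when two component patterns suffice (q = 2) but not when all three are required (q = 3).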
If we are able to modify the low and high edges so that they always point to feasible nodes, that is, if we are able to shrink each induced interval $[r_1, r_2]$ to a maximal subinterval $[r_1', r_2']$ such that $r_1'$ and $r_2'$ are feasible nodes, then the solution can be implicitly characterized by the subgraph of the modified graph restricted to the set of feasible nodes. Actually, this is true only if $q = k$: if $q < k$, not every path from a source to a leaf necessarily spells a valid match, so some further manipulation of the graph is needed.

The first problem to solve is how to compute $\overleftarrow{L}_r$ and $\overrightarrow{L}_r$. A naïve approach consists in propagating the values from one layer to another following induced intervals or parent intervals, respectively, i.e. in making a breadth first visit. But this is quadratic in the number of nodes. However, we can exploit the fact that valid paths have a limited range of lengths to avoid unnecessary updates. For a feasible $r \in D_i$, it must be $\max\{0,\, i - 1 - (k - q)\} \le \overleftarrow{L}_r \le i - 1$ and $\max\{0,\, q - i\} \le \overrightarrow{L}_r \le k - i$. We call such values the feasible values for nodes in layer $D_i$. Now, suppose that all the lengths are initialized to zero. The values associated with a node need to be updated only if they can receive a greater and feasible value. If a node $r$ is assigned only feasible values, $\overleftarrow{L}_r$ and $\overrightarrow{L}_r$ must be updated at most a constant number of times, namely $k - q + 1$ times, because this is the number of different feasible values for each length.

Algorithm 2 loops through all pairs of connected layers to compute the $\overrightarrow{L}_r$'s bottom up. For each layer $D_i$, starting from the last layer but one and going upwards, the adjacency sets of nodes $r \in D_i$ are examined and the maximum length value is determined. The adjacency sets are scanned by going through

layers $D_{i+1}, \dots, D_{i+k-q+1}$ sequentially. For each pair $D_i$ and $D_{i+j}$ the procedure updates $\overrightarrow{L}_r$ for each $r \in D_i$ if necessary, that is, if the maximum of the length values in $int_j(r)$ is feasible and greater than $\overrightarrow{L}_r$ (lines 15–16). Nodes in layer $D_{i+j}$ are scanned from left to right, and two pointers are kept: rightend is the first node in $D_{i+j}$ not already scanned, and rmax is the rightmost node (i.e., the maximum node with respect to $\prec$) in the interval $int_j(r)$ having the maximum length value if such value is feasible, otherwise it coincides with rightend. Of course, it is always $rmax \preceq rightend$. The inner if clause (lines 10–14) avoids redundant computation by skipping the interval of already scanned nodes whose maximum length value has already been computed. By doing so, nodes less than rmax are no longer taken into consideration.

Let $I_1$ and $I_2$ be two induced intervals such that $I_1\ O\ I_2$, and suppose that max and rmax have been computed for $I_1$. If $rmax \in I_2$, then the maximum length value in $I_1 \cap I_2$ is max, so we only need to check the nodes in $I_2 \setminus I_1$. If $rmax \notin I_2$, we must scan again the part of $I_2$ that starts at $succ(rmax)$, but we know that the maximum value in $I_1 \cap I_2$ must be strictly less than max. So, if rmax for $I_2$ happens to be in $I_1 \cap I_2$, then the corresponding max must be less than the previous value. This guarantees that a node is never scanned more than $k - q + 1$ times. Thus, for each pair of layers $D_i$ and $D_{i+j}$, the time needed to update the values in $D_i$ is $O(m_i + (k - q + 1)\, m_{i+j})$. This work must be done for $\sum_{j=1}^{k-q+1} (k - j)$ pairs of layers. Hence, the total time required by Compute-$\overrightarrow{L}_r$ is $O\bigl(\sum_{j=1}^{k-q+1} \sum_{i=1}^{k-j} (m_i + (k - q + 1)\, m_{i+j})\bigr) = O(M)$.

The $\overleftarrow{L}_r$'s can be computed by a symmetric procedure using parent intervals instead of induced intervals. However, since storing parent intervals doubles the required space for the graph, one can also devise a procedure propagating $\overleftarrow{L}_r$ from each $r$ to its induced intervals.
Algorithm 3 shows how this can be done in linear time when $q = k$. The extension to the general case $q \le k$ requires the use of $k - q + 1$ pointers to nodes instead of only one, but the principle is the same. At the $i$th iteration, the variable ptr contains the first node in $D_{i+1}$ that has not been scanned yet. Such pointer is used to avoid assignments to already examined nodes (lines 9–10). Thus, the time complexity of Compute-$\overleftarrow{L}_r$ is $O(M)$. As before, allowing $q < k$ only adds a constant factor to such bound.

After having computed path lengths, the graph can be restricted to feasible nodes. If a low edge enters a node that is not feasible, that low edge must be moved to the right, if possible. The same holds for a high edge, the only difference being that it must be moved to the left. In other words, we have to restrict the interval induced by a node to its maximal subinterval having feasible endpoints. Moving a low (resp., high) edge from a node to its successor (resp., predecessor) corresponds to deleting one edge in the explicit constraint graph. Such an edge connects two occurrences of component patterns that locally satisfy their distance constraint, but one of them does not belong to any match of the structured motif. Algorithm 4 shows how this operation can be done for low edges. When a layer is processed and a feasible node $r$ is encountered, all low edges pointing to non-feasible nodes before $r$ are moved to target $r$. Such low edges can be easily retrieved if $k - q + 1$ pointers are maintained, each one performing a linear scan of one of the $k - q + 1$ layers above the current layer. The overall time complexity is therefore $O\bigl(\sum_{i=2}^{k} (m_i + \sum_{j=1}^{k-q+1} m_{i-j})\bigr) = O(M)$. High edges can be processed by a symmetric procedure, examining nodes in decreasing order.

Algorithm 2 Compute $\overrightarrow{L}_r$ for all nodes $r$
 1: Compute-$\overrightarrow{L}_r$($G$)
 2: for all nodes $r \in \bigcup_{i=1}^{k} D_i$ do
 3:     $\overrightarrow{L}_r \leftarrow 0$
 4: end for
 5: for $i \leftarrow k - 1$ to $1$ do {For each layer but the last, in decreasing order}
 6:     for $j \leftarrow 1$ to $\min\{k - i,\, k - q + 1\}$ do {and for each next feasible layer}
 7:         rmax $\leftarrow$ rightend $\leftarrow$ the first node in $D_{i+j}$
 8:         for all nodes $r \in D_i$, in increasing order, do
 9:             if $int_j(r) \neq \emptyset$ then {If $r$ has $j$-edges}
10:                 if rmax $\in int_j(r)$ then
11:                     let max be the maximum among $\overrightarrow{L}_{rmax}$ and the values in the part of $int_j(r)$ that starts at rightend
12:                 else
13:                     let max be the maximum among the values in $int_j(r)$
14:                 end if
15:                 if max is a feasible value then
16:                     $\overrightarrow{L}_r \leftarrow \max\{\overrightarrow{L}_r,\, max + 1\}$
17:                     rmax $\leftarrow$ the rightmost node $r' \in int_j(r)$ such that $\overrightarrow{L}_{r'} = max$
18:                 else {Skip non-feasible nodes}
19:                     rmax $\leftarrow succ(r.high_j)$
20:                 end if
21:                 rightend $\leftarrow succ(r.high_j)$
22:             end if
23:         end for
24:     end for
25: end for

Algorithm 3 Compute $\overleftarrow{L}_r$ for all nodes $r$ when $q = k$
Require: $q = k$
 1: Compute-$\overleftarrow{L}_r$($G$)
 2: for all nodes $r \in \bigcup_{i=1}^{k} D_i$ do
 3:     $\overleftarrow{L}_r \leftarrow 0$
 4: end for
 5: for $i \leftarrow 1$ to $k - 1$ do {For all layers, but the last, from top to bottom}
 6:     Let ptr be the first node in layer $D_{i+1}$
 7:     for all nodes $r \in D_i$, in increasing order, do
 8:         if $\overleftarrow{L}_r = i - 1$ then {If $r$ has a feasible length value}
 9:             for all nodes $r'$ in the part of $int_1(r)$ that starts at ptr do
10:                 $\overleftarrow{L}_{r'} \leftarrow i$
11:             end for
12:             ptr $\leftarrow succ(r.high_1)$
13:         end if
14:     end for
15: end for

Algorithm 4 Shrink induced intervals to have feasible left endpoints
 1: Adjust-Low-Edges($G$)
 2: for $i \leftarrow 2$ to $k$ do {For all layers but the first, in increasing order}
 3:     Let ptr be the first node in $D_i$
 4:     for all nodes $r \in D_i$, in increasing order, do
 5:         if $r$ is feasible then
 6:             for all induced intervals $[r_1, r_2]$ with ptr $\preceq r_1 \prec r \preceq r_2$ do
 7:                 replace $[r_1, r_2]$ with $[r, r_2]$
 8:             end for
 9:             ptr $\leftarrow succ(r)$
10:         end if
11:     end for
12:     change all remaining induced intervals to the empty interval
13: end for

[Figure 4: The graph of Figure 3 restricted to feasible nodes. Layers: $D_1 = \{r_{11}, r_{12}\}$, $D_2 = \{r_{22}, r_{24}\}$, $D_3 = \{r_{32}, r_{33}\}$, $D_4 = \{r_{41}\}$, $D_5 = \{r_{51}, r_{52}, r_{53}\}$.]

Let $\tilde{G} = ((\tilde{D}_1, \dots, \tilde{D}_k), \tilde{E})$ be the implicit constraint graph after the execution of the previous algorithms, where each $\tilde{D}_i \subseteq D_i$ is the subset of feasible nodes of $D_i$. If $q = k$, then every node in each layer induces a non-empty interval in the next layer. Therefore, all (explicit) paths starting from a node in the first layer reach a node in the last layer traversing all layers, i.e. each path represents a valid match of the structured motif. So, $\tilde{G}$ is a compact representation of all the solutions, which can be given as a suitable output.

If $q < k$, two situations must be considered. First, an edge can join two feasible nodes whose corresponding occurrences do not necessarily belong to the same match of the structured motif (as for $r_{12} \to r_{32}$ in Figure 4). Such an edge must be deleted. Second, although a valid path always exists that passes through a feasible node, not all the paths through it are necessarily long enough (see Figure 4). To detect these situations a visit of the graph is done. The kind of visit we need is a slightly modified depth first search, described in Algorithm 5. When a node belongs to a path that is not long enough (e.g., in Figure 4, $r_{33}$ belongs to $r_{12} \to r_{33} \to r_{51}$) but its path edges should not be deleted because

they are part of feasible paths (e.g., $r_{12} \to r_{33}$ belongs to $r_{12} \to r_{33} \to r_{41} \to r_{51}$ and $r_{33} \to r_{51}$ belongs to $r_{12} \to r_{24} \to r_{33} \to r_{51}$), that node must be cloned in order to eliminate the infeasibility. Since the path lengths are in a limited range, a node must be duplicated at most a constant number of times.

Let $G' = (V', E')$ be the graph produced by Constraint-Graph-Visit. The nodes in $G'$ are pairs $(r, l)$ where $r$ is a node of $\tilde{G}$ and $l$ is the length of a path reaching $r$ from a source. When a node $r$ is visited through a path of length $l$, a node $r'$ adjacent to $r$ is visited only if the current path can be extended to a feasible path by going through $r'$ (line 4 in procedure Visit) and if $r'$ has not yet been reached by a path of length $l + 1$ (lines 5–6). So, a node can be visited (and then duplicated) as many times as the number of paths with different lengths that enter that node, that is, at most $k - q + 1$ times. Therefore, the number of nodes in the output graph is still $O\bigl(\sum_{i=1}^{k} |\tilde{D}_i|\bigr)$ (which is $O(M)$, although, in typical practical situations, $\sum_{i=1}^{k} |\tilde{D}_i| \ll M$), and the time complexity of the visit is asymptotically the same as a standard depth first search, that is, in the worst case, $O\bigl((\sum_{i=1}^{k} |\tilde{D}_i|)^2\bigr)$. This is the only quadratic step in our method.

Finally, Algorithm 5 reduces to a standard depth first search of $\tilde{G}$ when $q = k$. The conditional statement in line 4 of procedure Visit avoids following edges that lead to paths that are necessarily too short, so it makes it possible to eliminate all non-feasible paths. It can be easily verified that, for every path $r_1 \to r_2 \to \cdots \to r_n$ from a source to a leaf of (the explicit version of) $\tilde{G}$, there are two possibilities: either $n < q$, in which case there is no corresponding path in the output graph, or $n \ge q$, in which case there is a path $(r_1, 0) \to (r_2, 1) \to \cdots \to (r_n, n - 1)$ in the output graph. Figure 5 shows the output of our running example.
Algorithm 5 Visit of the modified graph restricted to feasible nodes
 1: Constraint-Graph-Visit($\tilde{G}$)
 2: $V' \leftarrow E' \leftarrow \emptyset$
 3: for all sources $s$ do
 4:     Visit($s$, 0)
 5: end for
 6: return $(V', E')$

 1: Visit($r$, $l$)
 2: $V' \leftarrow V' \cup \{(r, l)\}$
 3: for all nodes $r' \in \bigcup_{j=1}^{k-q+1} int_j(r)$ do {Explore $r$'s adjacency set}
 4:     if $l + 1 + \overrightarrow{L}_{r'} \ge q - 1$ then {If there is a feasible path through $r'$}
 5:         if $(r', l + 1) \notin V'$ then
 6:             Visit($r'$, $l + 1$)
 7:         end if
 8:         $E' \leftarrow E' \cup \{(r, l) \to (r', l + 1)\}$
 9:     end if
10: end for
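The visit can be sketched in Python as follows. This is only an illustration under stated assumptions: the graph is given explicitly as an adjacency dictionary already restricted to feasible nodes, the longest distance to a leaf is recomputed by memoized recursion instead of being read from the earlier passes, and the test of line 4 is written as "the path reaching the neighbour, extended by the longest path from the neighbour to a leaf, has at least q - 1 edges". The name `constraint_graph_visit` and the toy graph are hypothetical.

```python
def constraint_graph_visit(edges, q):
    """Modified depth-first visit producing (node, path length) pairs.

    `edges` maps each node to its successors in the explicit graph
    (assumed acyclic and already restricted to feasible nodes).
    Returns the pair (V, E) of visited copies and edges between copies.
    """
    nodes = set(edges) | {v for succ in edges.values() for v in succ}
    has_parent = {v for succ in edges.values() for v in succ}

    to_leaf = {}

    def longest_to_leaf(u):
        # Longest path (in edges) from u to a leaf, memoized in to_leaf.
        if u not in to_leaf:
            to_leaf[u] = max(
                (1 + longest_to_leaf(v) for v in edges.get(u, ())), default=0
            )
        return to_leaf[u]

    visited, out_edges = set(), set()

    def visit(r, l):
        visited.add((r, l))
        for nxt in edges.get(r, ()):
            # Follow the edge only if some path through it is long enough.
            if l + 1 + longest_to_leaf(nxt) >= q - 1:
                if (nxt, l + 1) not in visited:
                    visit(nxt, l + 1)
                out_edges.add(((r, l), (nxt, l + 1)))

    for s in sorted(nodes - has_parent):  # sources: no incoming edges
        visit(s, 0)
    return visited, out_edges


# Toy graph with four layers; a -> c and a -> d skip layers, and q = 3.
g = {"a": ["b", "c", "d"], "b": ["c"], "c": ["d"]}
V, E = constraint_graph_visit(g, 3)
print(sorted(V))
# [('a', 0), ('b', 1), ('c', 1), ('c', 2), ('d', 2), ('d', 3)]
print((("a", 0), ("d", 1)) in E)  # False: a -> d alone is too short
```

In this toy run the nodes `c` and `d` are cloned because they are reached by paths of two different lengths, while the edge from `a` directly to `d` is dropped, since the only path through it has a single edge.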

[Figure 5: The output graph derived from the visit of the graph in Figure 4.]

6 Experimental results

We have written a program, called SMaRTFinder (SM stands for structured motif, and RT is a reminder of our motivating problem, i.e. finding retrotransposons). We have implemented an exact pattern matching algorithm using Kurtz's implementation of suffix trees [9], and Sellers' algorithm for approximate pattern matching [10]. Both algorithms can work with the IUPAC alphabet (actually, they can work with any predefined scheme for comparing characters). The program is written in standard C++ (apart from Kurtz's code, which is in C). Since Sellers' algorithm is not optimal, being quadratic in the length of the text to be scanned, we did not include it in our tests. We plan to use more efficient algorithms in the future.

The following test was executed on a PowerPC G4 400 MHz machine with 384 MB RAM running Mac OS X. The program was compiled with g++ v3.3 with the option -O3. A set of 1000 structured models over $\Sigma_{\mathrm{IUPAC}}$ was randomly generated by randomly choosing, for each model, the number $k \in [3, 8]$ of component patterns, the length $l \in [5, 10]$ of each component pattern and $k - 1$ subintervals of $[0, 100]$ as gaps. We measured the performance of SMaRTFinder over this sample by processing a 5 Mb DNA sequence belonging to chromosome I of Arabidopsis thaliana to search for $k$-occurrences of the structured motifs with no errors. Since Kurtz's suffix trees can be built either lazily or eagerly, we tested both cases. We then ran Anrep [16], compiled with gcc v3.3 with flag -O3, over the same set of patterns and the same sequence.

The results of this experiment are shown in Figures 6 and 7. Two considerations can be made. First of all, the time needed to build the suffix tree lazily is quickly amortized, so, when many searches are performed on the same sequence, there is no big difference between having a pre-constructed suffix tree and building it incrementally. Second, SMaRTFinder outperforms Anrep in most cases, and it has a much more stable linear behaviour. The running time of Anrep depends very strongly on the statistical technique used to determine the best backtrack order of the search (see [17]). In cases where such a strategy is effective (for instance, when the structured motif has one long rare component pattern and the remaining patterns are very short) Anrep produces better results, but Figures 6 and 7 show that it is usually much slower than our program.

[Figure 6: Eager SMaRTFinder compared with Anrep. Axes: Time (s), Number of matches.]

[Figure 7: Lazy SMaRTFinder compared with Anrep. Axes: Time (s), Number of matches.]

As a second experiment, we processed the whole genome of Arabidopsis thaliana (see [6]) searching for 4-, 5- and 6-occurrences of the structured model

shown in Figure 8, which we had obtained by the multiple alignment of several BAC clones of the rice genome (Oryza sativa). Arabidopsis thaliana was chosen because it is a well studied and annotated genome. This test was done on a Pentium IV 1.6 GHz machine with 512 MB RAM running Linux. Table 3 shows the results, averaged over 10 trials. It can be noticed that allowing missing patterns does not affect the running time too much. The whole genome can be processed in less than 100 seconds, and this time is reduced by a factor of 10 if the suffix trees of the sequences are already available. The best qualitative results were obtained allowing one missing pattern, the number of found occurrences being of the same order of magnitude as other results in the literature (e.g., [6]). Besides, in that case we found no false positives, i.e. all found elements we checked are annotated as retroelements, or at least as genes, in current databases. The reason for that lies, of course, in the good quality of the structured model we used.

TNGA[12,14]TWNYTNNA[19,21]TNTMYRT[4,6]WNCCNNNNRG[72,95]TGNNA[100,125]TNTANRTNRAYGA
Figure 8: A very well conserved feature of a Copia retrotransposon.

We can draw the following conclusion: when patterns are specified in the IUPAC alphabet and we can combine a certain number of them into a structured motif, the search can be effective even if we do not allow errors inside each pattern (or we allow a small number of errors, e.g. one error). Effectiveness stems from the combination of the patterns: in this perspective, allowing missing patterns is a valuable feature.

Sequence   Size      Eager: 6/6  5/6  4/6    Lazy: 6/6  5/6  4/6
chr1at     29 Mb
chr2at     19 Mb
chr3at     22.7 Mb
chr4at     16.9 Mb
chr5at     25.7 Mb
Table 3: Running time of SMaRTFinder for different kinds of search for the structured motif of Figure 8. Time is expressed in seconds. Times for chr1at in the eager case are missing, because the program went out of memory. Nonetheless, the sequence could be processed lazily.

7 Conclusions

Our aim was to develop an algorithm to search whole genomes for structured motifs, with the ability to identify partial matches. We believe that the constraint graph provides the ground for a non-trivial user interface for evaluating the results of the search. We are currently developing a graphical interface for presenting the output to the user in an attractive and meaningful way.


String Matching with Variable Length Gaps Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and David Kofoed Wind Technical University of Denmark Abstract. We consider string matching with variable length