Sparse RNA Folding: Time and Space Efficient Algorithms

Size: px
Start display at page:

Download "Sparse RNA Folding: Time and Space Efficient Algorithms"

Transcription

1 Sparse RNA Folding: Time and Space Efficient Algorithms Rolf Backofen 1, Dekel Tsur 2, Shay Zakov 2, and Michal Ziv-Ukelson 2 1 Albert Ludwigs University, Freiburg, Germany backofen@informatik.uni-freiburg.de 2 Department of Computer Science, Ben-Gurion University of the Negev, Israel {dekelts}, {zakovs}, {michaluz}@cs.bgu.ac.il Abstract. The classical algorithm for RNA single strand folding requires O(nZ) time and O(n 2 ) space, where n denotes the length of the input sequence and Z is a sparsity parameter which satisfies n Z n 2. We show how to reduce the space complexity of this algorithm. The space reduction is based on the observation that some solutions for subproblems are not examined after a certain stage of the algorithm, and may be discarded from memory. This yields an O(nZ) time and O(Z) space algorithm, which outputs both the cardinality of the optimal folding as well as a corresponding secondary structure. The space-efficient approach also extends to the related RNA simultaneous alignment with folding problem, and can be applied to reduce the space complexity of the fastest algorithm for this problem from O(n 2 m 2 ) down to O(nm 2 + Z), where where n and m denote the lengths of the input sequences to be aligned (w.l.o.g. m n), and Z is a sparsity parameter which satisfies nm Z n 2 m 2. In addition, we also show how to speed up the max base-pair variant of RNA single strand folding. The speed up is achieved by combining two independent existing techniques, which restrict the number of expressions that need to be examined in bottleneck computations of these algorithms. This yields an O(LZ) time and O(Z) space algorithm, where L denotes the maximum cardinality of a folding of the input sequence. 1 Introduction The structure of RNA is evolutionarily more conserved than its sequence and is thus key to its functional analysis [3]. Unfortunately, although massive amounts of sequence data are continuously generated, the number of known RNA structures is still very limited since experimental methods, such as NMR and Crystallography, require expertise and long experimental time. Therefore, computational methods for predicting RNA structures are of great value [20, 8, 19]. RNA is typically produced as a single stranded molecule, which then folds upon itself to form a number of short base-paired stems. This base-paired structure is called the secondary structure of the RNA. Paired bases almost always occur in a nested fashion in RNA secondary structure. Under the assumption

2 that the structure does not contain pseudoknots (i.e. crossing base pairs), a model was proposed by Tinoco et al. [9] to calculate the stability (in terms of free energy) of a folded RNA molecule by summing all contributions from the stabilizing, consecutive base pairs and from the loop-destabilizing terms in the secondary structure. Based on this model, dynamic programming algorithms were suggested for computing the most stable structures [16, 14, 21, 1, 17], applying various scoring criteria such as the maximal number of base pairs [14] or the minimal free energy [21]. This optimization problem is formally denoted Single RNA strand folding, and the time and space complexity of the classical algorithms for solving it is O(n 3 ) and O(n 2 ), respectively, where n denotes the length of the input RNA sequence. Recently, these were sped up to yield O(nZ) time and O(n 2 ) space [17] algorithms, where Z is a sparsity parameter which satisfies n Z n 2. We note that these algorithms are practical in the sense that the hidden constants are small. On the more theoretical front, Akutsu suggested an O(D(n)) algorithm for this problem [1], where D(n) is the time for computing the distance product of two n n matrices. The best current bound on D(n) is O(n 3 log 3 log n/ log 2 n) [2]. Another approach to RNA folding is the Sankoff simultaneous alignment with folding (SAF for short) [15, 13, 11, 12, 18]. This approach consists of finding an optimal alignment between a set of RNA sequences, where an alignment score is evaluated with respect to some common folding of the input sequences. However, as stated in [5], even for the simple case where the input consists of only two sequences, this approach requires extreme amounts of memory and space with complexity of O(n 2 m 2 ) space and O(n 3 m 3 ) time, where n and m are the lengths of the input RNA sequences to be aligned. Thus, most existing practical implementations of this algorithm [13, 11, 18] use restricted versions of the original problem. Since these restrictions introduce another source of error, it is of utmost practical importance to the research on RNA to improve both the space and time complexities of the full versions of RNA folding. A first non-heuristic speedup, which does not sacrifice the optimality of results, was recently described in [12]. This work extends the approach of [17] and yields an O(nm Z) time and O(n 2 m 2 ) space algorithm for RNA simultaneous alignment with folding, where Z is a sparsity parameter which satisfies nm Z n 2 m 2. However, experimental analysis of this algorithm indicates that the high memory requirements pose a major bottleneck in practice, both in constraining the lengths of the input sequences to be aligned, as well as in exhausting the benchmark machine s memory, which in turn results in a page-fault slowdown. Our contribution (1.) Reducing the space requirements of both RNA folding problems. In this work we focus on improving the space complexity of the base-pairing maximization variant of the RNA single strand folding problem [16, 14, 1]. The space requirement reduction is based on the observation that some solutions for subproblems are not examined after a certain stage of the algorithms, and may be discarded from memory. This yields an O(nZ) time and O(Z) space algorithm for this

3 Single strand base-paring maximization Single strand energy minimization Simultaneous alignment with folding Previous results New results Time Space Time Space O(n 3 )[14] O(nZ)[17] O(n 2 ) O(LZ) O(Z) O(D(n))[1] O(n 3 )[21] O(nZ)[17] O(n 2 ) O(nZ) O(Z) O(n 3 m 3 )[15] O(nm Z) [12] O(n2 m 2 ) O(nm Z) O(nm 2 + Z) Table 1. Time and space complexities of RNA folding algorithms. problem. In addition to the optimal folding cardinality computation, we show a trace-back procedure which outputs a corresponding secondary structure. Note that it is an interesting challenge on its own to recover an optimal folding within the time and space complexity bounds of the space-reduced algorithm, since due to the sparse representation only partial information is kept. The presented strategy may also be extended to the score computation of a family of RNA folding algorithms, which includes algorithms for the energy minimization variant of the single strand folding problem [21, 17] (improving the space complexity from O(n 2 ) to O(Z)), as well as algorithms for SAF [15, 12] (improving the space complexity from O(n 2 m 2 ) to O(nm 2 + Z)). (2.) A sparse RNA single strand folding algorithm. We also describe a fast algorithm for the max base-pair variant of RNA single-strand folding that exploits an additional sparsity parameter, based on the cardinality of the optimal folding. This is achieved by combining two independent techniques, which were previously used to reduce the number of sub-instance pairs that need to be considered by the algorithm. This combination yields the simultaneous exploitation of two key properties emerging from the formal definitions of these folding problems: the triangle inequality property, previously exploited in [17] and [12], as well as the monotonicity and unit-step properties, previously utilized in [10] for a related problem. The result is an O(LZ) time and O(Z) space algorithm, where L denotes the maximum cardinality of a folding of the input sequence and n Z nl. We note that the algorithms described here are practical in the sense that the hidden constants are small. In the context of practical contribution, we also point out that our space complexity improvements are more significant than the time complexity improvements since while the expected value of L is Θ(n) (assuming uniform character distribution), both Z and Z were experimentally shown to be significantly less than n 2 and n 2 m 2, respectively [17, 12]. Furthermore, reducing the space complexity of the simultaneous alignment with folding problem is a key result in practice, as in the previous results the space complexity was typically the computational bottleneck [5, 12].

4 RoadMap: The rest of the paper proceeds as follows. Preliminary notation and definitions are given in Section 2. The Nussinov-Jacobson [14] recurrence for computing maximum cardinality folding of an RNA sequence, and the application of the elimination technique of Wexler et al. [17] to this recurrence are described in Section 3. The latter technique is used in Section 3.2 for designing a space efficient algorithm for the single strand RNA folding problem. In the following Section 3.3, we show how to implement a space efficient traceback procedure in order to reconstruct an optimal folding from this sparse, space efficient representation. Then, in Section 3.4, the space reduction approach is shown to scale up to the RNA simultaneous alignment with folding problem. In Sections 4.1 we apply another elimination technique, which is similar to a technique of Jansson et al. [10], to further reduce the algorithm s time complexity, without increasing its space complexity. We conclude with Section 4.2, where this technique is further improved by step encoding. Due to space constraints, some figures, pseudocode and proofs are omitted, and may be found in the supporting on-line material document at (TODO:link). 2 Preliminaries An RNA sequence is a sequence over the alphabet {A, C, G, U}. Each letter in an RNA sequence is also called a base or a nucleotide. The bases A and U are called complementary bases, and so are the bases C and G 3. For a base σ {A, C, G, U}, denote by σ the complementary base of σ. Fix henceforth an RNA sequence S = s 1 s 2 s n. Denote by S i,j the subsequence s i s j of S, where S i,i 1 is defined to be the empty sequence. Definition 1 (RNA folding). A folding F of a subsequence S i,j is a set of index pairs that satisfies the following: 1. For every (k, l) F, i k < l j, and s l = s k. 2. There are no (k, l), (k, l ) F, such that k k l l. A pair (k, l) F is called a base-pair. Denote by F the size of a folding F, i.e. the number of base-pairs in F. Say that index k is paired in a folding F if k appears in a base-pair in F (either as the first or the second base in the pair), otherwise k is unpaired in F. The single strand base-pairing maximization problem was first addressed in [14]. The formal problem definition is given below. Problem 1 (Single Strand Base-Paring Maximization). Compute the maximum size of a folding of the instance sequence S. Definition 2 (branch point). Let F be a folding of some subsequence S i,j. An index q is a branch point with respect to F if for all (k, l) F, either l < q or k q. 3 For the sake of clarity, we disregard in this work the possible wobble pairing between G and U. All presented results may be easily extended to include G U pairing as well.

5 We distinguish between two kinds of foldings. For a folding F of S i,j, say that 1. F is co-terminus if F includes the base-pair (i, j). 2. F is partitionable if F does not include the base-pair (i, j). Note that for j > i, F is partitionable if and only if F has a branch point i < q j. Definition 3. For a subsequence S i,j, denote: 1. L (i, j) is the maximum size of a folding of S i,j. 2. L p (i, j) is the maximum size of a partitionable folding of S i,j. 3. L c (i, j) is the maximum size of a co-terminus folding of S i,j, or if there is no such folding (if j i or s j s i ). Call a folding F of S i,j for which F = L (i, j), an optimal folding of S i,j. In the rest of this paper, we will use L instead of L (1, n) whenever the context is clear. 3 RNA Folding via base-pairing maximization In this section we describe the recursive solution for the single RNA strand folding base-pair maximization problem, and some of its characteristics. We then present a technique for reducing its space complexity. Note that this technique easily extends to the single-strand RNA folding algorithms that are based on a thermodynamic model [20, 8, 19]. Finally, we show how to extend the spacereduction technique and apply it to the simultaneous alignment with folding algorithms [15, 12]. 3.1 A recursive solution For a subsequence S i,j such that j i, the only possible folding is the empty folding, and therefore L (i, j) = 0. The following equations show how to recursively compute L (i, j) when j > i: L (i, j) = max {L p (i, j), L c (i, j)}. (3.1) { L c L (Si+1,j 1 ) + 1, s (i, j) = j = s i, (3.2), s j s i. L p (i, j) = max i<q j {L (S i,q 1) + L (S q,j )}. (3.3) Note that the time complexity bottleneck in algorithms which implement the recursive computation of Equations 3.1 to 3.3 is due to the consideration of O(n) branch points q in the computation of L p (i, j), according to Equation 3.3. In the rest of this section, as well as in Section 4, we describe techniques that reduce the number of branch points that need to be examined in this computation, and thus improve the time complexity of such algorithms. Due to Equations 3.1 and 3.3, the following (inverse) triangle inequality is sustained in the base-paring maximization problem:

6 Observation 1 (triangle inequality) For every subsequence S i,j and for every i < q j, L (i, j) L (i, q 1) + L (q, j). Based on the triangle inequality, Wexler et al. [17] observed that it is sufficient to examine only a subset of the branch points in order to compute L p (i, j). We present here a slightly different notation for the same concept. Definition 4 (OCT). A subsequence S i,j is optimally co-terminus (OCT) if i = j, or if every optimal folding of S i,j is co-terminus (that is, if L (i, j) = L c (i, j) > L p (i, j)). Call an index q for which L p (i, j) = L (i, q 1) + L (q, j) an optimal branch point with respect to S i,j. Lemma 1 (Wexler et al. [17]). For every subsequence S i,j, there is an optimal branch point q with respect to S i,j such that S q,j is an OCT. For completeness, we give a proof for Lemma 1 with the notation of the current work. Proof. Let q be the maximum index such that L p (i, j) = L (i, q 1) + L (q, j). If q = j, then S q,j = S j,j is an OCT by definition. Otherwise, let q be any index in the range q < q j. From the selection of q, L (i, q 1) + L (q, j) = L p (i, j) > L (i, q 1)+L (q, j) Obs. 1 L (i, q 1)+L (q, q 1)+L (q, j). Therefore, L (q, j) > L (q, q 1) + L (q, n) for every q < q j, and L p (q, j) max L (q, q<q j q 1) + L (q, j) < L (q, j). Thus, S q,j is an OCT. Define the following subset of branch points with respect to S i,j : Q i,j = {i < q j : S q,j is an OCT}. Eqn. 3.3 = The following equation restates Equation 3.3, based on Lemma 1, by restricting the branch points considered by the maximization term to those in Q i,j. L p (i, j) = max q Q i,j {L (i, q 1) + L (q, j)}. (3.4) We define the following sparsity measure of RNA sequences. Definition 5. For a subsequence S i,j, Z(i, j) is the number of subsequences of S i,j which are OCTs. In the rest of this paper, we use Z instead of Z(1, n) whenever the context is clear. In the sparse case, only a small portion of the O(n 2 ) subsequences of S are OCTs. In Section 4.1 we show that, in the maximum base-pair variant of RNA single-strand folding, Z is bounded by n(l + 1). For the minimum free energy problem variant, an estimation of the expected value of a parameter related to Z, based on a probabilistic model for polymer folding and measured by simulations, which shows that that Z is significantly smaller than O(n 2 ), can be found in [17].

7 Previous algorithms for the base-pairing maximization problem were presented by Nussinov and Jacobson [14] and by Wexler et al. [17] 4. Both algorithms are dynamic programming algorithms that perform a bottom-up computation of the recurrence described in this section, where the Nussinov-Jacobson algorithm uses Equation 3.3 for the computation of L p (i, j), and the Wexler et al. algorithm improves it by using Equation 3.4. These algorithms compute the upper triangle of a table M n n, where each cell M[i, j] stores the value L (i, j). The entries of M are traversed in an order which guarantees that all values that are needed for the computation of M[i, j] = L (i, j), according to the recurrence formula, are computed and stored in M prior to the computation of M[i, j]. Upon termination, M[1, n] holds the value L. The time complexity of the algorithm by Nussinov and Jacobson is O(n 3 ), whereas that of the algorithm by Wexler et al. is O(nZ). Both algorithms use O(n 2 ) space. 3.2 A space efficient algorithm Our space reduction strategy is based on the observation that some of the values stored by the algorithm of Wexler et al. [17] are not necessary throughout the complete run of the algorithm. In the following lemma we characterize the values that need to be maintained in memory for the computation of L (i, j). Lemma 2. For a subsequence S i,j, it is possible to compute L (i, j) by examining only those values L (a, b), where i a < b j and b a < j i, which sustain that either a = i, a = i + 1, or S a,b is an OCT. Proof. Immediate from Equations 3.1, 3.2 and 3.4. Consider a dynamic programming algorithm which fills the table M by traversing its entries row by row from bottom to top, and each row from left to right. Lemma 2 implies that at the stage where M[i, j] is computed, it is sufficient to keep only the values in the currently computed i-th row, the values in the recently computed (i + 1)-th row, and values in entries which correspond to OCT subsequences of S. Thus, there is no need to maintain the complete table M in memory, rather, at each stage, entries which are guarantied not to be further examined by the algorithm may be discarded. This yields a total space complexity of O(n+Z) = O(Z). Note that the computation of each entry M[i, j] requires O( Q i,j ) operations, due to the consideration of the branch point set Q i,j (these sets are maintained as lists in order to allow an efficient traversal, n 1 n n 1 n n 1 as explained in [17]). Since Q i,j Q 1,j Z < nz, the i=1 j=i+1 i=1 j=i+1 running time of the algorithm is O(nZ). Fig. 1 illustrates a run of the above described algorithm, and Alg. 1 in the supporting on-line material gives the pseudo code. Its time and space complexities are summarized in the following lemma. 4 [17] deals with the more realistic energy minimization variant of the problem. For clarity, we project their notions on the simpler base-paring maximization variant discussed here. i=1

8 Lemma 3. Given an RNA sequence S of length n, there is an algorithm which computes L (1, n) in O(nZ) time and O(Z) space. Fig. 1. An exemplification of the space reduced single strand folding algorithm. The left figure shows the complete table M with respect to the RNA sequence S = ACAGUUGCA. Grayed entries correspond to OCT subsequences. The figure in the middle presents a snapshot of the entries maintained by the algorithm at the stage where M[2, 7] is computed. At this stage, the algorithm examines the sum of M[2, 6] and M[7, 7] to compute L p (2, 7) (note that Q 2,7 = {7}), and M[3, 6] + 1 to compute L c (2, 7) (since s 7 = s 2). Since L c (2, 7) = 2 > 1 = L p (2, 7), the subsequence S 2,7 is marked as an OCT. The right figure shows the computation of the next entry, M[2, 8]. Here, the evaluation of L p (2, 8) examines the sum of M[2, 6] and M[7, 8], and the sum of M[2, 7] and M[8, 8] (since Q 2,8 = {7, 8}). L c (2, 8) is not evaluated since s 8 s Folding reconstruction In addition to computing the optimal folding score of a given RNA sequence, it is often of interest to report at least one optimal folding, or a set of optimal or suboptimal foldings of the input sequence. Some well known standard techniques for reporting an optimal folding apply trace-back procedures over the folding score matrix M, in O(n 2 ) time [4]. In this section we show how to reconstruct one optimal folding, without exceeding the time and space complexities of our folding algorithm. Note that this is a challenging task, as the classical trace-back algorithm requires the availability of the full table M, while our algorithm stores only those entries of M which correspond to OCT subsequences. Assume that the full table M is given, with annotated OCT subsequences. The basic recursive folding trace-back reconstruction algorithm [4] could be modified as follows to utilize the OCT subsequences in the computation of the traceback of S i,j. 1. For j i, the only (optimal) folding of S i,j is the empty folding, and the algorithm halts without reporting any base-pair. 2. For j > i, if S i,j is an OCT, then every optimal folding of S i,j includes the pair (i, j), and therefore the algorithm reports the pair (i, j) and is called recursively on the subsequence S i+1,j 1.

9 3. Otherwise, S i,j is partitionable, and therefore the algorithm finds an index q Q i,j for which M[i, j] = M[i, q 1] + M[q, j] and then continues by recursively computing an optimal folding of S i,q 1 and of S q,j. An optimal folding of S i,q 1 is obtained by calling the trace-back algorithm recursively with the sub-instance S i,q 1. As for computing an optimal folding of S q,j, note that S q,j is an OCT, and consider the two cases, where either q = j or q < j. If q = j, then there is no need for another recursive call. Otherwise q < j, and an optimal folding of S q,j is obtained by first reporting the basepair (q, j) and then calling the trace-back algorithm recursively with the sub-instance S q+1,j 1. Time complexity analysis of the trace-back algorithm on the full table M. Observe that, when calling the above algorithm to compute the folding traceback of S i,j, recursive calls with three different subsequences could be initiated at the top level: S i 1,j 1, S q+1,j 1 and S i,q 1, and thus index j is eliminated from further consideration as an end index. Therefore, each recursive call is performed with a different end index j, and altogether there are at most n recursive calls in the whole computation. For a recursive call in which the end index is j, at most O ( Q 1,j ) operations are preformed in finding a q Q i,j for which M[i, j] = M[i, q 1] + M[q, j]. Since Q 1,j Z, the total running time is 1 j n O(Z). We next turn to address the challenge of reconstructing an optimal folding from the sparse table M computed in Section 3.2. The above described algorithm cannot be applied directly in this case, due to the fact that when the algorithm needs to find q Q i,j for which M[i, j] = M[i, q 1] + M[q, j], the values M[i, j] and M[i, q 1] may have been discarded from memory (while M[q, j] is maintained in memory since S q,j is an OCT). In order to overcome this difficulty we adopt a similar approach as of the algorithm of Hirschberg [6], namely performing on-demand value re-computations of discarded entries. Thus, it remains to show how to recover such deleted entries. Lemma 4. Given the sparse table M that contains optimal folding cardinalities for OCT subsequences, there is an algorithm which recovers the set of entries M[i, i + 1], M[i, i + 2],... M[i, j], for a pair of given indices i and j, in O(Z) time. Proof. The entries of the form M[i, j ] which have been discarded from memory correspond to partitionable subsequences, where L (i, j ) = L p (i, j ), and thus may be recomputed based solely on Equation 3.4. Observe that this computation examines only entries of the form M[i, q] for q < j, and M[q, j ] for OCT subsequences S q,j. Re-computing the entries of the ith row from left to right guaranties that upon computing M[i, j ], all necessary values for the computation of L p (i, j ) are already stored in M. For each i < j j, there are O( Q 1,j ) operations performed along this computation, due to the consideration of branch points in the set Q i,j. As before, summing this expression over all i < j j accumulates to O(Z).

10 We next show that, throughout the full run of the algorithm, the process of restoring row entries is applied to O(L) distinct start indices. Consider the case where the trace-back algorithm is applied on S i,j and assume that the set of entries M[i, i + 1], M[i, i + 2],... M[i, j] was already previously restored. Note that a recursive call with a sub-instance of the form S i,q 1 does not require the restoration of the entries M[i, i + 1], M[i, i + 2],... M[i, q 1], as (by the assumption) they have already been restored and are maintained in M. The other two possible recursive calls with sub-instances of the form S i+1,j 1 or S q+1,j 1, do require re-computation of entries in M (in rows i+1 or q+1, correspondingly). However, observe that each call of the latter kind is preceded by a detection of a base-pair. Since throughout the full run of the algorithm only L base pairs are detected, we get that the row entry recovery only needs to be executed L times (in addition to the recovery of M[1, 1], M[1, 3],... M[1, n] during initialization). Thus, according to Lemma 4, the entry value recovery contributes an additional O(LZ) factor to the total time complexity of the trace-back algorithm. Furthermore, note that upon performing such a re-computation of an entry set, there is no need to further maintain the values in M[i, i + 1], M[i, i + 2],... M[i, j] in the case where S i,j is co-terminus, nor to keep the values in M[i, q], M[i, q + 1],... M[i, j] in the case where S i,j is partitionable. This allows to discard these values from memory before the re-computation of the entry set for the corresponding sub-instance, guaranteeing that at each stage, at most n recovered entries are maintained in the sparse table M, in addition to the already existing OCT corresponding entries. Therefore, the space complexity of the trace-back algorithm remains O(Z + n) = O(Z). Alg. 2 in the supporting on-line material implements the efficient trace-back scheme. Lemma 5. There is an algorithm which, given the sparse table M that contains folding scores for all OCT subsequence of S, computes an optimal folding of S in O(LZ) time and O(Z) space. 3.4 Extending the space reduction to Simultaneous Alignment with Folding The goal of the SAF problem is to find a multiple sequence alignment and a common folding of the aligned sequences, which optimizes some score function. For simplicity, we assume the problem instance consists of two sequences. Similarly to single RNA strand folding algorithms, the basic dynamic programming algorithm for the SAF problem [15] computes the scores for all sub-instances of its input instance, and then combines these values to resolve the score of the full input instance. Given an instance of the problem - a pair of RNA sequences S and T, the algorithm maintains the scores of sub-instances (S i,j, T i,j ) in a fourdimensional table N (see Fig. 2). For S = n and T = m, we depict N as an n n super table, in which each entry N i,j corresponds to an internal table of size m m, where the combined alignment-with-folding score of the sub-instance (S i,j, T i,j ) is stored in the entry N i,j[i, j ]. The time-complexity of the basic

11 dynamic programming algorithm for the SAF problem is dictated by the need to compute all O(n 2 m 2 ) sub-instances, where each such computation involves the consideration of a set of O(nm) competing branch point index pairs (i.e. all (q, q ) such that i < q j and i < q j ). This yields a total time complexity of O(n 3 m 3 ). Recently, [12] extended the approach of [17] and applied it to speed up SAF by reducing the number of branch points that need to be considered in the main recursion for the SAF score computation. Similarly to the concept of OCT sequences, it is possible to define OCT-aligned sequence pairs, where the pair (S i,j, T i,j ) is OCT-aligned if, in every optimal alignment-with-folding of (S i,j, T i,j ) the bases s i and t i are aligned to each other, the bases s j and t j are aligned to each other, and the common folding is co-terminus. Using this formulation to describe the results of [12], it was shown that it is sufficient to examine branch point pairs (q, q ) such that the sequences S q,j and T q,j are OCT-aligned, thus reducing the number of examined branch points and improving the running time of the algorithm. This extension yields an O(nm Z) time and O(n 2 m 2 ) space algorithm for the SAF problem, where Z is the number of OCT-aligned sub-instances, and nm Z n 2 m 2 (in practice, Z is expected to be significantly smaller than O(n 2 m 2 ) [12]). Applying an observation similar to Observation 2, an algorithm is suggested here which, upon the computation of entry N i,j [i, j ], queries only those entries which correspond to OCT-aligned sub-instances, in addition to entries in rows i and i+1 of the super table N. The space complexity of SAF is thus reduced to O(nm 2 + Z) (w.l.o.g. m n). In the extended version of this paper we describe in detail how to extend the space-reduction technique described in Section 3.2 to the four-dimensional matrix computed by the SAF algorithm [12]. An intuitive explanation can be found in Fig. 2. Lemma 6. There is an algorithm which computes the simultaneous alignment with folding of two RNA sequences S and T in O(nm Z) time and O(nm 2 + Z) space, where n = S, m = T, and w.l.o.g. m n. 4 Utilizing step characterization In this section we show how to take advantage of a step characterization of the single strand base-pairing maximization problem in order to improve the running time of the algorithms which compute it. Based on this approach, in Section 4.1 we describe an improvement to Alg. 1 which reduces its running time from O(nZ) to O ( n 2 + LZ ), and then in Section 4.2 we further reduce it to O(LZ). Both algorithms have the same space complexity as Alg. 1, which is O(Z). Let S i,j be a subsequence of S, and S i,j another subsequence such that either i = i + 1 and j = j, or i = i and j = j 1. It is straightforward to show that L (i, j ) L (i, j) L (i, j ) + 1. Therefore, we get the following observation:

12 Fig. 2. A schematic illustration of the data structure maintained by the dynamic programming algorithm for the SAF problem. In order to compute the entry N i,j[i, j ], the algorithm needs to examine values in entries marked with club symbols. These entries include entries in the ith and (i + 1)th rows of the super table (the entries N i,j[i, j 1], N i,j 1[i, j ], N i,j[i + 1, j ], N i+1,j[i, j ], N i+1,j 1[i + 1, j 1], and entries of the form N i,q 1[i, q 1]), as well as entries of the from N q,j[q, j ] which correspond to OCT-aligned sub-instances (S q,j, T q,j ) (illustrated by grayed entries). The algorithm maintains only entries in rows i and (i + 1) in the super table (marked with dot symbols) and entries which correspond to OCT-aligned sub-instances. All entries as from the (i + 2)th row in the super table which do not correspond to OCTaligned sub-instances are discarded from memory. Observation 2 For every 1 k n, the sequence L (S k,k ), L (S k,k+1 ),..., L (S k,n ), as well as the sequence L (S k,k ), L (S k 1,k ),..., L (S 1,k ) are monotonically nondecreasing with unit steps in the range 0 L. The above observation implies a bound on Z, as follows. Lemma 7. The value of Z satisfies n Z n(l + 1). Proof. Every OCT subsequence S i,j satisfies that either i = j, or L (i, j) > L (i + 1, j). Hence, according to Observation 2, there are at most L+1 OCT subsequences that end with a given index j, and at therefore there are most n(l+1) OCT subsequences of S. 4.1 An O(n 2 + LZ) Algorithm Similarly to the previously presented technique for restricting the set of examined branch points in the computation of L p (i, j), we next show another dominance relation which can be utilized to further constrain the set of branch points examined in Equation 3.3.

13 Definition 6 (step sequence). Call a subsequence S i,j a step sequence if in every optimal folding of S i,j the base i is paired. Observe that S i,j is a step sequence if and only if q = i+1 is not a branch point in any of the optimal foldings of S i,j, i.e. L(i, j) > L(i, i)+l(i+1, j) = L(i+1, j). Also note that any OCT subsequence of length greater than 1 is a step sequence, though the opposite is not necessarily true. In the following Lemma we further restrict the branch points which need to be examined in a recursive computation of L p (i, j). Lemma 8. For any subsequence S i,j such that j > i, there is an optimal branch point q with respect to S i,j such that either q = i + 1, or S i,q 1 is a step sequence and S q,j is an OCT. Proof. If q = i+1 is an optimal branch point with respect to S, the lemma holds. Otherwise, L p (i, j) > L (i, i) + L (i + 1, j) = L (i + 1, j). According to Lemma 1, there is an optimal branch point i + 1 < q j such that S q,j is an OCT. Therefore, L (i, q 1)+L (q, j) = L p (i, j) > L (i + 1, j) Obs. 1 L (i + 1, q 1)+L (q, j). It follows that L (i, q 1) > L (i + 1, q 1), hence S i,q 1 is a step sequence. Define the following subset of branch points with respect to S i,j : P i,j = {i + 1} {i + 1 < q j : S i,q 1 is a step sequence and S q,j is an OCT}. The following equation restates Equation 3.4, based on Lemma 8. L p (i, j) = max q P i,j {L (i, q 1) + L (q, j)}. (4.1) We next show a bottom-up algorithm that computes L according to Equations 3.1, 3.2, and 4.1. The presented algorithm is similar to Alg. 1, where a forward dynamic programming technique is applied in order to efficiently compute L p (i, j) (Forward dynamic programming was also applied by Jansson et al. [10] to a related problem). The new algorithm also scans and computes the entries of M in decreasing row index and increasing column index. It maintains the following invariant: upon reaching entry M[i, j], the entry contains the value L p (i, j). Before computing row i in M, the entries M[i, i 1] and M[i, i] are initialized with zeros, and all entries M[i, j] for i < j n are initialized with the corresponding values M[i + 1, j]. This initialization is equivalent to examining the branch point q = i + 1 in the computation of L p (i, j) according to equation 4.1 for all j > i (the branching at q = i+1 is handled separately from other branch points in P i,j since it does not follow the step sequence-prefix-oct-suffix rule as the rest of the group). Note that in this stage the invariant is sustained for the first entry in the row which is traversed by the algorithm - M[i, i + 1], since P i,i+1 = {i + 1}. Based on the invariant, upon reaching M[i, j], the entry contains the value L p (i, j), and the value L (i, j) can be computed by resolving the maximum

14 between the current entry value and the value of L c (i, j), which is obtained from Equation 3.2. If L c (i, j) > L p (i, j), S i,j is classified as an OCT. Then, if M[i, j] > M[i + 1, j], S i,j is classified as a step sequence, and the branch point q = j + 1 is considered and forward-reflected to the computation of L p (i, j ), for all j > j such that S j+1,j is an OCT, by updating the value of M[i, j ] to be the maximum among its current value and that of M[i, j] + M[j + 1, j ], thus accumulating the maximum according to Equation 4.1, and guaranteeing the maintenance of the invariant. Alg. 5 in the supporting on-line material implements the forward dynamic programming approach described above, combined with the space-efficient approach described in Section 3.2. An illustration of its run is given in Fig. 3. The speedup obtained by this algorithm is due to the fact that branch points are examined by Equation 4.1 only if both the sequence prefix before the branch point is a step-sequence and its suffix, as from the branch point on, is an OCT. Note that, for each one of the Z OCT subsequences S q,j which are examined as suffices by Equation 4.1, Observation 2 shows that there are at most L sequences S i,q 1 which may be corresponding step-sequence prefixes, and thus the total run-time contribution due to computation of values of the form L p (i, j) is O(LZ). Since the table M has O(n 2 ) entries, where for each entry O(1) operations are performed in addition to the operations involved in the computations of L p (i, j), the total running time is O(n 2 + LZ). The space complexity remains O(Z), as the space complexity of Alg. 1. Lemma 9. Given an RNA sequence S of length n, there is an algorithm which computes L (1, n) in O(n 2 + LZ) time and O(Z) space. 4.2 An O(LZ) Algorithm In this section we show how to further reduce the running time of the folding algorithm from O(n 2 + LZ) to O(LZ). We do so by applying a step encoding [7] to M, representing each of its rows by its O(L) steps (see Fig. 4). Hence, in what follows we give corresponding step-encoding formulations, where a typical instance is composed of a suffix S i,n of S and a folding cardinality x, which will be denoted by the pair (S i,n, x). The goal is to compute the minimum index i 1 j n such that there is a folding of S i,j whose cardinality is x. The next definition gives a step-encoding equivalents of the entities L (i, j), L p (i, j), and L c (i, j). Definition 7. For 1 i n, 1 x, and α {ɛ, p, c} (where ɛ denotes the empty word), define β α (i, x) to be the minimum index j such that L α (i, j) x, or if there is no such j. Note the relation between the step-encoding formulation and the standard formulation, where L (i, j) is the maximum x such that β (i, x) j. Say that a sub-instance (S i,n, x) is a β-oct if β (i, x) = β c (i, x) < β p (i, x). In the following definition we give the step-encoding equivalent of P i,j.

15 Fig. 3. An exemplification of Alg. 5. The left figure shows the initialization of row 3 in the table M, copying the values from row 4. The figure in the middle presents the computation of the entry M[3, 5]. Upon reaching this entry, the entry value is 0, which corresponds to L p (3, 5). Since s 5 = s 3, L c (3, 5) = M[4, 4]+1 = 1 is evaluated, the entry value is updated to be 1, and S 3,5 is marked as an OCT. Since M[3, 5] > M[4, 5], S 3,5 is classified as a step sequence, and the branching at q = 6 is considered for all sequences S 3,j such that S 6,j is an OCT. This forward computation updates the values in M[3, 6] and M[3, 9]. The right figure shows the computation of the next entry, M[3, 6]. Here there is no need to evaluate L c (3, 6) (since s 6 s 3), and S 3,6 is also classified as a step sequence (since M[3, 6] > M[4, 6]). Therefore, the branching at q = 7 considered for all sequences S 3,j such that S 7,j is an OCT, updating the values in M[3, 7] and M[3, 8]. Definition 8. { Y i,x = {i + 1} i + 1 < q β (i + 1, x 1) : } S i,q 1 is a step sequence, and. (S q,n, x L(i, q 1)) is a β-oct The auxiliary function defined below will be used in the computation of β c (i, x). Definition 9. For σ {A, C, G, U} and 1 r n, define NextMatch(r, σ) to be the minimum index r > r such that s r = σ, or if there is no such index r. We now convert Equations 3.1, 3.2 and 4.1 to their equivalent forms in the step encoding. For all 1 i n and 1 x: β (i, x) = min {β c (i, x), β p (i, x)}. (4.2) β c (i, x) = NextMatch (β (i + 1, x 1), s i ). (4.3) { } β p (i, x) = min min {β (q, x L(i, q 1))}, β c (i, x) + 1. (4.4) q Y i,x Intuitively, the computation of L (i, j) can be viewed as a competition between different foldings of the fixed sequence S i,j, in order to resolve the maximal folding cardinality, while the computation of β (i, x) can be viewed as a competition between different foldings of a fixed cardinality x and which correspond

16 to subsequences that start at index i, in order to resolve the minimal subsequence end index. Formal proofs of the correctness of Equations 4.2 to 4.4, as well as the pseudocode of an algorithm that implements them, are included in the supporting on-line material. This algorithm, denoted Alg. 6, adopts a forward dynamic programming approach, similarly to that of Alg. 5. This allows for efficient computation of Equation 4.4, where the number of sub-instances, as well as the dimensions of the data structure which stores solutions for these sub-instances, is O(Ln) (instead of O(n 2 )). Lemma 10. Given an RNA sequence S of length n, there is an algorithm which computes L (1, n) in O(LZ) time and O(Z) space. Fig. 4. An exemplification of Alg. 6. This figure shows the table B with respect to the RNA sequence S = ACAGUUGCA, which is the step encoding of the table M in Fig. 1. The left plot shows the complete table, were grayed entries correspond to β-oct subinstances. The plot at the middle presents a snapshot of the entries maintained by the algorithm at the stage where row 3 is about to be computed, after the initialization. The right plot present the computation of the entry B[3, 1]. The initial entry value 8, which corresponds to β p (3, 1), is replaced with the value NextMatch(B[4, 0], s 3) = NextMatch(3, U) = 5, and the sub-instance (S 3,n, 1) is marked as a β-oct. Then, all branch points q such the B[3, 1] = 5 q 1 < 8 = B[4, 1] are examined for the computation of β p (3, x) for all x such that the sub-instances (S q,n, x L(3, q)) = (S q,n, x 1) are β-octs. This computation updates the values in the entries B[3, 2] and B[3, 3].

17 References 1. Tatsuya Akutsu. Approximation and exact algorithms for RNA secondary structure prediction and recognition of stochastic context-free languages. Journal of Combinatorial Optimization, 3: , T. M. Chan. More algorithms for all-pairs shortest paths in weighted graphs. In Proc. 39th Symposium on the Theory of Computing (STOC), pages , Athanasius F. Bompfünewerer Consortium, Rolf Backofen, Stephan H. Bernhart, Christoph Flamm, Claudia Fried, Guido Fritzsch, Jorg Hackermuller, Jana Hertel, Ivo L. Hofacker, Kristin Missal, Axel Mosig, Sonja J. Prohaska, Dominic Rose, Peter F. Stadler, Andrea Tanzer, Stefan Washietl, and Sebastian Will. RNAs everywhere: genome-wide annotation of structured RNAs. J Exp Zoolog B Mol Dev Evol, 308(1):1 25, Eddy S. Krogh-A. Durbin, R. and G. Mitchison. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, Paul P. Gardner and Robert Giegerich. A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics, 5:140, D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Commun. ACM, 18(6): , D.S. Hirschberg. Algorithms for the longest common subsequence problem. JACM, 24: , Ivo L. Hofacker. Vienna RNA secondary structure server. Nucleic Acids Research, (13): , Tinoco I., Borer PN., Dengler B., Levine MD., Uhlenbeck OC., Crothers DM., and Gralla J. Improved estimation of secondary structure in ribonucleic acids. Nature New Biology, 246:40 41, Jesper Jansson, See-Kiong Ng, Wing-Kin Sung, and Hugo Willy. A faster and more space-efficient algorithm for inferring arc-annotations of RNA sequences through alignment. Algorithmica, 46(2): , Havgaard JH., Lyngso RB., Stormo GD., and Gorodkin J. Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics, 21(9): , Y. Wexler M. Ziv-Ukelson, I. Gat-Viks and R. Shamir. A faster algorithm for RNA co-folding. Proc. Algorithms in Bioinformatics, 8th International Workshop, WABI, LNBI 5251, pages , September David H. Mathews and Douglas H. Turner. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. Journal of Molecular Biology, 317(2): , Ruth Nussinov and Ann B. Jacobson. Fast algorithm for predicting the secondary structure of single-stranded RNA. PNAS, 77(11): , David Sankoff. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM Journal on Applied Mathematics, 45(5): , M.S. Waterman and T.F. Smith. RNA secondary structure: a complete mathematical analysis. Math. Biosci., 42: , Ydo Wexler, Chaya Zilberstein, and Michal Ziv-Ukelson. A study of accessible motifs and RNA folding complexity. Journal of Computational Biology, 14(6): , Sebastian Will, Kristin Reiche, Ivo L. Hofacker, Peter F. Stadler, and Rolf Backofen. Inferring non-coding RNA families and classes by means of genome-scale structure-based clustering. PLOS Computational Biology, 3(4):e65, 2007.

18 19. M. Zuker. Computer prediction of RNA structure. Methods Enzymol., 180: , M. Zuker. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res., (13): , M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 9(1): , 1981.

19 Supporting on-line material: omitted figures, pseudo-code and proofs = L (i, j) i L p j i (i, j) L c j (i, j) = L (i, j) L p (i, j) L c (i, j) i j i j i L p (i, j) j = max q i i+1 max Q i, j i+1 j OCT i q q+1 j i L p (i, j) j = max i i+1 j max q P i, j i+1 i STEP OCT q q+1 j i L c (i, j) j = max i i+1 0 j 1 j i L c (i, j) j = max i i+1 0 j 1 j x x x i = β(i, x) i β p (i, x) i β c (i, x) x x i β c (i, x) +1 i β p (i, x) = min L(i, q 1 ) = x x x STEP min OCT q Y i, x i q 1q β (q, x x ) x i β c (i, x) = k where i i+1 j k = NextMatch( j, s i) j = β(i + 1, x 1) x 1 i+1 β(i + 1, x 1) Fig. 5. A schematic illustration of the different recursion formulas for the single strand base-pairing maximization problem. The left plot illustrates the recursion of Equations 3.1, 3.2, and 3.4, the middle plot illustrates the recursion of Equations 3.1, 3.2, and 4.1, and the right plot illustrates the recursion of Equations 4.2, 4.3, and 4.4.

20 Algorithm 1 space efficient RNA folding Input: An RNA sequence S = s 1s 2 s n. Output: L (1, n). 1: for i n down to 1 do 2: set M[i, i 1] 0, M[i, i] 0. 3: mark S i,i as an OCT. 4: for j i + 1 to n do 5: M[i, j] max q Q i,j {M[i, q 1] + M[q, j]}. // assigning L p (i, j) to M[i, j] 6: if s j = s i and M[i, j] < M[i + 1, j 1] + 1 then 7: M[i, j] M[i + 1, j 1] + 1. // assigning L c (i, j) to M[i, j] 8: mark S i,j as an OCT. 9: end if 10: end for 11: discard from memory the values in all entries in row i + 1 of M which do not correspond to OCT subsequences. 12: end for 13: return M[1, n]. Algorithm 2 Folding-Traceback (M) Input: A sparse table M which contains folding scores of all OCT subsequences of an RNA sequence S. Output: An optimal folding of S. 1: call Restore-entries (M, 1, n). 2: call Rec-Folding-Traceback (M, 1, n).

21 Algorithm 3 Rec-Folding-Traceback (M, i, j) Input: A sparse table M which contains folding scores of all OCT subsequences of an RNA sequence S, as well as folding scores for the subsequences S i,i+1, S i,i+2,... S i,j for the two given internal indices i and j. Output: An optimal folding of S i,j. 1: if j i then 2: halt. 3: else if S i,j is an OCT then 4: output the pair (i, j). 5: discard from memory the values in all non-oct entries M[i, i + 1], M[i, i + 2],..., M[i, j]. 6: call Restore-entries (M, i + 1, j 1). 7: call Rec-Folding-Traceback (M, i + 1, j 1). 8: else 9: find q Q i,j s.t. M[i, j] = M[i, q 1] + M[q, j]. 10: discard from memory the values in all non-oct entries M[i, q], M[i, q + 1],..., M[i, j]. 11: if q < j then 12: output the pair (q, j). 13: call Restore-entries (M, q + 1, j 1). 14: call Rec-Folding-Traceback (M, q + 1, j). 15: end if 16: call Rec-Folding-Traceback (M, i, q 1). 17: end if Algorithm 4 Restore-entries (M, i, j) Input: A sparse table M which contains folding scores of all OCT subsequences of an RNA sequence S, and two internal indices i and j. Output: The table M after restoring values in the entries M[i, i + 1], M[i, i + 2],... M[i, j]. 1: for j = i + 1 to j do 2: if M[i, j ] is discarded then 3: set M[i, j ] max M[i, q 1] + M[q, j ]. q Q i,j 4: end if 5: end for

22 Algorithm 5 forward RNA folding Input: An RNA sequence S = s 1s 2 s n. Output: L (1, n). 1: for i n down to 1 do 2: set M[i, i 1] 0, M[i, i] 0, and M[i, j] M[i + 1, j] for all i < j n. 3: for j i + 1 to n do 4: if s j = s i and M[i, j] < M[i + 1, j 1] + 1 then 5: M[i, j] M[i + 1, j 1] : mark S i,j as an OCT. 7: end if 8: if S i,j is a step sequence then 9: for all j s.t. S j+1,j is an OCT do 10: M[i, j ] max {M[i, j ], M[i, j] + M[j + 1, j ]}. 11: end for 12: end if 13: end for 14: discard from memory the values in all entries in row i + 1 of M which do not correspond to OCT subsequences. 15: end for 16: return M[1, n]. Algorithm 6 step encoded RNA folding Input: An RNA sequence S = s 1s 2 s n. Output: L (1, n). 1: set B[n, 0] n 1. 2: for i n 1 down to 1 do 3: set B[i, 0] i 1, B[i, L (i + 1, n) + 1]. 4: set B[i, x] B[i + 1, x] for all 1 x L (i + 1, n). 5: for x 1 to L (i + 1, n) + 1 do 6: if B[i, x] > NextMatch(B[i + 1, x 1], s i) then 7: B[i, x] NextMatch(B[i + 1, x 1], s i) 8: mark (S i,n, x) as a β-oct. 9: end if 10: for all index q s.t. B[i, x] q 1 < B[i + 1, x] do 11: for all x s.t. (S q,n, x ) is a β-oct do 12: B[i, x + x ] min {B[i, x + x ], B[q, x ]} 13: end for 14: end for 15: end for 16: discard from memory the values in all entries in row i + 1 of B which do not correspond to β-octsub-instances. 17: end for 18: return L (1, n) - the maximum x such that B[1, x] <.

23 Proof (Proof of correctness for Equation 4.2.). The correctness of Equation 4.2 is immediate from Definition 7. Proof (Proof of correctness for Equation 4.3.). First, we show that β c (i, x) NextMatch(β (i + 1, x 1), s i ). Let F be a co-terminus folding of S i,βc (i,x) of size x, and consider the folding F which is obtained by removing the base-pair (i, β c (i, x)) from F. Then, F is a folding of S i+1,β c (i,x) 1 of size x 1, and in particular β (i + 1, x 1) β c (i, x) 1. Since s β c (i,x) = s i, it follows that β c (i, x) = NextMatch(β c (i, x) 1, s i ) NextMatch(β (i + 1, x 1), s i ). Next, we show that β c (i, x) NextMatch(β (i + 1, x 1), s i ). Any folding which is obtained by adding the base-pair (i, NextMatch(β (i + 1, x 1), s i )) to an optimal folding of S i+1,β(i+1,x 1) is a co-terminus folding of size x for the sequence S i,nextmatch(β(i+1,x 1),si), and from the definition of β c (i, x) we have that β c (i, x) NextMatch(β (i + 1, x 1), s i ). Thus, β c (i, x) = NextMatch(β (i + 1, x 1), s i ). Before proving the correctness of Equation 4.4, we first show a property of β p (i, x). Claim 1 Let 1 i n 1 and let 1 x such that β p (i, x) n. 1. For every i < q β (i + 1, x 1), β p (i, x) β (q, x L (i, q 1)). 2. For an optimal branch point q with respect to S i,βp (i,x), β p (i, x) β (q, x L (i, q 1)). Proof. We prove separately the two parts of the claim. 1. Let i < q β (i + 1, x 1). Note that L (i, q 1) x 1 (since L (i, q 1) L (i + 1, q 1)+1 L (i + 1, β (i + 1, x 1) 1)+1 = x 1), and therefore x L (i, q 1) 1. If β (q, x L (i, q 1)) = n + 1, then clearly β p (i, x) β (q, x L (i, q 1)). Otherwise, consider a folding F of S i,β(q,x L(i,q 1)) which is obtained from the concatenation of optimal foldings for S i,q 1 and S q,β(q,x L(i,q 1)). Then, F = L (i, q 1) + L (q, β (q, x L (i, q 1))) = L (i, q 1)+x L (i, q 1) = x. Thus, F is a partitionable folding of size x for S i,β(q,x L(i,q 1)), and from the definition of β p (i, x) we have that β p (i, x) β (q, x L (i, q 1)). 2. Let F be a partitionable folding of size x for S i,β p (i,x), and q be a branch point in F. Then, x = F L (i, q 1) + L (q, β p (i, x)), and therefore, L (q, β p (i, x)) x L (i, q 1). It follows that there is a folding of size x L (i, q 1) to S q,β p (i,x), therefore β p (i, x) β (q, x L (i, q 1)). Proof (Proof of correctness for Equation 4.4). First, it is straightforwards to observe that that if (S i,n, x) is a β-oct, then β p (i, x) = β c (i, x) + 1. In addition, from Claim 1, for { every q Y i,x, β p (i, x) β (q, x L (i, } q 1)) and therefore β p (i, x) = min min {β (q, x L (i, q 1))}, β c (i, x) + 1. q Y i,x If (S i,n, x) is not a β-oct, then β p (i, x) { β c (i, x) < β c (i, x) + 1. Due to } Claim 1, in order to prove that β p (i, x) = min min {β (q, x L (i, q 1))}, β c (i, x) + 1, q Y i,x

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution

More information

RNA Folding Algorithms. Michal Ziv-Ukelson Ben Gurion University of the Negev

RNA Folding Algorithms. Michal Ziv-Ukelson Ben Gurion University of the Negev RNA Folding Algorithms Michal Ziv-Ukelson Ben Gurion University of the Negev The RNA Folding Problem: Given an RNA sequence, predict its energetically most stable structure (minimal free energy). AUCCCCGUAUCGAUC

More information

RNA Folding Algorithms. Michal Ziv-Ukelson Ben Gurion University of the Negev

RNA Folding Algorithms. Michal Ziv-Ukelson Ben Gurion University of the Negev RNA Folding Algorithms Michal Ziv-Ukelson Ben Gurion University of the Negev The RNA Folding Problem: Given an RNA sequence, predict its energetically most stable structure (minimal free energy). AUCCCCGUAUCGAUC

More information

Journal of Discrete Algorithms

Journal of Discrete Algorithms Journal of Discrete Algorithms 9 (2011) 2 11 Contents lists available at ScienceDirect Journal of Discrete Algorithms www.elsevier.com/locate/da Fast RNA structure alignment for crossing input structures

More information

REDUCING THE WORST CASE RUNNING TIMES OF A FAMILY OF RNA AND CFG PROBLEMS, USING VALIANT S APPROACH

REDUCING THE WORST CASE RUNNING TIMES OF A FAMILY OF RNA AND CFG PROBLEMS, USING VALIANT S APPROACH REDUCING THE WORST CASE RUNNING TIMES OF A FAMILY OF RNA AND CFG PROBLEMS, USING VALIANT S APPROACH SHAY ZAKOV, DEKEL TSUR, AND MICHAL ZIV-UKELSON Abstract. We study Valiant s classical algorithm for Context

More information

A faster algorithm for RNA co-folding

A faster algorithm for RNA co-folding A faster algorithm for RNA co-folding Michal Ziv-Ukelson 1, Irit Gat-Viks 2, Ydo Wexler 3, and Ron Shamir 4 1 Computer Science Department, Ben Gurion University of the Negev, Beer-Sheva. 2 Computational

More information

Sparse RNA Folding Revisited: Space-Efficient Minimum Free Energy Prediction

Sparse RNA Folding Revisited: Space-Efficient Minimum Free Energy Prediction Sparse RNA Folding Revisited: Space-Efficient Minimum Free Energy Prediction Sebastian Will 1 and Hosna Jabbari 2 1 Bioinformatics/IZBI, University Leipzig, swill@csail.mit.edu 2 Ingenuity Lab, National

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri RNA Structure Prediction Secondary

More information

A Simple, Practical and Complete O( n3

A Simple, Practical and Complete O( n3 A Simple, Practical and Complete O( n3 log n )-Time Algorithm for RNA Folding Using the Four-Russians Speedup Yelena Frid and Dan Gusfield Department of Computer Science, U.C. Davis Abstract. The problem

More information

SA-REPC - Sequence Alignment with a Regular Expression Path Constraint

SA-REPC - Sequence Alignment with a Regular Expression Path Constraint SA-REPC - Sequence Alignment with a Regular Expression Path Constraint Nimrod Milo Tamar Pinhas Michal Ziv-Ukelson Ben-Gurion University of the Negev, Be er Sheva, Israel Graduate Seminar, BGU 2010 Milo,

More information

Sparse RNA folding revisited: space efficient minimum free energy structure prediction

Sparse RNA folding revisited: space efficient minimum free energy structure prediction DOI 10.1186/s13015-016-0071-y Algorithms for Molecular Biology RESEARCH ARTICLE Sparse RNA folding revisited: space efficient minimum free energy structure prediction Sebastian Will 1* and Hosna Jabbari

More information

13 Comparative RNA analysis

13 Comparative RNA analysis 13 Comparative RNA analysis Sources for this lecture: R. Durbin, S. Eddy, A. Krogh und G. Mitchison, Biological sequence analysis, Cambridge, 1998 D.W. Mount. Bioinformatics: Sequences and Genome analysis,

More information

98 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, December 6, 2006

98 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, December 6, 2006 98 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, December 6, 2006 8.3.1 Simple energy minimization Maximizing the number of base pairs as described above does not lead to good structure predictions.

More information

proteins are the basic building blocks and active players in the cell, and

proteins are the basic building blocks and active players in the cell, and 12 RN Secondary Structure Sources for this lecture: R. Durbin, S. Eddy,. Krogh und. Mitchison, Biological sequence analysis, ambridge, 1998 J. Setubal & J. Meidanis, Introduction to computational molecular

More information

Reducing the worst case running times of a family of RNA and CFG problems, using Valiant s approach

Reducing the worst case running times of a family of RNA and CFG problems, using Valiant s approach RESEARCH Open Access Reducing the worst case running times of a family of RNA and CFG problems, using Valiant s approach Shay Zakov, Dekel Tsur and Michal Ziv-Ukelson * Abstract Background: RNA secondary

More information

Structure-Based Comparison of Biomolecules

Structure-Based Comparison of Biomolecules Structure-Based Comparison of Biomolecules Benedikt Christoph Wolters Seminar Bioinformatics Algorithms RWTH AACHEN 07/17/2015 Outline 1 Introduction and Motivation Protein Structure Hierarchy Protein

More information

RNA Secondary Structure Prediction

RNA Secondary Structure Prediction RN Secondary Structure Prediction Perry Hooker S 531: dvanced lgorithms Prof. Mike Rosulek University of Montana December 10, 2010 Introduction Ribonucleic acid (RN) is a macromolecule that is essential

More information

Local Alignment of RNA Sequences with Arbitrary Scoring Schemes

Local Alignment of RNA Sequences with Arbitrary Scoring Schemes Local Alignment of RNA Sequences with Arbitrary Scoring Schemes Rolf Backofen 1, Danny Hermelin 2, ad M. Landau 2,3, and Oren Weimann 4 1 Institute of omputer Science, Albert-Ludwigs niversität Freiburg,

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2011 1 HMM Lecture Notes Dannie Durand and Rose Hoberman October 11th 1 Hidden Markov Models In the last few lectures, we have focussed on three problems

More information

Computational Approaches for determination of Most Probable RNA Secondary Structure Using Different Thermodynamics Parameters

Computational Approaches for determination of Most Probable RNA Secondary Structure Using Different Thermodynamics Parameters Computational Approaches for determination of Most Probable RNA Secondary Structure Using Different Thermodynamics Parameters 1 Binod Kumar, Assistant Professor, Computer Sc. Dept, ISTAR, Vallabh Vidyanagar,

More information

RNA-Strukturvorhersage Strukturelle Bioinformatik WS16/17

RNA-Strukturvorhersage Strukturelle Bioinformatik WS16/17 RNA-Strukturvorhersage Strukturelle Bioinformatik WS16/17 Dr. Stefan Simm, 01.11.2016 simm@bio.uni-frankfurt.de RNA secondary structures a. hairpin loop b. stem c. bulge loop d. interior loop e. multi

More information

Combinatorial approaches to RNA folding Part II: Energy minimization via dynamic programming

Combinatorial approaches to RNA folding Part II: Energy minimization via dynamic programming ombinatorial approaches to RNA folding Part II: Energy minimization via dynamic programming Matthew Macauley Department of Mathematical Sciences lemson niversity http://www.math.clemson.edu/~macaule/ Math

More information

RNA Secondary Structure Prediction

RNA Secondary Structure Prediction RNA Secondary Structure Prediction 1 RNA structure prediction methods Base-Pair Maximization Context-Free Grammar Parsing. Free Energy Methods Covariance Models 2 The Nussinov-Jacobson Algorithm q = 9

More information

Exact Pattern Matching for RNA Structure Ensembles

Exact Pattern Matching for RNA Structure Ensembles Exact Pattern Matching for RN Structure Ensembles Christina Schmiedl 1,, Mathias Möhl 1,, Steffen Heyne 1,, Mika mit 2, Gad M. Landau 2,3, Sebastian Will 1,4,, and Rolf Backofen 1,5, 1 Bioinformatics,

More information

Lecture 2: Pairwise Alignment. CG Ron Shamir

Lecture 2: Pairwise Alignment. CG Ron Shamir Lecture 2: Pairwise Alignment 1 Main source 2 Why compare sequences? Human hexosaminidase A vs Mouse hexosaminidase A 3 www.mathworks.com/.../jan04/bio_genome.html Sequence Alignment עימוד רצפים The problem:

More information

A Simple Linear Space Algorithm for Computing a Longest Common Increasing Subsequence

A Simple Linear Space Algorithm for Computing a Longest Common Increasing Subsequence A Simple Linear Space Algorithm for Computing a Longest Common Increasing Subsequence Danlin Cai, Daxin Zhu, Lei Wang, and Xiaodong Wang Abstract This paper presents a linear space algorithm for finding

More information

RNA secondary structure prediction. Farhat Habib

RNA secondary structure prediction. Farhat Habib RNA secondary structure prediction Farhat Habib RNA RNA is similar to DNA chemically. It is usually only a single strand. T(hyamine) is replaced by U(racil) Some forms of RNA can form secondary structures

More information

CSCI Final Project Report A Parallel Implementation of Viterbi s Decoding Algorithm

CSCI Final Project Report A Parallel Implementation of Viterbi s Decoding Algorithm CSCI 1760 - Final Project Report A Parallel Implementation of Viterbi s Decoding Algorithm Shay Mozes Brown University shay@cs.brown.edu Abstract. This report describes parallel Java implementations of

More information

Two Algorithms for LCS Consecutive Suffix Alignment

Two Algorithms for LCS Consecutive Suffix Alignment Two Algorithms for LCS Consecutive Suffix Alignment Gad M. Landau Eugene Myers Michal Ziv-Ukelson Abstract The problem of aligning two sequences A and to determine their similarity is one of the fundamental

More information

CONTRAfold: RNA Secondary Structure Prediction without Physics-Based Models

CONTRAfold: RNA Secondary Structure Prediction without Physics-Based Models Supplementary Material for CONTRAfold: RNA Secondary Structure Prediction without Physics-Based Models Chuong B Do, Daniel A Woods, and Serafim Batzoglou Stanford University, Stanford, CA 94305, USA, {chuongdo,danwoods,serafim}@csstanfordedu,

More information

Rapid Dynamic Programming Algorithms for RNA Secondary Structure

Rapid Dynamic Programming Algorithms for RNA Secondary Structure ADVANCES IN APPLIED MATHEMATICS 7,455-464 I f Rapid Dynamic Programming Algorithms for RNA Secondary Structure MICHAEL S. WATERMAN* Depurtments of Muthemutics und of Biologicul Sciences, Universitk of

More information

RNA Secondary Structure Prediction: taking conservation into account

RNA Secondary Structure Prediction: taking conservation into account RNA Secondary Structure Prediction: taking conservation into account 1 Assumptions of the RNA secondary structure prediction algorithm, based on MFE: 1. The most likely structure of the RNA molecule is

More information

Combinatorial approaches to RNA folding Part I: Basics

Combinatorial approaches to RNA folding Part I: Basics Combinatorial approaches to RNA folding Part I: Basics Matthew Macauley Department of Mathematical Sciences Clemson University http://www.math.clemson.edu/~macaule/ Math 4500, Spring 2015 M. Macauley (Clemson)

More information

Sequence analysis and Genomics

Sequence analysis and Genomics Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

More information

Optimal spaced seeds for faster approximate string matching

Optimal spaced seeds for faster approximate string matching Optimal spaced seeds for faster approximate string matching Martin Farach-Colton Gad M. Landau S. Cenk Sahinalp Dekel Tsur Abstract Filtering is a standard technique for fast approximate string matching

More information

DANNY BARASH ABSTRACT

DANNY BARASH ABSTRACT JOURNAL OF COMPUTATIONAL BIOLOGY Volume 11, Number 6, 2004 Mary Ann Liebert, Inc. Pp. 1169 1174 Spectral Decomposition for the Search and Analysis of RNA Secondary Structure DANNY BARASH ABSTRACT Scales

More information

Optimal spaced seeds for faster approximate string matching

Optimal spaced seeds for faster approximate string matching Optimal spaced seeds for faster approximate string matching Martin Farach-Colton Gad M. Landau S. Cenk Sahinalp Dekel Tsur Abstract Filtering is a standard technique for fast approximate string matching

More information

Shortest paths with negative lengths

Shortest paths with negative lengths Chapter 8 Shortest paths with negative lengths In this chapter we give a linear-space, nearly linear-time algorithm that, given a directed planar graph G with real positive and negative lengths, but no

More information

Impact Of The Energy Model On The Complexity Of RNA Folding With Pseudoknots

Impact Of The Energy Model On The Complexity Of RNA Folding With Pseudoknots Impact Of The Energy Model On The omplexity Of RN Folding With Pseudoknots Saad Sheikh, Rolf Backofen Yann Ponty, niversity of Florida, ainesville, S lbert Ludwigs niversity, Freiburg, ermany LIX, NRS/Ecole

More information

Predicting RNA Secondary Structure Using Profile Stochastic Context-Free Grammars and Phylogenic Analysis

Predicting RNA Secondary Structure Using Profile Stochastic Context-Free Grammars and Phylogenic Analysis Fang XY, Luo ZG, Wang ZH. Predicting RNA secondary structure using profile stochastic context-free grammars and phylogenic analysis. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 23(4): 582 589 July 2008

More information

Pairwise sequence alignment

Pairwise sequence alignment Department of Evolutionary Biology Example Alignment between very similar human alpha- and beta globins: GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

More information

Linear-Time Algorithms for Finding Tucker Submatrices and Lekkerkerker-Boland Subgraphs

Linear-Time Algorithms for Finding Tucker Submatrices and Lekkerkerker-Boland Subgraphs Linear-Time Algorithms for Finding Tucker Submatrices and Lekkerkerker-Boland Subgraphs Nathan Lindzey, Ross M. McConnell Colorado State University, Fort Collins CO 80521, USA Abstract. Tucker characterized

More information

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming 20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, 2008 4 Pairwise alignment We will discuss: 1. Strings 2. Dot matrix method for comparing sequences 3. Edit distance 4. Global and local alignment

More information

A GREEDY APPROXIMATION ALGORITHM FOR CONSTRUCTING SHORTEST COMMON SUPERSTRINGS *

A GREEDY APPROXIMATION ALGORITHM FOR CONSTRUCTING SHORTEST COMMON SUPERSTRINGS * A GREEDY APPROXIMATION ALGORITHM FOR CONSTRUCTING SHORTEST COMMON SUPERSTRINGS * 1 Jorma Tarhio and Esko Ukkonen Department of Computer Science, University of Helsinki Tukholmankatu 2, SF-00250 Helsinki,

More information

On the Monotonicity of the String Correction Factor for Words with Mismatches

On the Monotonicity of the String Correction Factor for Words with Mismatches On the Monotonicity of the String Correction Factor for Words with Mismatches (extended abstract) Alberto Apostolico Georgia Tech & Univ. of Padova Cinzia Pizzi Univ. of Padova & Univ. of Helsinki Abstract.

More information

O 3 O 4 O 5. q 3. q 4. Transition

O 3 O 4 O 5. q 3. q 4. Transition Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

More information

Sequence Alignment (chapter 6)

Sequence Alignment (chapter 6) Sequence lignment (chapter 6) he biological problem lobal alignment Local alignment Multiple alignment Introduction to bioinformatics, utumn 6 Background: comparative genomics Basic question in biology:

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

Protein folding. α-helix. Lecture 21. An α-helix is a simple helix having on average 10 residues (3 turns of the helix)

Protein folding. α-helix. Lecture 21. An α-helix is a simple helix having on average 10 residues (3 turns of the helix) Computat onal Biology Lecture 21 Protein folding The goal is to determine the three-dimensional structure of a protein based on its amino acid sequence Assumption: amino acid sequence completely and uniquely

More information

arxiv: v3 [math.co] 17 Jan 2018

arxiv: v3 [math.co] 17 Jan 2018 An infinite class of unsaturated rooted trees corresponding to designable RNA secondary structures arxiv:1709.08088v3 [math.co] 17 Jan 2018 Jonathan Jedwab Tara Petrie Samuel Simon 23 September 2017 (revised

More information

BIOINFORMATICS. Fast evaluation of internal loops in RNA secondary structure prediction. Abstract. Introduction

BIOINFORMATICS. Fast evaluation of internal loops in RNA secondary structure prediction. Abstract. Introduction BIOINFORMATICS Fast evaluation of internal loops in RNA secondary structure prediction Abstract Motivation: Though not as abundant in known biological processes as proteins, RNA molecules serve as more

More information

A New Dominant Point-Based Parallel Algorithm for Multiple Longest Common Subsequence Problem

A New Dominant Point-Based Parallel Algorithm for Multiple Longest Common Subsequence Problem A New Dominant Point-Based Parallel Algorithm for Multiple Longest Common Subsequence Problem Dmitry Korkin This work introduces a new parallel algorithm for computing a multiple longest common subsequence

More information

On Two Class-Constrained Versions of the Multiple Knapsack Problem

On Two Class-Constrained Versions of the Multiple Knapsack Problem On Two Class-Constrained Versions of the Multiple Knapsack Problem Hadas Shachnai Tami Tamir Department of Computer Science The Technion, Haifa 32000, Israel Abstract We study two variants of the classic

More information

Dynamic Programming. Shuang Zhao. Microsoft Research Asia September 5, Dynamic Programming. Shuang Zhao. Outline. Introduction.

Dynamic Programming. Shuang Zhao. Microsoft Research Asia September 5, Dynamic Programming. Shuang Zhao. Outline. Introduction. Microsoft Research Asia September 5, 2005 1 2 3 4 Section I What is? Definition is a technique for efficiently recurrence computing by storing partial results. In this slides, I will NOT use too many formal

More information

Analysis and Design of Algorithms Dynamic Programming

Analysis and Design of Algorithms Dynamic Programming Analysis and Design of Algorithms Dynamic Programming Lecture Notes by Dr. Wang, Rui Fall 2008 Department of Computer Science Ocean University of China November 6, 2009 Introduction 2 Introduction..................................................................

More information

11.3 Decoding Algorithm

11.3 Decoding Algorithm 11.3 Decoding Algorithm 393 For convenience, we have introduced π 0 and π n+1 as the fictitious initial and terminal states begin and end. This model defines the probability P(x π) for a given sequence

More information

Efficient Polynomial-Time Algorithms for Variants of the Multiple Constrained LCS Problem

Efficient Polynomial-Time Algorithms for Variants of the Multiple Constrained LCS Problem Efficient Polynomial-Time Algorithms for Variants of the Multiple Constrained LCS Problem Hsing-Yen Ann National Center for High-Performance Computing Tainan 74147, Taiwan Chang-Biau Yang and Chiou-Ting

More information

Combinatorial Batch Codes and Transversal Matroids

Combinatorial Batch Codes and Transversal Matroids Combinatorial Batch Codes and Transversal Matroids Richard A. Brualdi, Kathleen P. Kiernan, Seth A. Meyer, Michael W. Schroeder Department of Mathematics University of Wisconsin Madison, WI 53706 {brualdi,kiernan,smeyer,schroede}@math.wisc.edu

More information

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence Page Hidden Markov models and multiple sequence alignment Russ B Altman BMI 4 CS 74 Some slides borrowed from Scott C Schmidler (BMI graduate student) References Bioinformatics Classic: Krogh et al (994)

More information

Implementing Approximate Regularities

Implementing Approximate Regularities Implementing Approximate Regularities Manolis Christodoulakis Costas S. Iliopoulos Department of Computer Science King s College London Kunsoo Park School of Computer Science and Engineering, Seoul National

More information

A tutorial on RNA folding methods and resources

A tutorial on RNA folding methods and resources A tutorial on RNA folding methods and resources Alain Denise, LRI/IGM, Université Paris-Sud with invaluable help from Yann Ponty, CNRS/Ecole Polytechnique 1 Master BIBS 2014-2015 Goals To help your work

More information

CMSC 451: Lecture 7 Greedy Algorithms for Scheduling Tuesday, Sep 19, 2017

CMSC 451: Lecture 7 Greedy Algorithms for Scheduling Tuesday, Sep 19, 2017 CMSC CMSC : Lecture Greedy Algorithms for Scheduling Tuesday, Sep 9, 0 Reading: Sects.. and. of KT. (Not covered in DPV.) Interval Scheduling: We continue our discussion of greedy algorithms with a number

More information

RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology"

RNA Search and! Motif Discovery Genome 541! Intro to Computational! Molecular Biology RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology" Day 1" Many biologically interesting roles for RNA" RNA secondary structure prediction" 3 4 Approaches to Structure

More information

RNA Basics. RNA bases A,C,G,U Canonical Base Pairs A-U G-C G-U. Bases can only pair with one other base. wobble pairing. 23 Hydrogen Bonds more stable

RNA Basics. RNA bases A,C,G,U Canonical Base Pairs A-U G-C G-U. Bases can only pair with one other base. wobble pairing. 23 Hydrogen Bonds more stable RNA STRUCTURE RNA Basics RNA bases A,C,G,U Canonical Base Pairs A-U G-C G-U wobble pairing Bases can only pair with one other base. 23 Hydrogen Bonds more stable RNA Basics transfer RNA (trna) messenger

More information

DNA/RNA Structure Prediction

DNA/RNA Structure Prediction C E N T R E F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U Master Course DNA/Protein Structurefunction Analysis and Prediction Lecture 12 DNA/RNA Structure Prediction Epigenectics Epigenomics:

More information

STRUCTURAL BIOINFORMATICS I. Fall 2015

STRUCTURAL BIOINFORMATICS I. Fall 2015 STRUCTURAL BIOINFORMATICS I Fall 2015 Info Course Number - Classification: Biology 5411 Class Schedule: Monday 5:30-7:50 PM, SERC Room 456 (4 th floor) Instructors: Vincenzo Carnevale - SERC, Room 704C;

More information

On the Sizes of Decision Diagrams Representing the Set of All Parse Trees of a Context-free Grammar

On the Sizes of Decision Diagrams Representing the Set of All Parse Trees of a Context-free Grammar Proceedings of Machine Learning Research vol 73:153-164, 2017 AMBN 2017 On the Sizes of Decision Diagrams Representing the Set of All Parse Trees of a Context-free Grammar Kei Amii Kyoto University Kyoto

More information

Whole Genome Alignments and Synteny Maps

Whole Genome Alignments and Synteny Maps Whole Genome Alignments and Synteny Maps IINTRODUCTION It was not until closely related organism genomes have been sequenced that people start to think about aligning genomes and chromosomes instead of

More information

Bio nformatics. Lecture 23. Saad Mneimneh

Bio nformatics. Lecture 23. Saad Mneimneh Bio nformatics Lecture 23 Protein folding The goal is to determine the three-dimensional structure of a protein based on its amino acid sequence Assumption: amino acid sequence completely and uniquely

More information

How Many Holes Can an Unbordered Partial Word Contain?

How Many Holes Can an Unbordered Partial Word Contain? How Many Holes Can an Unbordered Partial Word Contain? F. Blanchet-Sadri 1, Emily Allen, Cameron Byrum 3, and Robert Mercaş 4 1 University of North Carolina, Department of Computer Science, P.O. Box 6170,

More information

On-line String Matching in Highly Similar DNA Sequences

On-line String Matching in Highly Similar DNA Sequences On-line String Matching in Highly Similar DNA Sequences Nadia Ben Nsira 1,2,ThierryLecroq 1,,MouradElloumi 2 1 LITIS EA 4108, Normastic FR3638, University of Rouen, France 2 LaTICE, University of Tunis

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME MATHEMATICAL MODELING AND THE HUMAN GENOME Hilary S. Booth Australian National University, Australia Keywords: Human genome, DNA, bioinformatics, sequence analysis, evolution. Contents 1. Introduction:

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Lecture 4: September 19

Lecture 4: September 19 CSCI1810: Computational Molecular Biology Fall 2017 Lecture 4: September 19 Lecturer: Sorin Istrail Scribe: Cyrus Cousins Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes

More information

Biologically significant sequence alignments using Boltzmann probabilities

Biologically significant sequence alignments using Boltzmann probabilities Biologically significant sequence alignments using Boltzmann probabilities P. Clote Department of Biology, Boston College Gasson Hall 416, Chestnut Hill MA 02467 clote@bc.edu May 7, 2003 Abstract In this

More information

Consensus Optimizing Both Distance Sum and Radius

Consensus Optimizing Both Distance Sum and Radius Consensus Optimizing Both Distance Sum and Radius Amihood Amir 1, Gad M. Landau 2, Joong Chae Na 3, Heejin Park 4, Kunsoo Park 5, and Jeong Seop Sim 6 1 Bar-Ilan University, 52900 Ramat-Gan, Israel 2 University

More information

Pairwise sequence alignment and pair hidden Markov models

Pairwise sequence alignment and pair hidden Markov models Pairwise sequence alignment and pair hidden Markov models Martin C. Frith April 13, 2012 ntroduction Pairwise alignment and pair hidden Markov models (phmms) are basic textbook fare [2]. However, there

More information

Characterizing Ideal Weighted Threshold Secret Sharing

Characterizing Ideal Weighted Threshold Secret Sharing Characterizing Ideal Weighted Threshold Secret Sharing Amos Beimel Tamir Tassa Enav Weinreb August 12, 2004 Abstract Weighted threshold secret sharing was introduced by Shamir in his seminal work on secret

More information

Efficient Cache-oblivious String Algorithms for Bioinformatics

Efficient Cache-oblivious String Algorithms for Bioinformatics Efficient Cache-oblivious String Algorithms for ioinformatics Rezaul Alam Chowdhury Hai-Son Le Vijaya Ramachandran UTCS Technical Report TR-07-03 February 5, 2007 Abstract We present theoretical and experimental

More information

Hidden Markov Models

Hidden Markov Models Andrea Passerini passerini@disi.unitn.it Statistical relational learning The aim Modeling temporal sequences Model signals which vary over time (e.g. speech) Two alternatives: deterministic models directly

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

A Method for Aligning RNA Secondary Structures

A Method for Aligning RNA Secondary Structures Method for ligning RN Secondary Structures Jason T. L. Wang New Jersey Institute of Technology J Liu, JTL Wang, J Hu and B Tian, BM Bioinformatics, 2005 1 Outline Introduction Structural alignment of RN

More information

Speeding up the Dreyfus-Wagner Algorithm for minimum Steiner trees

Speeding up the Dreyfus-Wagner Algorithm for minimum Steiner trees Speeding up the Dreyfus-Wagner Algorithm for minimum Steiner trees Bernhard Fuchs Center for Applied Computer Science Cologne (ZAIK) University of Cologne, Weyertal 80, 50931 Köln, Germany Walter Kern

More information

RNA Secondary Structure Prediction: taking conservation into account

RNA Secondary Structure Prediction: taking conservation into account RNA Secondary Structure Prediction: taking conservation into account 1 13 June 2006 2 Main approaches to RNA secondary structure prediction Energy minimization (Single-strand Folding) does not require

More information

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

Codes for Partially Stuck-at Memory Cells

Codes for Partially Stuck-at Memory Cells 1 Codes for Partially Stuck-at Memory Cells Antonia Wachter-Zeh and Eitan Yaakobi Department of Computer Science Technion Israel Institute of Technology, Haifa, Israel Email: {antonia, yaakobi@cs.technion.ac.il

More information

Finding Consensus Strings With Small Length Difference Between Input and Solution Strings

Finding Consensus Strings With Small Length Difference Between Input and Solution Strings Finding Consensus Strings With Small Length Difference Between Input and Solution Strings Markus L. Schmid Trier University, Fachbereich IV Abteilung Informatikwissenschaften, D-54286 Trier, Germany, MSchmid@uni-trier.de

More information

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6) Sequence lignment (chapter ) he biological problem lobal alignment Local alignment Multiple alignment Background: comparative genomics Basic question in biology: what properties are shared among organisms?

More information

A fast algorithm to generate necklaces with xed content

A fast algorithm to generate necklaces with xed content Theoretical Computer Science 301 (003) 477 489 www.elsevier.com/locate/tcs Note A fast algorithm to generate necklaces with xed content Joe Sawada 1 Department of Computer Science, University of Toronto,

More information

Hidden Markov Models in computational biology. Ron Elber Computer Science Cornell

Hidden Markov Models in computational biology. Ron Elber Computer Science Cornell Hidden Markov Models in computational biology Ron Elber Computer Science Cornell 1 Or: how to fish homolog sequences from a database Many sequences in database RPOBESEQ Partitioned data base 2 An accessible

More information

Algorithmische Bioinformatik WS 11/12:, by R. Krause/ K. Reinert, 14. November 2011, 12: Motif finding

Algorithmische Bioinformatik WS 11/12:, by R. Krause/ K. Reinert, 14. November 2011, 12: Motif finding Algorithmische Bioinformatik WS 11/12:, by R. Krause/ K. Reinert, 14. November 2011, 12:00 4001 Motif finding This exposition was developed by Knut Reinert and Clemens Gröpl. It is based on the following

More information

A Robust APTAS for the Classical Bin Packing Problem

A Robust APTAS for the Classical Bin Packing Problem A Robust APTAS for the Classical Bin Packing Problem Leah Epstein 1 and Asaf Levin 2 1 Department of Mathematics, University of Haifa, 31905 Haifa, Israel. Email: lea@math.haifa.ac.il 2 Department of Statistics,

More information

arxiv: v1 [cs.ds] 15 Feb 2012

arxiv: v1 [cs.ds] 15 Feb 2012 Linear-Space Substring Range Counting over Polylogarithmic Alphabets Travis Gagie 1 and Pawe l Gawrychowski 2 1 Aalto University, Finland travis.gagie@aalto.fi 2 Max Planck Institute, Germany gawry@cs.uni.wroc.pl

More information

The Intractability of Computing the Hamming Distance

The Intractability of Computing the Hamming Distance The Intractability of Computing the Hamming Distance Bodo Manthey and Rüdiger Reischuk Universität zu Lübeck, Institut für Theoretische Informatik Wallstraße 40, 23560 Lübeck, Germany manthey/reischuk@tcs.uni-luebeck.de

More information

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

More information

Efficient Cryptanalysis of Homophonic Substitution Ciphers

Efficient Cryptanalysis of Homophonic Substitution Ciphers Efficient Cryptanalysis of Homophonic Substitution Ciphers Amrapali Dhavare Richard M. Low Mark Stamp Abstract Substitution ciphers are among the earliest methods of encryption. Examples of classic substitution

More information

Asymptotic redundancy and prolixity

Asymptotic redundancy and prolixity Asymptotic redundancy and prolixity Yuval Dagan, Yuval Filmus, and Shay Moran April 6, 2017 Abstract Gallager (1978) considered the worst-case redundancy of Huffman codes as the maximum probability tends

More information

A New Upper Bound for Max-2-SAT: A Graph-Theoretic Approach

A New Upper Bound for Max-2-SAT: A Graph-Theoretic Approach A New Upper Bound for Max-2-SAT: A Graph-Theoretic Approach Daniel Raible & Henning Fernau University of Trier, FB 4 Abteilung Informatik, 54286 Trier, Germany {raible,fernau}@informatik.uni-trier.de Abstract.

More information

Faster Algorithms for some Optimization Problems on Collinear Points

Faster Algorithms for some Optimization Problems on Collinear Points Faster Algorithms for some Optimization Problems on Collinear Points Ahmad Biniaz Prosenjit Bose Paz Carmi Anil Maheshwari J. Ian Munro Michiel Smid June 29, 2018 Abstract We propose faster algorithms

More information