Sparse RNA Folding: Time and Space Efficient Algorithms

Size: px

Start display at page:

Download "Sparse RNA Folding: Time and Space Efficient Algorithms"

Clifton Turner
6 years ago
Views:

1 Sparse RNA Folding: Time and Space Efficient Algorithms Rolf Backofen 1, Dekel Tsur 2, Shay Zakov 2, and Michal Ziv-Ukelson 2 1 Albert Ludwigs University, Freiburg, Germany backofen@informatik.uni-freiburg.de 2 Department of Computer Science, Ben-Gurion University of the Negev, Israel {dekelts}, {zakovs}, {michaluz}@cs.bgu.ac.il Abstract. The classical algorithm for RNA single strand folding requires O(nZ) time and O(n 2 ) space, where n denotes the length of the input sequence and Z is a sparsity parameter which satisfies n Z n 2. We show how to reduce the space complexity of this algorithm. The space reduction is based on the observation that some solutions for subproblems are not examined after a certain stage of the algorithm, and may be discarded from memory. This yields an O(nZ) time and O(Z) space algorithm, which outputs both the cardinality of the optimal folding as well as a corresponding secondary structure. The space-efficient approach also extends to the related RNA simultaneous alignment with folding problem, and can be applied to reduce the space complexity of the fastest algorithm for this problem from O(n 2 m 2 ) down to O(nm 2 + Z), where where n and m denote the lengths of the input sequences to be aligned (w.l.o.g. m n), and Z is a sparsity parameter which satisfies nm Z n 2 m 2. In addition, we also show how to speed up the max base-pair variant of RNA single strand folding. The speed up is achieved by combining two independent existing techniques, which restrict the number of expressions that need to be examined in bottleneck computations of these algorithms. This yields an O(LZ) time and O(Z) space algorithm, where L denotes the maximum cardinality of a folding of the input sequence. 1 Introduction The structure of RNA is evolutionarily more conserved than its sequence and is thus key to its functional analysis [3]. Unfortunately, although massive amounts of sequence data are continuously generated, the number of known RNA structures is still very limited since experimental methods, such as NMR and Crystallography, require expertise and long experimental time. Therefore, computational methods for predicting RNA structures are of great value [20, 8, 19]. RNA is typically produced as a single stranded molecule, which then folds upon itself to form a number of short base-paired stems. This base-paired structure is called the secondary structure of the RNA. Paired bases almost always occur in a nested fashion in RNA secondary structure. Under the assumption

2 that the structure does not contain pseudoknots (i.e. crossing base pairs), a model was proposed by Tinoco et al. [9] to calculate the stability (in terms of free energy) of a folded RNA molecule by summing all contributions from the stabilizing, consecutive base pairs and from the loop-destabilizing terms in the secondary structure. Based on this model, dynamic programming algorithms were suggested for computing the most stable structures [16, 14, 21, 1, 17], applying various scoring criteria such as the maximal number of base pairs [14] or the minimal free energy [21]. This optimization problem is formally denoted Single RNA strand folding, and the time and space complexity of the classical algorithms for solving it is O(n 3 ) and O(n 2 ), respectively, where n denotes the length of the input RNA sequence. Recently, these were sped up to yield O(nZ) time and O(n 2 ) space [17] algorithms, where Z is a sparsity parameter which satisfies n Z n 2. We note that these algorithms are practical in the sense that the hidden constants are small. On the more theoretical front, Akutsu suggested an O(D(n)) algorithm for this problem [1], where D(n) is the time for computing the distance product of two n n matrices. The best current bound on D(n) is O(n 3 log 3 log n/ log 2 n) [2]. Another approach to RNA folding is the Sankoff simultaneous alignment with folding (SAF for short) [15, 13, 11, 12, 18]. This approach consists of finding an optimal alignment between a set of RNA sequences, where an alignment score is evaluated with respect to some common folding of the input sequences. However, as stated in [5], even for the simple case where the input consists of only two sequences, this approach requires extreme amounts of memory and space with complexity of O(n 2 m 2 ) space and O(n 3 m 3 ) time, where n and m are the lengths of the input RNA sequences to be aligned. Thus, most existing practical implementations of this algorithm [13, 11, 18] use restricted versions of the original problem. Since these restrictions introduce another source of error, it is of utmost practical importance to the research on RNA to improve both the space and time complexities of the full versions of RNA folding. A first non-heuristic speedup, which does not sacrifice the optimality of results, was recently described in [12]. This work extends the approach of [17] and yields an O(nm Z) time and O(n 2 m 2 ) space algorithm for RNA simultaneous alignment with folding, where Z is a sparsity parameter which satisfies nm Z n 2 m 2. However, experimental analysis of this algorithm indicates that the high memory requirements pose a major bottleneck in practice, both in constraining the lengths of the input sequences to be aligned, as well as in exhausting the benchmark machine s memory, which in turn results in a page-fault slowdown. Our contribution (1.) Reducing the space requirements of both RNA folding problems. In this work we focus on improving the space complexity of the base-pairing maximization variant of the RNA single strand folding problem [16, 14, 1]. The space requirement reduction is based on the observation that some solutions for subproblems are not examined after a certain stage of the algorithms, and may be discarded from memory. This yields an O(nZ) time and O(Z) space algorithm for this

3 Single strand base-paring maximization Single strand energy minimization Simultaneous alignment with folding Previous results New results Time Space Time Space O(n 3 )[14] O(nZ)[17] O(n 2 ) O(LZ) O(Z) O(D(n))[1] O(n 3 )[21] O(nZ)[17] O(n 2 ) O(nZ) O(Z) O(n 3 m 3 )[15] O(nm Z) [12] O(n2 m 2 ) O(nm Z) O(nm 2 + Z) Table 1. Time and space complexities of RNA folding algorithms. problem. In addition to the optimal folding cardinality computation, we show a trace-back procedure which outputs a corresponding secondary structure. Note that it is an interesting challenge on its own to recover an optimal folding within the time and space complexity bounds of the space-reduced algorithm, since due to the sparse representation only partial information is kept. The presented strategy may also be extended to the score computation of a family of RNA folding algorithms, which includes algorithms for the energy minimization variant of the single strand folding problem [21, 17] (improving the space complexity from O(n 2 ) to O(Z)), as well as algorithms for SAF [15, 12] (improving the space complexity from O(n 2 m 2 ) to O(nm 2 + Z)). (2.) A sparse RNA single strand folding algorithm. We also describe a fast algorithm for the max base-pair variant of RNA single-strand folding that exploits an additional sparsity parameter, based on the cardinality of the optimal folding. This is achieved by combining two independent techniques, which were previously used to reduce the number of sub-instance pairs that need to be considered by the algorithm. This combination yields the simultaneous exploitation of two key properties emerging from the formal definitions of these folding problems: the triangle inequality property, previously exploited in [17] and [12], as well as the monotonicity and unit-step properties, previously utilized in [10] for a related problem. The result is an O(LZ) time and O(Z) space algorithm, where L denotes the maximum cardinality of a folding of the input sequence and n Z nl. We note that the algorithms described here are practical in the sense that the hidden constants are small. In the context of practical contribution, we also point out that our space complexity improvements are more significant than the time complexity improvements since while the expected value of L is Θ(n) (assuming uniform character distribution), both Z and Z were experimentally shown to be significantly less than n 2 and n 2 m 2, respectively [17, 12]. Furthermore, reducing the space complexity of the simultaneous alignment with folding problem is a key result in practice, as in the previous results the space complexity was typically the computational bottleneck [5, 12].

4 RoadMap: The rest of the paper proceeds as follows. Preliminary notation and definitions are given in Section 2. The Nussinov-Jacobson [14] recurrence for computing maximum cardinality folding of an RNA sequence, and the application of the elimination technique of Wexler et al. [17] to this recurrence are described in Section 3. The latter technique is used in Section 3.2 for designing a space efficient algorithm for the single strand RNA folding problem. In the following Section 3.3, we show how to implement a space efficient traceback procedure in order to reconstruct an optimal folding from this sparse, space efficient representation. Then, in Section 3.4, the space reduction approach is shown to scale up to the RNA simultaneous alignment with folding problem. In Sections 4.1 we apply another elimination technique, which is similar to a technique of Jansson et al. [10], to further reduce the algorithm s time complexity, without increasing its space complexity. We conclude with Section 4.2, where this technique is further improved by step encoding. Due to space constraints, some figures, pseudocode and proofs are omitted, and may be found in the supporting on-line material document at (TODO:link). 2 Preliminaries An RNA sequence is a sequence over the alphabet {A, C, G, U}. Each letter in an RNA sequence is also called a base or a nucleotide. The bases A and U are called complementary bases, and so are the bases C and G 3. For a base σ {A, C, G, U}, denote by σ the complementary base of σ. Fix henceforth an RNA sequence S = s 1 s 2 s n. Denote by S i,j the subsequence s i s j of S, where S i,i 1 is defined to be the empty sequence. Definition 1 (RNA folding). A folding F of a subsequence S i,j is a set of index pairs that satisfies the following: 1. For every (k, l) F, i k < l j, and s l = s k. 2. There are no (k, l), (k, l ) F, such that k k l l. A pair (k, l) F is called a base-pair. Denote by F the size of a folding F, i.e. the number of base-pairs in F. Say that index k is paired in a folding F if k appears in a base-pair in F (either as the first or the second base in the pair), otherwise k is unpaired in F. The single strand base-pairing maximization problem was first addressed in [14]. The formal problem definition is given below. Problem 1 (Single Strand Base-Paring Maximization). Compute the maximum size of a folding of the instance sequence S. Definition 2 (branch point). Let F be a folding of some subsequence S i,j. An index q is a branch point with respect to F if for all (k, l) F, either l < q or k q. 3 For the sake of clarity, we disregard in this work the possible wobble pairing between G and U. All presented results may be easily extended to include G U pairing as well.

5 We distinguish between two kinds of foldings. For a folding F of S i,j, say that 1. F is co-terminus if F includes the base-pair (i, j). 2. F is partitionable if F does not include the base-pair (i, j). Note that for j > i, F is partitionable if and only if F has a branch point i < q j. Definition 3. For a subsequence S i,j, denote: 1. L (i, j) is the maximum size of a folding of S i,j. 2. L p (i, j) is the maximum size of a partitionable folding of S i,j. 3. L c (i, j) is the maximum size of a co-terminus folding of S i,j, or if there is no such folding (if j i or s j s i ). Call a folding F of S i,j for which F = L (i, j), an optimal folding of S i,j. In the rest of this paper, we will use L instead of L (1, n) whenever the context is clear. 3 RNA Folding via base-pairing maximization In this section we describe the recursive solution for the single RNA strand folding base-pair maximization problem, and some of its characteristics. We then present a technique for reducing its space complexity. Note that this technique easily extends to the single-strand RNA folding algorithms that are based on a thermodynamic model [20, 8, 19]. Finally, we show how to extend the spacereduction technique and apply it to the simultaneous alignment with folding algorithms [15, 12]. 3.1 A recursive solution For a subsequence S i,j such that j i, the only possible folding is the empty folding, and therefore L (i, j) = 0. The following equations show how to recursively compute L (i, j) when j > i: L (i, j) = max {L p (i, j), L c (i, j)}. (3.1) { L c L (Si+1,j 1 ) + 1, s (i, j) = j = s i, (3.2), s j s i. L p (i, j) = max i<q j {L (S i,q 1) + L (S q,j )}. (3.3) Note that the time complexity bottleneck in algorithms which implement the recursive computation of Equations 3.1 to 3.3 is due to the consideration of O(n) branch points q in the computation of L p (i, j), according to Equation 3.3. In the rest of this section, as well as in Section 4, we describe techniques that reduce the number of branch points that need to be examined in this computation, and thus improve the time complexity of such algorithms. Due to Equations 3.1 and 3.3, the following (inverse) triangle inequality is sustained in the base-paring maximization problem:

6 Observation 1 (triangle inequality) For every subsequence S i,j and for every i < q j, L (i, j) L (i, q 1) + L (q, j). Based on the triangle inequality, Wexler et al. [17] observed that it is sufficient to examine only a subset of the branch points in order to compute L p (i, j). We present here a slightly different notation for the same concept. Definition 4 (OCT). A subsequence S i,j is optimally co-terminus (OCT) if i = j, or if every optimal folding of S i,j is co-terminus (that is, if L (i, j) = L c (i, j) > L p (i, j)). Call an index q for which L p (i, j) = L (i, q 1) + L (q, j) an optimal branch point with respect to S i,j. Lemma 1 (Wexler et al. [17]). For every subsequence S i,j, there is an optimal branch point q with respect to S i,j such that S q,j is an OCT. For completeness, we give a proof for Lemma 1 with the notation of the current work. Proof. Let q be the maximum index such that L p (i, j) = L (i, q 1) + L (q, j). If q = j, then S q,j = S j,j is an OCT by definition. Otherwise, let q be any index in the range q < q j. From the selection of q, L (i, q 1) + L (q, j) = L p (i, j) > L (i, q 1)+L (q, j) Obs. 1 L (i, q 1)+L (q, q 1)+L (q, j). Therefore, L (q, j) > L (q, q 1) + L (q, n) for every q < q j, and L p (q, j) max L (q, q<q j q 1) + L (q, j) < L (q, j). Thus, S q,j is an OCT. Define the following subset of branch points with respect to S i,j : Q i,j = {i < q j : S q,j is an OCT}. Eqn. 3.3 = The following equation restates Equation 3.3, based on Lemma 1, by restricting the branch points considered by the maximization term to those in Q i,j. L p (i, j) = max q Q i,j {L (i, q 1) + L (q, j)}. (3.4) We define the following sparsity measure of RNA sequences. Definition 5. For a subsequence S i,j, Z(i, j) is the number of subsequences of S i,j which are OCTs. In the rest of this paper, we use Z instead of Z(1, n) whenever the context is clear. In the sparse case, only a small portion of the O(n 2 ) subsequences of S are OCTs. In Section 4.1 we show that, in the maximum base-pair variant of RNA single-strand folding, Z is bounded by n(l + 1). For the minimum free energy problem variant, an estimation of the expected value of a parameter related to Z, based on a probabilistic model for polymer folding and measured by simulations, which shows that that Z is significantly smaller than O(n 2 ), can be found in [17].

7 Previous algorithms for the base-pairing maximization problem were presented by Nussinov and Jacobson [14] and by Wexler et al. [17] 4. Both algorithms are dynamic programming algorithms that perform a bottom-up computation of the recurrence described in this section, where the Nussinov-Jacobson algorithm uses Equation 3.3 for the computation of L p (i, j), and the Wexler et al. algorithm improves it by using Equation 3.4. These algorithms compute the upper triangle of a table M n n, where each cell M[i, j] stores the value L (i, j). The entries of M are traversed in an order which guarantees that all values that are needed for the computation of M[i, j] = L (i, j), according to the recurrence formula, are computed and stored in M prior to the computation of M[i, j]. Upon termination, M[1, n] holds the value L. The time complexity of the algorithm by Nussinov and Jacobson is O(n 3 ), whereas that of the algorithm by Wexler et al. is O(nZ). Both algorithms use O(n 2 ) space. 3.2 A space efficient algorithm Our space reduction strategy is based on the observation that some of the values stored by the algorithm of Wexler et al. [17] are not necessary throughout the complete run of the algorithm. In the following lemma we characterize the values that need to be maintained in memory for the computation of L (i, j). Lemma 2. For a subsequence S i,j, it is possible to compute L (i, j) by examining only those values L (a, b), where i a < b j and b a < j i, which sustain that either a = i, a = i + 1, or S a,b is an OCT. Proof. Immediate from Equations 3.1, 3.2 and 3.4. Consider a dynamic programming algorithm which fills the table M by traversing its entries row by row from bottom to top, and each row from left to right. Lemma 2 implies that at the stage where M[i, j] is computed, it is sufficient to keep only the values in the currently computed i-th row, the values in the recently computed (i + 1)-th row, and values in entries which correspond to OCT subsequences of S. Thus, there is no need to maintain the complete table M in memory, rather, at each stage, entries which are guarantied not to be further examined by the algorithm may be discarded. This yields a total space complexity of O(n+Z) = O(Z). Note that the computation of each entry M[i, j] requires O( Q i,j ) operations, due to the consideration of the branch point set Q i,j (these sets are maintained as lists in order to allow an efficient traversal, n 1 n n 1 n n 1 as explained in [17]). Since Q i,j Q 1,j Z < nz, the i=1 j=i+1 i=1 j=i+1 running time of the algorithm is O(nZ). Fig. 1 illustrates a run of the above described algorithm, and Alg. 1 in the supporting on-line material gives the pseudo code. Its time and space complexities are summarized in the following lemma. 4 [17] deals with the more realistic energy minimization variant of the problem. For clarity, we project their notions on the simpler base-paring maximization variant discussed here. i=1

Grayed entries correspond to OCT subsequences. The figure in the middle presents a snapshot of the entries maintained by the algorithm at the stage where M[2, 7] is computed.

8 Lemma 3. Given an RNA sequence S of length n, there is an algorithm which computes L (1, n) in O(nZ) time and O(Z) space. Fig. 1. An exemplification of the space reduced single strand folding algorithm. The left figure shows the complete table M with respect to the RNA sequence S = ACAGUUGCA. Grayed entries correspond to OCT subsequences. The figure in the middle presents a snapshot of the entries maintained by the algorithm at the stage where M[2, 7] is computed. At this stage, the algorithm examines the sum of M[2, 6] and M[7, 7] to compute L p (2, 7) (note that Q 2,7 = {7}), and M[3, 6] + 1 to compute L c (2, 7) (since s 7 = s 2). Since L c (2, 7) = 2 > 1 = L p (2, 7), the subsequence S 2,7 is marked as an OCT. The right figure shows the computation of the next entry, M[2, 8]. Here, the evaluation of L p (2, 8) examines the sum of M[2, 6] and M[7, 8], and the sum of M[2, 7] and M[8, 8] (since Q 2,8 = {7, 8}). L c (2, 8) is not evaluated since s 8 s Folding reconstruction In addition to computing the optimal folding score of a given RNA sequence, it is often of interest to report at least one optimal folding, or a set of optimal or suboptimal foldings of the input sequence. Some well known standard techniques for reporting an optimal folding apply trace-back procedures over the folding score matrix M, in O(n 2 ) time [4]. In this section we show how to reconstruct one optimal folding, without exceeding the time and space complexities of our folding algorithm. Note that this is a challenging task, as the classical trace-back algorithm requires the availability of the full table M, while our algorithm stores only those entries of M which correspond to OCT subsequences. Assume that the full table M is given, with annotated OCT subsequences. The basic recursive folding trace-back reconstruction algorithm [4] could be modified as follows to utilize the OCT subsequences in the computation of the traceback of S i,j. 1. For j i, the only (optimal) folding of S i,j is the empty folding, and the algorithm halts without reporting any base-pair. 2. For j > i, if S i,j is an OCT, then every optimal folding of S i,j includes the pair (i, j), and therefore the algorithm reports the pair (i, j) and is called recursively on the subsequence S i+1,j 1.

9 3. Otherwise, S i,j is partitionable, and therefore the algorithm finds an index q Q i,j for which M[i, j] = M[i, q 1] + M[q, j] and then continues by recursively computing an optimal folding of S i,q 1 and of S q,j. An optimal folding of S i,q 1 is obtained by calling the trace-back algorithm recursively with the sub-instance S i,q 1. As for computing an optimal folding of S q,j, note that S q,j is an OCT, and consider the two cases, where either q = j or q < j. If q = j, then there is no need for another recursive call. Otherwise q < j, and an optimal folding of S q,j is obtained by first reporting the basepair (q, j) and then calling the trace-back algorithm recursively with the sub-instance S q+1,j 1. Time complexity analysis of the trace-back algorithm on the full table M. Observe that, when calling the above algorithm to compute the folding traceback of S i,j, recursive calls with three different subsequences could be initiated at the top level: S i 1,j 1, S q+1,j 1 and S i,q 1, and thus index j is eliminated from further consideration as an end index. Therefore, each recursive call is performed with a different end index j, and altogether there are at most n recursive calls in the whole computation. For a recursive call in which the end index is j, at most O ( Q 1,j ) operations are preformed in finding a q Q i,j for which M[i, j] = M[i, q 1] + M[q, j]. Since Q 1,j Z, the total running time is 1 j n O(Z). We next turn to address the challenge of reconstructing an optimal folding from the sparse table M computed in Section 3.2. The above described algorithm cannot be applied directly in this case, due to the fact that when the algorithm needs to find q Q i,j for which M[i, j] = M[i, q 1] + M[q, j], the values M[i, j] and M[i, q 1] may have been discarded from memory (while M[q, j] is maintained in memory since S q,j is an OCT). In order to overcome this difficulty we adopt a similar approach as of the algorithm of Hirschberg [6], namely performing on-demand value re-computations of discarded entries. Thus, it remains to show how to recover such deleted entries. Lemma 4. Given the sparse table M that contains optimal folding cardinalities for OCT subsequences, there is an algorithm which recovers the set of entries M[i, i + 1], M[i, i + 2],... M[i, j], for a pair of given indices i and j, in O(Z) time. Proof. The entries of the form M[i, j ] which have been discarded from memory correspond to partitionable subsequences, where L (i, j ) = L p (i, j ), and thus may be recomputed based solely on Equation 3.4. Observe that this computation examines only entries of the form M[i, q] for q < j, and M[q, j ] for OCT subsequences S q,j. Re-computing the entries of the ith row from left to right guaranties that upon computing M[i, j ], all necessary values for the computation of L p (i, j ) are already stored in M. For each i < j j, there are O( Q 1,j ) operations performed along this computation, due to the consideration of branch points in the set Q i,j. As before, summing this expression over all i < j j accumulates to O(Z).

10 We next show that, throughout the full run of the algorithm, the process of restoring row entries is applied to O(L) distinct start indices. Consider the case where the trace-back algorithm is applied on S i,j and assume that the set of entries M[i, i + 1], M[i, i + 2],... M[i, j] was already previously restored. Note that a recursive call with a sub-instance of the form S i,q 1 does not require the restoration of the entries M[i, i + 1], M[i, i + 2],... M[i, q 1], as (by the assumption) they have already been restored and are maintained in M. The other two possible recursive calls with sub-instances of the form S i+1,j 1 or S q+1,j 1, do require re-computation of entries in M (in rows i+1 or q+1, correspondingly). However, observe that each call of the latter kind is preceded by a detection of a base-pair. Since throughout the full run of the algorithm only L base pairs are detected, we get that the row entry recovery only needs to be executed L times (in addition to the recovery of M[1, 1], M[1, 3],... M[1, n] during initialization). Thus, according to Lemma 4, the entry value recovery contributes an additional O(LZ) factor to the total time complexity of the trace-back algorithm. Furthermore, note that upon performing such a re-computation of an entry set, there is no need to further maintain the values in M[i, i + 1], M[i, i + 2],... M[i, j] in the case where S i,j is co-terminus, nor to keep the values in M[i, q], M[i, q + 1],... M[i, j] in the case where S i,j is partitionable. This allows to discard these values from memory before the re-computation of the entry set for the corresponding sub-instance, guaranteeing that at each stage, at most n recovered entries are maintained in the sparse table M, in addition to the already existing OCT corresponding entries. Therefore, the space complexity of the trace-back algorithm remains O(Z + n) = O(Z). Alg. 2 in the supporting on-line material implements the efficient trace-back scheme. Lemma 5. There is an algorithm which, given the sparse table M that contains folding scores for all OCT subsequence of S, computes an optimal folding of S in O(LZ) time and O(Z) space. 3.4 Extending the space reduction to Simultaneous Alignment with Folding The goal of the SAF problem is to find a multiple sequence alignment and a common folding of the aligned sequences, which optimizes some score function. For simplicity, we assume the problem instance consists of two sequences. Similarly to single RNA strand folding algorithms, the basic dynamic programming algorithm for the SAF problem [15] computes the scores for all sub-instances of its input instance, and then combines these values to resolve the score of the full input instance. Given an instance of the problem - a pair of RNA sequences S and T, the algorithm maintains the scores of sub-instances (S i,j, T i,j ) in a fourdimensional table N (see Fig. 2). For S = n and T = m, we depict N as an n n super table, in which each entry N i,j corresponds to an internal table of size m m, where the combined alignment-with-folding score of the sub-instance (S i,j, T i,j ) is stored in the entry N i,j[i, j ]. The time-complexity of the basic

11 dynamic programming algorithm for the SAF problem is dictated by the need to compute all O(n 2 m 2 ) sub-instances, where each such computation involves the consideration of a set of O(nm) competing branch point index pairs (i.e. all (q, q ) such that i < q j and i < q j ). This yields a total time complexity of O(n 3 m 3 ). Recently, [12] extended the approach of [17] and applied it to speed up SAF by reducing the number of branch points that need to be considered in the main recursion for the SAF score computation. Similarly to the concept of OCT sequences, it is possible to define OCT-aligned sequence pairs, where the pair (S i,j, T i,j ) is OCT-aligned if, in every optimal alignment-with-folding of (S i,j, T i,j ) the bases s i and t i are aligned to each other, the bases s j and t j are aligned to each other, and the common folding is co-terminus. Using this formulation to describe the results of [12], it was shown that it is sufficient to examine branch point pairs (q, q ) such that the sequences S q,j and T q,j are OCT-aligned, thus reducing the number of examined branch points and improving the running time of the algorithm. This extension yields an O(nm Z) time and O(n 2 m 2 ) space algorithm for the SAF problem, where Z is the number of OCT-aligned sub-instances, and nm Z n 2 m 2 (in practice, Z is expected to be significantly smaller than O(n 2 m 2 ) [12]). Applying an observation similar to Observation 2, an algorithm is suggested here which, upon the computation of entry N i,j [i, j ], queries only those entries which correspond to OCT-aligned sub-instances, in addition to entries in rows i and i+1 of the super table N. The space complexity of SAF is thus reduced to O(nm 2 + Z) (w.l.o.g. m n). In the extended version of this paper we describe in detail how to extend the space-reduction technique described in Section 3.2 to the four-dimensional matrix computed by the SAF algorithm [12]. An intuitive explanation can be found in Fig. 2. Lemma 6. There is an algorithm which computes the simultaneous alignment with folding of two RNA sequences S and T in O(nm Z) time and O(nm 2 + Z) space, where n = S, m = T, and w.l.o.g. m n. 4 Utilizing step characterization In this section we show how to take advantage of a step characterization of the single strand base-pairing maximization problem in order to improve the running time of the algorithms which compute it. Based on this approach, in Section 4.1 we describe an improvement to Alg. 1 which reduces its running time from O(nZ) to O ( n 2 + LZ ), and then in Section 4.2 we further reduce it to O(LZ). Both algorithms have the same space complexity as Alg. 1, which is O(Z). Let S i,j be a subsequence of S, and S i,j another subsequence such that either i = i + 1 and j = j, or i = i and j = j 1. It is straightforward to show that L (i, j ) L (i, j) L (i, j ) + 1. Therefore, we get the following observation:

12 Fig. 2. A schematic illustration of the data structure maintained by the dynamic programming algorithm for the SAF problem. In order to compute the entry N i,j[i, j ], the algorithm needs to examine values in entries marked with club symbols. These entries include entries in the ith and (i + 1)th rows of the super table (the entries N i,j[i, j 1], N i,j 1[i, j ], N i,j[i + 1, j ], N i+1,j[i, j ], N i+1,j 1[i + 1, j 1], and entries of the form N i,q 1[i, q 1]), as well as entries of the from N q,j[q, j ] which correspond to OCT-aligned sub-instances (S q,j, T q,j ) (illustrated by grayed entries). The algorithm maintains only entries in rows i and (i + 1) in the super table (marked with dot symbols) and entries which correspond to OCT-aligned sub-instances. All entries as from the (i + 2)th row in the super table which do not correspond to OCTaligned sub-instances are discarded from memory. Observation 2 For every 1 k n, the sequence L (S k,k ), L (S k,k+1 ),..., L (S k,n ), as well as the sequence L (S k,k ), L (S k 1,k ),..., L (S 1,k ) are monotonically nondecreasing with unit steps in the range 0 L. The above observation implies a bound on Z, as follows. Lemma 7. The value of Z satisfies n Z n(l + 1). Proof. Every OCT subsequence S i,j satisfies that either i = j, or L (i, j) > L (i + 1, j). Hence, according to Observation 2, there are at most L+1 OCT subsequences that end with a given index j, and at therefore there are most n(l+1) OCT subsequences of S. 4.1 An O(n 2 + LZ) Algorithm Similarly to the previously presented technique for restricting the set of examined branch points in the computation of L p (i, j), we next show another dominance relation which can be utilized to further constrain the set of branch points examined in Equation 3.3.

13 Definition 6 (step sequence). Call a subsequence S i,j a step sequence if in every optimal folding of S i,j the base i is paired. Observe that S i,j is a step sequence if and only if q = i+1 is not a branch point in any of the optimal foldings of S i,j, i.e. L(i, j) > L(i, i)+l(i+1, j) = L(i+1, j). Also note that any OCT subsequence of length greater than 1 is a step sequence, though the opposite is not necessarily true. In the following Lemma we further restrict the branch points which need to be examined in a recursive computation of L p (i, j). Lemma 8. For any subsequence S i,j such that j > i, there is an optimal branch point q with respect to S i,j such that either q = i + 1, or S i,q 1 is a step sequence and S q,j is an OCT. Proof. If q = i+1 is an optimal branch point with respect to S, the lemma holds. Otherwise, L p (i, j) > L (i, i) + L (i + 1, j) = L (i + 1, j). According to Lemma 1, there is an optimal branch point i + 1 < q j such that S q,j is an OCT. Therefore, L (i, q 1)+L (q, j) = L p (i, j) > L (i + 1, j) Obs. 1 L (i + 1, q 1)+L (q, j). It follows that L (i, q 1) > L (i + 1, q 1), hence S i,q 1 is a step sequence. Define the following subset of branch points with respect to S i,j : P i,j = {i + 1} {i + 1 < q j : S i,q 1 is a step sequence and S q,j is an OCT}. The following equation restates Equation 3.4, based on Lemma 8. L p (i, j) = max q P i,j {L (i, q 1) + L (q, j)}. (4.1) We next show a bottom-up algorithm that computes L according to Equations 3.1, 3.2, and 4.1. The presented algorithm is similar to Alg. 1, where a forward dynamic programming technique is applied in order to efficiently compute L p (i, j) (Forward dynamic programming was also applied by Jansson et al. [10] to a related problem). The new algorithm also scans and computes the entries of M in decreasing row index and increasing column index. It maintains the following invariant: upon reaching entry M[i, j], the entry contains the value L p (i, j). Before computing row i in M, the entries M[i, i 1] and M[i, i] are initialized with zeros, and all entries M[i, j] for i < j n are initialized with the corresponding values M[i + 1, j]. This initialization is equivalent to examining the branch point q = i + 1 in the computation of L p (i, j) according to equation 4.1 for all j > i (the branching at q = i+1 is handled separately from other branch points in P i,j since it does not follow the step sequence-prefix-oct-suffix rule as the rest of the group). Note that in this stage the invariant is sustained for the first entry in the row which is traversed by the algorithm - M[i, i + 1], since P i,i+1 = {i + 1}. Based on the invariant, upon reaching M[i, j], the entry contains the value L p (i, j), and the value L (i, j) can be computed by resolving the maximum

14 between the current entry value and the value of L c (i, j), which is obtained from Equation 3.2. If L c (i, j) > L p (i, j), S i,j is classified as an OCT. Then, if M[i, j] > M[i + 1, j], S i,j is classified as a step sequence, and the branch point q = j + 1 is considered and forward-reflected to the computation of L p (i, j ), for all j > j such that S j+1,j is an OCT, by updating the value of M[i, j ] to be the maximum among its current value and that of M[i, j] + M[j + 1, j ], thus accumulating the maximum according to Equation 4.1, and guaranteeing the maintenance of the invariant. Alg. 5 in the supporting on-line material implements the forward dynamic programming approach described above, combined with the space-efficient approach described in Section 3.2. An illustration of its run is given in Fig. 3. The speedup obtained by this algorithm is due to the fact that branch points are examined by Equation 4.1 only if both the sequence prefix before the branch point is a step-sequence and its suffix, as from the branch point on, is an OCT. Note that, for each one of the Z OCT subsequences S q,j which are examined as suffices by Equation 4.1, Observation 2 shows that there are at most L sequences S i,q 1 which may be corresponding step-sequence prefixes, and thus the total run-time contribution due to computation of values of the form L p (i, j) is O(LZ). Since the table M has O(n 2 ) entries, where for each entry O(1) operations are performed in addition to the operations involved in the computations of L p (i, j), the total running time is O(n 2 + LZ). The space complexity remains O(Z), as the space complexity of Alg. 1. Lemma 9. Given an RNA sequence S of length n, there is an algorithm which computes L (1, n) in O(n 2 + LZ) time and O(Z) space. 4.2 An O(LZ) Algorithm In this section we show how to further reduce the running time of the folding algorithm from O(n 2 + LZ) to O(LZ). We do so by applying a step encoding [7] to M, representing each of its rows by its O(L) steps (see Fig. 4). Hence, in what follows we give corresponding step-encoding formulations, where a typical instance is composed of a suffix S i,n of S and a folding cardinality x, which will be denoted by the pair (S i,n, x). The goal is to compute the minimum index i 1 j n such that there is a folding of S i,j whose cardinality is x. The next definition gives a step-encoding equivalents of the entities L (i, j), L p (i, j), and L c (i, j). Definition 7. For 1 i n, 1 x, and α {ɛ, p, c} (where ɛ denotes the empty word), define β α (i, x) to be the minimum index j such that L α (i, j) x, or if there is no such j. Note the relation between the step-encoding formulation and the standard formulation, where L (i, j) is the maximum x such that β (i, x) j. Say that a sub-instance (S i,n, x) is a β-oct if β (i, x) = β c (i, x) < β p (i, x). In the following definition we give the step-encoding equivalent of P i,j.

15 Fig. 3. An exemplification of Alg. 5. The left figure shows the initialization of row 3 in the table M, copying the values from row 4. The figure in the middle presents the computation of the entry M[3, 5]. Upon reaching this entry, the entry value is 0, which corresponds to L p (3, 5). Since s 5 = s 3, L c (3, 5) = M[4, 4]+1 = 1 is evaluated, the entry value is updated to be 1, and S 3,5 is marked as an OCT. Since M[3, 5] > M[4, 5], S 3,5 is classified as a step sequence, and the branching at q = 6 is considered for all sequences S 3,j such that S 6,j is an OCT. This forward computation updates the values in M[3, 6] and M[3, 9]. The right figure shows the computation of the next entry, M[3, 6]. Here there is no need to evaluate L c (3, 6) (since s 6 s 3), and S 3,6 is also classified as a step sequence (since M[3, 6] > M[4, 6]). Therefore, the branching at q = 7 considered for all sequences S 3,j such that S 7,j is an OCT, updating the values in M[3, 7] and M[3, 8]. Definition 8. { Y i,x = {i + 1} i + 1 < q β (i + 1, x 1) : } S i,q 1 is a step sequence, and. (S q,n, x L(i, q 1)) is a β-oct The auxiliary function defined below will be used in the computation of β c (i, x). Definition 9. For σ {A, C, G, U} and 1 r n, define NextMatch(r, σ) to be the minimum index r > r such that s r = σ, or if there is no such index r. We now convert Equations 3.1, 3.2 and 4.1 to their equivalent forms in the step encoding. For all 1 i n and 1 x: β (i, x) = min {β c (i, x), β p (i, x)}. (4.2) β c (i, x) = NextMatch (β (i + 1, x 1), s i ). (4.3) { } β p (i, x) = min min {β (q, x L(i, q 1))}, β c (i, x) + 1. (4.4) q Y i,x Intuitively, the computation of L (i, j) can be viewed as a competition between different foldings of the fixed sequence S i,j, in order to resolve the maximal folding cardinality, while the computation of β (i, x) can be viewed as a competition between different foldings of a fixed cardinality x and which correspond

to subsequences that start at index i, in order to resolve the minimal subsequence end index. Formal proofs of the correctness of Equations 4.2 to 4.

6, adopts a forward dynamic programming approach, similarly to that of Alg. 5. This allows for efficient computation of Equation 4.

Given an RNA sequence S of length n, there is an algorithm which computes L (1, n) in O(LZ) time and O(Z) space. Fig. 4. An exemplification of Alg. 6.

16 to subsequences that start at index i, in order to resolve the minimal subsequence end index. Formal proofs of the correctness of Equations 4.2 to 4.4, as well as the pseudocode of an algorithm that implements them, are included in the supporting on-line material. This algorithm, denoted Alg. 6, adopts a forward dynamic programming approach, similarly to that of Alg. 5. This allows for efficient computation of Equation 4.4, where the number of sub-instances, as well as the dimensions of the data structure which stores solutions for these sub-instances, is O(Ln) (instead of O(n 2 )). Lemma 10. Given an RNA sequence S of length n, there is an algorithm which computes L (1, n) in O(LZ) time and O(Z) space. Fig. 4. An exemplification of Alg. 6. This figure shows the table B with respect to the RNA sequence S = ACAGUUGCA, which is the step encoding of the table M in Fig. 1. The left plot shows the complete table, were grayed entries correspond to β-oct subinstances. The plot at the middle presents a snapshot of the entries maintained by the algorithm at the stage where row 3 is about to be computed, after the initialization. The right plot present the computation of the entry B[3, 1]. The initial entry value 8, which corresponds to β p (3, 1), is replaced with the value NextMatch(B[4, 0], s 3) = NextMatch(3, U) = 5, and the sub-instance (S 3,n, 1) is marked as a β-oct. Then, all branch points q such the B[3, 1] = 5 q 1 < 8 = B[4, 1] are examined for the computation of β p (3, x) for all x such that the sub-instances (S q,n, x L(3, q)) = (S q,n, x 1) are β-octs. This computation updates the values in the entries B[3, 2] and B[3, 3].

17 References 1. Tatsuya Akutsu. Approximation and exact algorithms for RNA secondary structure prediction and recognition of stochastic context-free languages. Journal of Combinatorial Optimization, 3: , T. M. Chan. More algorithms for all-pairs shortest paths in weighted graphs. In Proc. 39th Symposium on the Theory of Computing (STOC), pages , Athanasius F. Bompfünewerer Consortium, Rolf Backofen, Stephan H. Bernhart, Christoph Flamm, Claudia Fried, Guido Fritzsch, Jorg Hackermuller, Jana Hertel, Ivo L. Hofacker, Kristin Missal, Axel Mosig, Sonja J. Prohaska, Dominic Rose, Peter F. Stadler, Andrea Tanzer, Stefan Washietl, and Sebastian Will. RNAs everywhere: genome-wide annotation of structured RNAs. J Exp Zoolog B Mol Dev Evol, 308(1):1 25, Eddy S. Krogh-A. Durbin, R. and G. Mitchison. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, Paul P. Gardner and Robert Giegerich. A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics, 5:140, D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Commun. ACM, 18(6): , D.S. Hirschberg. Algorithms for the longest common subsequence problem. JACM, 24: , Ivo L. Hofacker. Vienna RNA secondary structure server. Nucleic Acids Research, (13): , Tinoco I., Borer PN., Dengler B., Levine MD., Uhlenbeck OC., Crothers DM., and Gralla J. Improved estimation of secondary structure in ribonucleic acids. Nature New Biology, 246:40 41, Jesper Jansson, See-Kiong Ng, Wing-Kin Sung, and Hugo Willy. A faster and more space-efficient algorithm for inferring arc-annotations of RNA sequences through alignment. Algorithmica, 46(2): , Havgaard JH., Lyngso RB., Stormo GD., and Gorodkin J. Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics, 21(9): , Y. Wexler M. Ziv-Ukelson, I. Gat-Viks and R. Shamir. A faster algorithm for RNA co-folding. Proc. Algorithms in Bioinformatics, 8th International Workshop, WABI, LNBI 5251, pages , September David H. Mathews and Douglas H. Turner. Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. Journal of Molecular Biology, 317(2): , Ruth Nussinov and Ann B. Jacobson. Fast algorithm for predicting the secondary structure of single-stranded RNA. PNAS, 77(11): , David Sankoff. Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM Journal on Applied Mathematics, 45(5): , M.S. Waterman and T.F. Smith. RNA secondary structure: a complete mathematical analysis. Math. Biosci., 42: , Ydo Wexler, Chaya Zilberstein, and Michal Ziv-Ukelson. A study of accessible motifs and RNA folding complexity. Journal of Computational Biology, 14(6): , Sebastian Will, Kristin Reiche, Ivo L. Hofacker, Peter F. Stadler, and Rolf Backofen. Inferring non-coding RNA families and classes by means of genome-scale structure-based clustering. PLOS Computational Biology, 3(4):e65, 2007.

18 19. M. Zuker. Computer prediction of RNA structure. Methods Enzymol., 180: , M. Zuker. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res., (13): , M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 9(1): , 1981.

19 Supporting on-line material: omitted figures, pseudo-code and proofs = L (i, j) i L p j i (i, j) L c j (i, j) = L (i, j) L p (i, j) L c (i, j) i j i j i L p (i, j) j = max q i i+1 max Q i, j i+1 j OCT i q q+1 j i L p (i, j) j = max i i+1 j max q P i, j i+1 i STEP OCT q q+1 j i L c (i, j) j = max i i+1 0 j 1 j i L c (i, j) j = max i i+1 0 j 1 j x x x i = β(i, x) i β p (i, x) i β c (i, x) x x i β c (i, x) +1 i β p (i, x) = min L(i, q 1 ) = x x x STEP min OCT q Y i, x i q 1q β (q, x x ) x i β c (i, x) = k where i i+1 j k = NextMatch( j, s i) j = β(i + 1, x 1) x 1 i+1 β(i + 1, x 1) Fig. 5. A schematic illustration of the different recursion formulas for the single strand base-pairing maximization problem. The left plot illustrates the recursion of Equations 3.1, 3.2, and 3.4, the middle plot illustrates the recursion of Equations 3.1, 3.2, and 4.1, and the right plot illustrates the recursion of Equations 4.2, 4.3, and 4.4.

20 Algorithm 1 space efficient RNA folding Input: An RNA sequence S = s 1s 2 s n. Output: L (1, n). 1: for i n down to 1 do 2: set M[i, i 1] 0, M[i, i] 0. 3: mark S i,i as an OCT. 4: for j i + 1 to n do 5: M[i, j] max q Q i,j {M[i, q 1] + M[q, j]}. // assigning L p (i, j) to M[i, j] 6: if s j = s i and M[i, j] < M[i + 1, j 1] + 1 then 7: M[i, j] M[i + 1, j 1] + 1. // assigning L c (i, j) to M[i, j] 8: mark S i,j as an OCT. 9: end if 10: end for 11: discard from memory the values in all entries in row i + 1 of M which do not correspond to OCT subsequences. 12: end for 13: return M[1, n]. Algorithm 2 Folding-Traceback (M) Input: A sparse table M which contains folding scores of all OCT subsequences of an RNA sequence S. Output: An optimal folding of S. 1: call Restore-entries (M, 1, n). 2: call Rec-Folding-Traceback (M, 1, n).

21 Algorithm 3 Rec-Folding-Traceback (M, i, j) Input: A sparse table M which contains folding scores of all OCT subsequences of an RNA sequence S, as well as folding scores for the subsequences S i,i+1, S i,i+2,... S i,j for the two given internal indices i and j. Output: An optimal folding of S i,j. 1: if j i then 2: halt. 3: else if S i,j is an OCT then 4: output the pair (i, j). 5: discard from memory the values in all non-oct entries M[i, i + 1], M[i, i + 2],..., M[i, j]. 6: call Restore-entries (M, i + 1, j 1). 7: call Rec-Folding-Traceback (M, i + 1, j 1). 8: else 9: find q Q i,j s.t. M[i, j] = M[i, q 1] + M[q, j]. 10: discard from memory the values in all non-oct entries M[i, q], M[i, q + 1],..., M[i, j]. 11: if q < j then 12: output the pair (q, j). 13: call Restore-entries (M, q + 1, j 1). 14: call Rec-Folding-Traceback (M, q + 1, j). 15: end if 16: call Rec-Folding-Traceback (M, i, q 1). 17: end if Algorithm 4 Restore-entries (M, i, j) Input: A sparse table M which contains folding scores of all OCT subsequences of an RNA sequence S, and two internal indices i and j. Output: The table M after restoring values in the entries M[i, i + 1], M[i, i + 2],... M[i, j]. 1: for j = i + 1 to j do 2: if M[i, j ] is discarded then 3: set M[i, j ] max M[i, q 1] + M[q, j ]. q Q i,j 4: end if 5: end for

22 Algorithm 5 forward RNA folding Input: An RNA sequence S = s 1s 2 s n. Output: L (1, n). 1: for i n down to 1 do 2: set M[i, i 1] 0, M[i, i] 0, and M[i, j] M[i + 1, j] for all i < j n. 3: for j i + 1 to n do 4: if s j = s i and M[i, j] < M[i + 1, j 1] + 1 then 5: M[i, j] M[i + 1, j 1] : mark S i,j as an OCT. 7: end if 8: if S i,j is a step sequence then 9: for all j s.t. S j+1,j is an OCT do 10: M[i, j ] max {M[i, j ], M[i, j] + M[j + 1, j ]}. 11: end for 12: end if 13: end for 14: discard from memory the values in all entries in row i + 1 of M which do not correspond to OCT subsequences. 15: end for 16: return M[1, n]. Algorithm 6 step encoded RNA folding Input: An RNA sequence S = s 1s 2 s n. Output: L (1, n). 1: set B[n, 0] n 1. 2: for i n 1 down to 1 do 3: set B[i, 0] i 1, B[i, L (i + 1, n) + 1]. 4: set B[i, x] B[i + 1, x] for all 1 x L (i + 1, n). 5: for x 1 to L (i + 1, n) + 1 do 6: if B[i, x] > NextMatch(B[i + 1, x 1], s i) then 7: B[i, x] NextMatch(B[i + 1, x 1], s i) 8: mark (S i,n, x) as a β-oct. 9: end if 10: for all index q s.t. B[i, x] q 1 < B[i + 1, x] do 11: for all x s.t. (S q,n, x ) is a β-oct do 12: B[i, x + x ] min {B[i, x + x ], B[q, x ]} 13: end for 14: end for 15: end for 16: discard from memory the values in all entries in row i + 1 of B which do not correspond to β-octsub-instances. 17: end for 18: return L (1, n) - the maximum x such that B[1, x] <.

23 Proof (Proof of correctness for Equation 4.2.). The correctness of Equation 4.2 is immediate from Definition 7. Proof (Proof of correctness for Equation 4.3.). First, we show that β c (i, x) NextMatch(β (i + 1, x 1), s i ). Let F be a co-terminus folding of S i,βc (i,x) of size x, and consider the folding F which is obtained by removing the base-pair (i, β c (i, x)) from F. Then, F is a folding of S i+1,β c (i,x) 1 of size x 1, and in particular β (i + 1, x 1) β c (i, x) 1. Since s β c (i,x) = s i, it follows that β c (i, x) = NextMatch(β c (i, x) 1, s i ) NextMatch(β (i + 1, x 1), s i ). Next, we show that β c (i, x) NextMatch(β (i + 1, x 1), s i ). Any folding which is obtained by adding the base-pair (i, NextMatch(β (i + 1, x 1), s i )) to an optimal folding of S i+1,β(i+1,x 1) is a co-terminus folding of size x for the sequence S i,nextmatch(β(i+1,x 1),si), and from the definition of β c (i, x) we have that β c (i, x) NextMatch(β (i + 1, x 1), s i ). Thus, β c (i, x) = NextMatch(β (i + 1, x 1), s i ). Before proving the correctness of Equation 4.4, we first show a property of β p (i, x). Claim 1 Let 1 i n 1 and let 1 x such that β p (i, x) n. 1. For every i < q β (i + 1, x 1), β p (i, x) β (q, x L (i, q 1)). 2. For an optimal branch point q with respect to S i,βp (i,x), β p (i, x) β (q, x L (i, q 1)). Proof. We prove separately the two parts of the claim. 1. Let i < q β (i + 1, x 1). Note that L (i, q 1) x 1 (since L (i, q 1) L (i + 1, q 1)+1 L (i + 1, β (i + 1, x 1) 1)+1 = x 1), and therefore x L (i, q 1) 1. If β (q, x L (i, q 1)) = n + 1, then clearly β p (i, x) β (q, x L (i, q 1)). Otherwise, consider a folding F of S i,β(q,x L(i,q 1)) which is obtained from the concatenation of optimal foldings for S i,q 1 and S q,β(q,x L(i,q 1)). Then, F = L (i, q 1) + L (q, β (q, x L (i, q 1))) = L (i, q 1)+x L (i, q 1) = x. Thus, F is a partitionable folding of size x for S i,β(q,x L(i,q 1)), and from the definition of β p (i, x) we have that β p (i, x) β (q, x L (i, q 1)). 2. Let F be a partitionable folding of size x for S i,β p (i,x), and q be a branch point in F. Then, x = F L (i, q 1) + L (q, β p (i, x)), and therefore, L (q, β p (i, x)) x L (i, q 1). It follows that there is a folding of size x L (i, q 1) to S q,β p (i,x), therefore β p (i, x) β (q, x L (i, q 1)). Proof (Proof of correctness for Equation 4.4). First, it is straightforwards to observe that that if (S i,n, x) is a β-oct, then β p (i, x) = β c (i, x) + 1. In addition, from Claim 1, for { every q Y i,x, β p (i, x) β (q, x L (i, } q 1)) and therefore β p (i, x) = min min {β (q, x L (i, q 1))}, β c (i, x) + 1. q Y i,x If (S i,n, x) is not a β-oct, then β p (i, x) { β c (i, x) < β c (i, x) + 1. Due to } Claim 1, in order to prove that β p (i, x) = min min {β (q, x L (i, q 1))}, β c (i, x) + 1, q Y i,x

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and education use, including for instruction at the authors institution