Computational Biology Lecture 5: Time speedup, General gap penalty function

Saad Mneimneh

We saw earlier that it is possible to compute optimal global alignments in linear space (it can also be done for local alignments). Now we will look at how to speed up the computation in some cases.

Similar sequences: Bounded dynamic programming

If we believe that the two sequences x and y are similar, then we might be able to optimally align them faster. For simplicity, assume that m = n, since our sequences are similar.

If x and y align perfectly, then the optimal alignment corresponds to the diagonal in the dynamic programming table (now of size n x n). This is because all of the back pointers will be pointing diagonally. Therefore, if x and y are similar, we expect the alignment not to deviate much from the diagonal. For instance, consider the two sequences x = GCGCATGGATGAGCGA and y = GCGCCATGGATGAGCA. Their optimal alignment is shown below:

GCGC-ATGGATGAGCGA
GCGCCATGGATGAGC-A

Figure 1: The optimal alignment of x and y above

Note that since x and y have the same length, their alignment will always contain gap pairs, i.e. for each gap in x there will be a corresponding gap in y. The deviation from the diagonal is at most equal to the number of gap pairs (it could be less). Therefore, if the number of gap pairs in the optimal alignment of x and y is k, we only need to look at a band of size k around the diagonal. This speeds up the computation of the optimal alignment, since we only update the entries A(i, j) that fall inside the band.

But we don't know the optimal alignment of x and y to start with, so we cannot obtain k from it. How should we set k then? Start with some value of k that you think is appropriate for your similar sequences. Could k be a wrong choice? Yes, and in that case the computed alignment will not be optimal. The book describes an iterative approach that doubles the value of k each time until an optimal alignment is obtained. We will not describe that approach here; instead we will assume that we make an educated guess for k, and that with some luck our computed alignment will be good enough, if not optimal.

To summarize, the main idea is to bound the dynamic programming table within a band of size k around the diagonal, as illustrated in the figure below:

Figure 2: Bounded dynamic programming (a band of size k around the diagonal)

Now we only need to change the basic algorithm in such a way that we do not touch or check A(i, j) if it is outside the band. In other words, we need to assume that our dynamic programming table is just the band itself. This is easy to accomplish, since A(i, j) being inside the band means |i - j| <= k (convince yourself of this simple fact). The modified Needleman-Wunsch is depicted below (pointer updates not shown, same as before).

Initialization:
    A(0, j) = -d*j   if j <= k
    A(i, 0) = -d*i   if i <= k

Main iteration:
    For i = 1 to n
        For j = max(1, i - k) to min(n, i + k)
            A(i, j) = max of:
                A(i-1, j-1) + s(i, j)
                A(i-1, j) - d   if |i - 1 - j| <= k
                A(i, j-1) - d   if |i - j + 1| <= k

Termination:
    A_opt = A(n, n)

Figure 3: Bounded Needleman-Wunsch

Now the running time is just O(kn), as opposed to O(n^2).
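To make the band bookkeeping concrete, here is a minimal Python sketch of the bounded Needleman-Wunsch above; this is my own illustration, not code from the book. It assumes a linear gap penalty d and a simple match/mismatch score, and marks out-of-band cells as unreachable:

```python
import math

def banded_needleman_wunsch(x, y, k, d=2, match=1, mismatch=-1):
    """Global alignment score restricted to the band |i - j| <= k.
    Assumes len(x) == len(y); returns the best score A(n, n)."""
    n = len(x)
    assert len(y) == n, "this sketch assumes equal-length sequences"
    NEG = -math.inf  # out-of-band cells never win a max()
    A = [[NEG] * (n + 1) for _ in range(n + 1)]
    A[0][0] = 0
    for j in range(1, min(k, n) + 1):  # A(0, j) = -d*j for j <= k
        A[0][j] = -d * j
    for i in range(1, min(k, n) + 1):  # A(i, 0) = -d*i for i <= k
        A[i][0] = -d * i
    for i in range(1, n + 1):
        for j in range(max(1, i - k), min(n, i + k) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            best = A[i - 1][j - 1] + s        # x_i aligned with y_j
            if abs(i - 1 - j) <= k:           # gap, staying inside the band
                best = max(best, A[i - 1][j] - d)
            if abs(i - j + 1) <= k:           # gap, staying inside the band
                best = max(best, A[i][j - 1] - d)
            A[i][j] = best
    return A[n][n]

# A band of size 1 already suffices for the similar sequences of Figure 1:
print(banded_needleman_wunsch("GCGCATGGATGAGCGA", "GCGCCATGGATGAGCA", k=1))
```

Only the O(kn) in-band cells are touched; a real implementation would also store just the band (O(kn) space) rather than the full table.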

The Four Russians Speedup

In 1970, Arlazarov, V. L., E. A. Dinic, M. A. Kronrod, and I. A. Faradzev presented, in their paper "On economical construction of the transitive closure of a directed graph", a technique to speed up dynamic programming. This technique is commonly referred to as the Four Russians speedup.

The idea behind the Four Russians speedup is to divide the dynamic programming table into t x t blocks, called t-blocks, and then compute the values one t-block at a time rather than one cell at a time. Of course, this by itself does not achieve any speedup; the goal is to compute each block in only O(t) time rather than O(t^2), thus achieving a speedup of t. The following figure shows a t-block positioned at A(i, j) in the dynamic programming table:

Figure 4: A t-block at A(i, j); its input is the first row and first column of the block together with the corresponding substrings of x and y, and its output F is the boundary formed by the last row and last column

Since the unit of computation is now the t-block, we do not really care about the values inside the t-block itself. As the figure illustrates, we are only interested in computing the boundary F, which we will call the output of the t-block. Note that given the first row and first column of the block, and the substrings of x and y that the block spans, the output F is completely determined; we will call these the t-block input. Note also that the input and output of a t-block are both O(t) in size. Therefore, we will assume the existence of a t-block function that, given the input, computes F in O(t) time.

How can t-blocks with the assumed t-block function help? Here's how: divide the dynamic programming table into intersecting t-blocks, such that the last row of a t-block is shared with the first row of the t-block below it (if any), and the last column of a t-block is shared with the first column of the t-block to its right (if any). For simplicity, assume m = n = k(t - 1), so that each row and each column in the table will have k intersecting t-blocks (the total number of t-blocks is therefore k^2). The following figure illustrates the idea for n = 9 and t = 4.

Figure 5: 4-blocks in a table (n = 9)

Now the computation of the dynamic programming table can be performed as follows: initialize the first row and first column of the whole table; then, in a rowwise manner, use the t-block function to compute the output of each t-block (the output of one t-block will serve as input to others); finally, return A(n, n).

The running time of the above algorithm is O(k^2 t) = O((n/t)^2 t) = O(n^2/t), hence achieving a speedup of t, as we said before.

The big question now is how to construct the t-block function which, given the input, computes the output in O(t) time. This is conceptually easy to do. If we pre-compute the outputs for all possible inputs and store the results indexed by the inputs, then we can retrieve the output for any input in O(t) time (input and output both have size O(t)). So how do we pre-compute the outputs? Again easy: for every input, compute the output the usual way (cell by cell, in a rowwise manner) in O(t^2) time and store it. The total time we spend on the pre-computation will be O(I t^2), where I is the number of possible inputs. You might say now: hey, didn't we say that we are not going to spend O(t^2) time on a t-block? Yes, but the trick is that we now spend O(t^2) time per possible input, not per t-block, and we hope that the number of inputs is small enough.

But it is not! Let us count the possible inputs. The two substrings can each take O(alpha^t) possible values, where alpha is the size of the alphabet used for the sequences. Moreover, the first row and the first column can each take O(L^t) values, where L is the number of possible values that a cell in the dynamic programming table can have; of course, L grows at least linearly with n, since the scores span a range of size Theta(n). Therefore, the number of possible inputs is I = O(alpha^(2t) L^(2t)), and the running time of the pre-computation will be O(alpha^(2t) L^(2t) t^2). But t has to be at least 2 (otherwise there is no need for this whole approach), so this bound exceeds O(n^2). The problem is that the number of possible inputs is huge. Can we reduce it to something linear in n? The answer is yes, but we first have to look at the following observation:

Lemma: If the (+1, -1, -2) scoring scheme is used (match +1, mismatch -1, gap -2), then in any row, column, or diagonal, two adjacent cells differ by at most 3.

Proof: First, A(i, j-1) - A(i, j) <= 2, because A(i, j) >= A(i, j-1) - 2 from the update rule. Now consider A(i, j) - A(i, j-1). Replace y_j with a gap in the alignment of x_1...x_i and y_1...y_j with score A(i, j); we get an alignment of x_1...x_i and y_1...y_(j-1) with score at least A(i, j) - 3 (the worst case is when x_i was aligned with y_j, i.e. we replaced a match with a gap). So A(i, j-1) >= A(i, j) - 3, because A(i, j-1) is the optimal score for aligning x_1...x_i and y_1...y_(j-1); thus A(i, j) - A(i, j-1) <= 3. A symmetric argument works for the column. We leave the diagonal as an exercise (it is not needed for the argument).

Now the trick. Since adjacent cells cannot differ by much, we can use offset encoding for the input of a t-block. First we define an offset vector to be a vector of size t such that its values are in the set {-3, -2, -1, 0, 1, 2, 3} and its first entry is 0. We can encode any row or column as an offset vector. For instance, consider the row 5, 4, 4, 5. This can be represented as 0, -1, 0, +1. Given this offset vector and the first value 5, the row can be completely recovered as follows: we start with 5; we add an offset of -1 and obtain 4; we add an offset of 0 and obtain 4; we add an offset of +1 and obtain 5.
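Offset encoding and decoding are easy to implement. The small sketch below is only meant to illustrate the definition; the helper names encode_offsets and decode_offsets are hypothetical:

```python
def encode_offsets(row):
    """Offset vector of a row: first entry 0, then differences of adjacent
    cells. With the (+1, -1, -2) scheme every entry lies in {-3, ..., +3}."""
    return [0] + [b - a for a, b in zip(row, row[1:])]

def decode_offsets(first_value, offsets):
    """Recover a row from its first value and its offset vector."""
    row = [first_value]
    for off in offsets[1:]:
        row.append(row[-1] + off)
    return row

print(encode_offsets([5, 4, 4, 5]))      # [0, -1, 0, 1]
print(decode_offsets(5, [0, -1, 0, 1]))  # [5, 4, 4, 5]
```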
The other important observation is that, given the input of a t-block as offset vectors, the t-block function can compute the output as offset vectors as well, without even knowing the initial values. Below is an example:

Figure 6: Computing with offset vectors: writing the unknown corner value as A, every cell of the t-block has the form A plus a small constant (e.g. A+1, A-3, A-4, A-6), so the block can be filled in, and its output offsets read off, without knowing A

As you can see in the figure above, the initial value A is not needed: the t-block function can simply work in terms of A to compute the output in offset vectors. At the end, we can determine A(n, n) by starting from A(n, 0), which is known from the initialization, and adding up the offsets in the offset vectors of the last row of t-blocks.
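To see why the value of A never matters, here is a sketch (again my own illustration, with the hypothetical name tblock_output_offsets, assuming the (+1, -1, -2) scheme) that fills a t-block from offset-vector inputs; running it with two different corner values yields identical output offsets:

```python
def tblock_output_offsets(top_offsets, left_offsets, xsub, ysub, corner=0):
    """Fill a t-block and return the offset vectors of its last row and last
    column. 'corner' plays the role of the unknown value A; the returned
    offsets do not depend on it. Scheme: match +1, mismatch -1, gap -2.
    xsub and ysub are the t-1 characters of x and y spanned by the block."""
    t = len(top_offsets)
    B = [[0] * t for _ in range(t)]
    B[0][0] = corner
    # Reconstruct the first row and first column relative to the corner.
    for c in range(1, t):
        B[0][c] = B[0][c - 1] + top_offsets[c]
    for r in range(1, t):
        B[r][0] = B[r - 1][0] + left_offsets[r]
    # Standard Needleman-Wunsch update inside the block.
    for r in range(1, t):
        for c in range(1, t):
            s = 1 if xsub[r - 1] == ysub[c - 1] else -1
            B[r][c] = max(B[r - 1][c - 1] + s,
                          B[r - 1][c] - 2,
                          B[r][c - 1] - 2)
    last_row = [0] + [B[t - 1][c] - B[t - 1][c - 1] for c in range(1, t)]
    last_col = [0] + [B[r][t - 1] - B[r - 1][t - 1] for r in range(1, t)]
    return last_row, last_col

# Same inputs, two different corner values: identical output offsets.
args = ([0, -2, -2, 1], [0, -2, 1, 1], "GCA", "GAA")
print(tblock_output_offsets(*args, corner=0))
print(tblock_output_offsets(*args, corner=100))
```

Shifting every boundary value by the same constant shifts every computed cell by that constant, so all differences, and hence the output offsets, are unchanged.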

What have we achieved? The number of possible inputs is now reduced considerably: inputs are now offset vectors, and an offset vector of size t can take at most 7^t values. So the number of possible inputs is I = O(alpha^(2t) 7^(2t)). For DNA sequences, alpha = 4, therefore I = O(28^(2t)). If we set t = (1/2) log_28 n, we have I = O(n). The running time of the pre-computation is now O(I t^2) = O(n log^2 n). The total time for computing the output of the whole dynamic programming table as offset vectors is O(n log^2 n + n^2/log n) = O(n^2/log n).

General gap penalty function

So far we have considered the gaps of an alignment as some sort of individual mismatches and penalized them as such. There are biological reasons to believe that once an addition or deletion happens, it is very likely that it will happen over a cluster of contiguous positions; e.g. a gap of length k is more likely to occur than k contiguous gaps of length 1 each, occurring independently. Thus we want a more general gap penalty function that gives us the option of not penalizing additional gaps as much as the first gap.

Figure 7: A more general gap function: compared to the old (linear) model, the better model gamma(x) grows more slowly as the gap length x increases

This new model for penalizing gaps introduces a new difficulty for our alignment algorithms: the additivity of the score does not hold anymore, a property that we have been relying upon a lot. Below is a figure that illustrates the idea: cutting an alignment across a gap of length 5 produces two pieces whose gap costs add up to gamma(2) + gamma(3), which in general differs from gamma(5).

xxxxxxxxxxxxxx        xxxxxxx   xxxxxxx
yyyyy-----yyyy        yyyyy--   ---yyyy

  cost gamma(5)       cost gamma(2) + gamma(3)

Figure 8: The score is not additive across a gap

This implies that if the alignment of x_1...x_i and y_1...y_j is optimal, a prefix (or suffix) of that alignment is not necessarily optimal (with the score being additive, this was the case before). Here's an example: x = ABAC and y = BC. Assume that the score of a match is 1, that of a mismatch is -epsilon for some small epsilon > 0, and that gamma(1) = 2 and gamma(2) = 2 - epsilon. Then the best alignment is:

ABAC
B--C

which gives a score of -1. Obviously, the part:

AB
B-

is not optimal (aligning A with a gap and B with B scores -1, which is better than -2 - epsilon).

So our update rules have to change. Let us investigate how we should change our basic Needleman-Wunsch algorithm to work with the new gap penalty function. The optimal alignment of x_1...x_i and y_1...y_j that aligns x_i with y_j still has a score of A(i-1, j-1) + s(i, j). This is because the score is additive in that case: the alignment ends with x_i aligned with y_j, so the cut does not go across a gap.

What about the two other cases? They are symmetric, so let us look at one of them. Consider the optimal alignment of x_1...x_i and y_1...y_j that aligns x_i with a gap. The score of that alignment is not necessarily A(i-1, j) - gamma(1), because the cut here might go across a gap. A good observation is the following: the optimal alignment of x_1...x_i and y_1...y_j that aligns x_i with a gap must end with a gap of a certain length. Therefore, if we try all possible lengths for that gap, we will eventually find the optimal alignment of x_1...x_i and y_1...y_j that aligns x_i with a gap; that is, we try A(i-k, j) - gamma(k) for k = 1...i. Here's the new update rule:

A(i, j) = max of:
    A(i-1, j-1) + s(i, j),               with Ptr(i, j) = (i-1, j-1)
    A(i, j-k) - gamma(k) for k = 1...j,  with Ptr(i, j) = (i, j-k)
    A(i-k, j) - gamma(k) for k = 1...i,  with Ptr(i, j) = (i-k, j)

The initialization will also change appropriately: A(0, 0) = 0, A(i, 0) = -gamma(i), and A(0, j) = -gamma(j). For each entry A(i, j) we now spend O(1 + i + j) time. Therefore, the total running time is O(sum_i sum_j (1 + i + j)) = O(m^2 n + n^2 m). This is now cubic, compared to the quadratic time of the previous algorithms. We will see next lecture how to go back to our quadratic time bound using an affine gap penalty function that approximates the general one.
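A minimal Python sketch of this cubic recurrence (my own illustration, not the book's code), with the gap penalty supplied as a function gamma:

```python
def align_general_gap(x, y, gamma, match=1, mismatch=-1):
    """Global alignment with a general gap penalty function gamma(k),
    following the update rule above: O(m^2*n + n^2*m) time."""
    m, n = len(x), len(y)
    A = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):  # A(i, 0) = -gamma(i)
        A[i][0] = -gamma(i)
    for j in range(1, n + 1):  # A(0, j) = -gamma(j)
        A[0][j] = -gamma(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            best = A[i - 1][j - 1] + s  # x_i aligned with y_j
            for k in range(1, j + 1):   # last k characters of y in a gap
                best = max(best, A[i][j - k] - gamma(k))
            for k in range(1, i + 1):   # last k characters of x in a gap
                best = max(best, A[i - k][j] - gamma(k))
            A[i][j] = best
    return A[m][n]

# The x = ABAC, y = BC example: match 1, mismatch -eps,
# gamma(1) = 2, gamma(2) = 2 - eps. The hypothetical gamma below extends
# these two values linearly, so its increments are non-increasing.
eps = 0.1
g = lambda k: 2.0 - (k - 1) * eps
print(align_general_gap("ABAC", "BC", g, match=1, mismatch=-eps))  # -1.0
```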

Note that in coming up with the new update rule for A(i, j), we did not assume anything about the function gamma. What if gamma does not penalize additional gaps less than the first gap? There was seemingly nothing in our analysis that relied on this property. But there is a catch: our new update rule does require that property. More formally, we expect gamma to satisfy gamma(x + 1) - gamma(x) <= gamma(x) - gamma(x - 1). Try to think about why this is needed by the algorithm. The book provides another algorithm where this condition is not required.

References

Setubal, J., Meidanis, J., Introduction to Computational Molecular Biology, Chapter 3.

Gusfield, D., Algorithms on Strings, Trees, and Sequences, Chapter 2.