Pairwise alignment, Gunnar Klau, November 9, 2005, 16:

Similar documents
2 Pairwise alignment. 2.1 References. 2.2 Importance of sequence alignment. Introduction to the pairwise sequence alignment problem.

Lecture 5,6 Local sequence alignment

Pairwise sequence alignment

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

Lecture 2: Pairwise Alignment. CG Ron Shamir

Lecture 4: September 19

Algorithms for biological sequence Comparison and Alignment

8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011

8 Grundlagen der Bioinformatik, SS 09, D. Huson, April 28, 2009

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Local Alignment: Smith-Waterman algorithm

Sequence analysis and Genomics

Analysis and Design of Algorithms Dynamic Programming

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Evolution. CT Amemiya et al. Nature 496, (2013) doi: /nature12027

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Lecture 5: September Time Complexity Analysis of Local Alignment

Bio nformatics. Lecture 3. Saad Mneimneh

Sequence Comparison. mouse human

Pairwise alignment. 2.1 Introduction GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL

Global alignments - review

Computational Biology Lecture 5: Time speedup, General gap penalty function Saad Mneimneh

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Dynamic Programming II Date: 10/12/17

Aside: Golden Ratio. Golden Ratio: A universal law. Golden ratio φ = lim n = 1+ b n = a n 1. a n+1 = a n + b n, a n+b n a n

Algorithms in Bioinformatics

Pairwise & Multiple sequence alignments

Bioinformatics and BLAST

Algorithms in Bioinformatics: A Practical Introduction. Sequence Similarity

CS 580: Algorithm Design and Analysis

String Matching Problem

CSE 202 Dynamic Programming II

Sequence Alignment (chapter 6)

Minimum Edit Distance. Defini'on of Minimum Edit Distance

CS473 - Algorithms I

6.6 Sequence Alignment

A Simple Linear Space Algorithm for Computing a Longest Common Increasing Subsequence

Lecture 1, 31/10/2001: Introduction to sequence alignment. The Needleman-Wunsch algorithm for global sequence alignment: description and properties

Algorithm Design and Analysis

Sequence comparison: Score matrices

Algorithms in Bioinformatics I, ZBIT, Uni Tübingen, Daniel Huson, WS 2003/4 1

Pair Hidden Markov Models

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Pairwise alignment using HMMs

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

1.5 Sequence alignment

Single alignment: Substitution Matrix. 16 march 2017

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Practical considerations of working with sequencing data

Computer Science & Engineering 423/823 Design and Analysis of Algorithms

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Computer Science & Engineering 423/823 Design and Analysis of Algorithms

In-Depth Assessment of Local Sequence Alignment

HMM : Viterbi algorithm - a toy example

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

HMM : Viterbi algorithm - a toy example

Dynamic Programming. Weighted Interval Scheduling. Algorithmic Paradigms. Dynamic Programming

Introduction to Computation & Pairwise Alignment

EECS730: Introduction to Bioinformatics

6. DYNAMIC PROGRAMMING II

Motivating the need for optimal sequence alignments...

Welcome to CS262: Computational Genomics

Chapter 1 Divide and Conquer Algorithm Theory WS 2016/17 Fabian Kuhn

Dynamic programming in less time and space

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Midterm Exam. CS 3110: Design and Analysis of Algorithms. June 20, Group 1 Group 2 Group 3

Dynamic Programming( Weighted Interval Scheduling)

Unsupervised Vocabulary Induction

Copyright 2000 N. AYDIN. All rights reserved. 1

The Simplex Algorithm

Objec&ves. Review. Dynamic Programming. What is the knapsack problem? What is our solu&on? Ø Review Knapsack Ø Sequence Alignment 3/28/18

Algebraic Dynamic Programming. Dynamic Programming, Old Country Style

Introduction to Sequence Alignment. Manpreet S. Katari

Lecture 2: Divide and conquer and Dynamic programming

CPSC 320 (Intermediate Algorithm Design and Analysis). Summer Instructor: Dr. Lior Malka Final Examination, July 24th, 2009

An Introduction to Sequence Similarity ( Homology ) Searching

Sequence Alignment. Johannes Starlinger

Note that M i,j depends on two entries in row (i 1). If we proceed in a row major order, these two entries will be available when we are ready to comp

Multiple Sequence Alignment (MAS)

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

Lecture 13: Dynamic Programming Part 2 10:00 AM, Feb 23, 2018

On Pattern Matching With Swaps

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Sequence Comparison with Mixed Convex and Concave Costs

Chapter 6. Dynamic Programming. Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.

arxiv: v1 [cs.ds] 9 Apr 2018

Divide and Conquer. Arash Rafiey. 27 October, 2016

Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético

Computational Molecular Biology

Partha Sarathi Mandal

Moreover, the circular logic

1 Closest Pair of Points on the Plane

Practical Bioinformatics

Divide and Conquer. Maximum/minimum. Median finding. CS125 Lecture 4 Fall 2016

Dynamic Programming. Cormen et. al. IV 15

On the Monotonicity of the String Correction Factor for Words with Mismatches

Transcription:

Pairwise alignment, Gunnar Klau, November 9, 2005, 16:36 2012 2.1 Growth rates For biological sequence analysis, we prefer algorithms that have time and space requirements that are linear in the length of the sequences. Quadratic time (O(n 2 )) algorithms are a little slow, but feasible. Cubic time (O(n 3 )) algorithms are feasible only for very short sequences. 800 700 10*x x*x 0.1*x*x*x 600 500 400 300 200 100 0 0 5 10 15 20 2.2 Computing the score in linear space If we are only interested in the best score, but not the actual alignment, then we can reduce the space requirement to linear, since we only need values from two columns of the matrix F at a time. We can even come along with only one column and a few extra variables.

Pairwise alignment, Gunnar Klau, November 9, 2005, 16:36 2013 When we have reached the cell at the lower right corner, we know the score of an optimal global alignment. The algorithm still requires in O(mn) time to run (where m and n are the lengths of the two sequences), but it uses only O(min{m, n}) memory. But how can we obtain the actual alignment, not just its score? 2.3 Divide and conquer The following observation helps. Assume this is an optimal global alignment: GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL ++ ++++H+ KV + +A ++ +L+ L+++H+ K NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG Then we can split this alignment at any position and will obtain two optimal global alignments for both sides. (Really? Exercise 3.2.) GSAQVKGHGKKVADAL TNAVAHV---D--DMPNALSALSDLHAHKL ++ ++++H+ KV + +A ++ +L+ L+++H+ K NNPELQAHAGKVFKLV YEAAIQLQVTGVVVTDATLKNLGSVHVSKG So the idea is to split the alignment problem recursively into smaller pieces, until they become trivial. But... since we do not have a global alignment yet, we also do not know where to split the sequences. At least we have reduced the problem of finding an optimal alignment to another problem: finding an optimal split position. More generally, this is the idea of divide and conquer: Spend some effort to divide the problem into more manageable parts then combine the final solution out of the solutions for the subproblems. So how can we find the optimal split position in linear space?

Pairwise alignment, Gunnar Klau, November 9, 2005, 16:36 2014 2.4 Alignment in linear space We split the DP matrix at a given column u. (Usually, u := x /2 ). Then, using dynamic programming, we can find in linear space the row v where the global alignment backtrace crosses the u-th column of the DP matrix. Once we know that the alignment passes through the cell (u, v) of the DP matrix, we can split the problem of finding the alignment into an upper left and a lower right part. 0 (given:) u m (found:) 0 v n y 1 F (0, 0) x 1 x u 1 x u x u+1 x m. y v 1 y v F (u, v) y v+1. y n F (m, n) When we have found the global alignments for both parts, we concatenate them. In the setting of the NW algorithm, we could find v by tracing back from (m, n). However, a backtracking matrix like the T from the NW algorithm would require Ω(mn) space. Instead, we maintain an array of numbers (= row indices) that point directly into the u-th column: For i > u and any j, we set r(i, j) to a row where a backtracking path from (i, j) to (0, 0) goes through the u-th column. (We said... a row where a path... because the backtracking path need not be unique.) Again, we can compute the r(i, j) values in linear space (using the same trick we have seen for the score) because in the end we are only interested in one particular value: r(m, n). We stick to the notation r(, ) (with two arguments) for notational convenience. 2.5 Finding the row where to cut The values r(i, j), i > u, can be computed by the following recursion: Let (i, j ) be the cell from which F (i, j) is obtained. Then { j if i r(i, j) := = u r(i, j ) else

Pairwise alignment, Gunnar Klau, November 9, 2005, 16:36 2015 u u r(i, j ) r(i, j ) (i, j ) (i, j ) (i, j) (i, j) Since r(i, j) is computed from values in the previous and current column, this can again be done in O(n) space. The final value is v = (m, n). Now we know (u, v). We solve both subalignments by recursive calls of the same algorithm. 2.6 Time complexity of linear space alignment Since we do not construct the backtrace at once, we need to recompute many F (i, j) values in recursive calls. However the total area (number of entries) of F which is recomputed in the next stage of subdivision is only about half as much; and this goes on for its subproblems. Theorem. The total running time of the linear space alignment method is O(nm) for two sequences of length n and m, respectively. Proof. Let t DC (A) be the time needed to compute an optimal alignment with the divide-and-conquer method for a DP table of area A, let t NW (A) be the time needed for the same task using the original Needleman-Wunsch method, and let c be the constant time we need to concatenate two alignments. Since each recursive step eliminates one half of the table we have ( ) A t DC (A) = t NW (A) + t DC + c 2 ( ) ( ) A A = t NW (A) + t NW + t DC + 2c 2 4 ( ) ( ) ( ) A A A = t NW (A) + t NW + t NW + t DC + 3c 2 4 8. = = log A i=0 log A i=0 t NW ( A 2 i ) + c log A t NW (A) 2 i + c log A log A 1 = t NW (A) 2 i + c log A i=0 2t NW (A) + c log A Therefore the total running time is (still) O(mn).

Pairwise alignment, Gunnar Klau, November 9, 2005, 16:36 2016 2.7 Arbitrary gap costs A way to deal with arbitrary gap costs is as follows. Assume a gap of length g has cost γ(g). Then we can replace F (i 1, j 1) + s(x i, y i ) F (i, j) := max F (i 1, j) d with F (i 1, j 1) + s(x i, y i ) F (i, j) := max F (i g, j) γ(g), F (i, j g) γ(g), g = 1,..., i g = 1,..., j However this increases the running time from O(mn) to O(mn max{m, n}), since we have much more dependencies in the DP recurrence: linear gap cost arbitrary gap cost dependencies dependencies Note: For affine gap costs there is a clever way to do this in O(mn). The idea is to use two more arrays and have a look at the second last column of an optimal alignment. 2.8 Local alignment: Motivation Two genes in different species may be similar over short conserved regions and dissimilar over remaining regions. Example: Homeobox genes 1 have a short region called the homeodomain that is highly conserved between species. A global alignment would not find the homeodomain because it would try to align the entire sequence 2.9 Local alignment: Motivation DNA toy example: Global Alignment: --T--CC-C-AGT--TATGT-CAGGGGACACG-A-GCATGCAGA-GAC AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG-T-CAGAT--C Local Alignment better suited to find conserved segment: tcccagttatgtcaggggacacgagcatgcagagac aattgccgccgtcgttttcagcagttatgtcagatc 1 regulate embryonic development

Pairwise alignment, Gunnar Klau, November 9, 2005, 16:36 2017 2.10 Local alignment: Smith-Waterman algorithm (Temple Smith and Mike Waterman, 1981) A local alignment of two sequences x and y is a global alignment of a substring x of x and a substring y of y. Why should we not like to include some prefixes (x 1,..., x i ) and (y 1,..., y j ) in a local alignment? The only reason is that the best global alignment of (x 1,..., x i ) and (y 1,..., y j ) has a negative score. global alignment overlap alignment local alignment + + + Therefore we modify the recurrence formula for F (i, j) such that we can start a local alignment at any place in the DP matrix. This means, we replace F (i 1, j 1) + s(x i, y i ) F (i, j) := max F (i 1, j) d with 0 F (i 1, j 1) + s(x F (i, j) := max i, y i ) F (i 1, j) d Then any (i, j) can play the role that was reserved for (0, 0) in the Needleman-Wunsch algorithm. Likewise, we start the backtrace at a position (k, l) that maximizes F (k, l), not necessarily at (m, n). This is written as (k, l) := arg max{f (k, l) k = 0,..., m, l = 0,..., n} This way, the local alignment can stop before it reaches the end of the sequences. The backtrace stops when we reach a position (i, j) such that F (i, j) = 0. 2.11 Smith-Waterman algorithm Input: two sequences x and y Output: optimal local alignment and score α Initialization: Set F (0, 0) := 0 Set F (i, 0) := 0 and T (i, 0) := (i 1, 0) for all i = 1, 2,..., m Set F (0, j) := 0 and T (0, j) := (0, j 1) for all j = 1, 2,..., n Recurrence: for i = 1, 2,..., m do: for j = 1, 2,..., n do: Set F (i, j) := max 0 F (i 1, j 1) + s(x i, y i ) F (i 1, j) d Set backtrace T (i, j) to the maximizing pair (i, j ) (encoded as {,, }) or let it be undefined in the first case Set (k, l) := arg max{f (k, l) k = 0,..., m, l = 0,..., n} The best score is α := F (k, l) Traceback: Set (i, j) := (k, l) repeat

Pairwise alignment, Gunnar Klau, November 9, 2005, 16:36 2018 if T (i, j) = (i 1, j 1) print ( x i else if T (i, j) = (i 1, j) print ( x i ) ( else print ) y j Set (i, j) := T (i, j) until F (i, j) = 0 y j ) 2.12 An example of local alignment A local alignment matrix using d = 8 and BLOSUM50 scores: Durbin et al. (1998) 2.13 Linear-space alignment, revisited Exercise 3.3: Show that the divide-and-conquer approach that was explained for global alignment can also be applied to obtain local alignments in linear space.