Pairwise Alignment Guan-Shieng Huang shieng@ncnu.edu.tw Dept. of CSIE, NCNU Pairwise Alignment p.1/55
Approach
1. Problem definition
2. Computational method (algorithms)
3. Complexity and performance

Motivations
- Reconstructing long sequences of DNA from overlapping sequence fragments
- Determining physical and genetic maps from probe data under various experimental protocols
- Database searching

Motivations (continued)
- Comparing two or more sequences for similarities
- Protein structure prediction (building profiles)
- Comparing the same gene sequenced by two different labs

Similarity & Difference
1. Common Ancestor Assumption
2. Mutation:
   (a) substitution (transition, transversion)
   (b) deletion
   (c) insertion
We use indel to refer to a deletion or an insertion.

What is the difference between acctga and agcta?
acctga
agctga   (substitution: the first c becomes g)
agct-a   (deletion of g)

Key Issues
1. the notion of similarity/difference
2. the scoring system used to rank alignments
3. the algorithm used to find an optimal-scoring alignment
4. the statistical method used to evaluate the significance of an alignment score

Measure similarity by
1. substitution: $-1$
2. indel: $-2$
3. match: $+1$
Edit Distance
 a  c  c  t  g  a
 a  g  c  t  -  a
+1 -1 +1 +1 -2 +1  = 1

 a  c  c  t  g  a
 a  -  g  c  t  a
+1 -2 -1 -1 -1 +1  = -3

 a  c  c  t  g  a
 -  a  g  c  t  a
-2 -1 -1 -1 -1 +1  = -5

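The column-by-column scoring used in these examples can be sketched in a few lines of Python. The function name `score_alignment` and its parameter names are illustrative assumptions; the scores (+1 match, -1 substitution, -2 indel) are the ones stated on the slide.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, indel=-2):
    """Sum the column scores of two aligned rows ('-' marks a space)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += indel      # one character aligned with a space
        elif a == b:
            total += match
        else:
            total += mismatch   # substitution
    return total


# The three alignments of acctga against agcta shown on the slides.
for bottom in ("agct-a", "a-gcta", "-agcta"):
    print(bottom, score_alignment("acctga", bottom))
```

Running it gives the totals 1, -3 and -5, so the first alignment is the best of the three.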
x: $x_1 x_2 x_3 \ldots x_m$
y: $y_1 y_2 y_3 \ldots y_n$
Alphabet:
- $\Sigma = \{A, G, C, T\}$ for DNA sequences
- $\Sigma = \{A, G, C, U\}$ for RNA sequences
- $\Sigma = \{A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y\}$ for proteins

- $s(a, b)$: the score for substituting a by b
- $s(a, -)$: the score for deleting a
- $s(-, b)$: the score for inserting b

Nomenclature
BIOLOGY        COMPUTER SCIENCE
sequence       string, word
subsequence    substring (contiguous)
N/A            subsequence
N/A            exact matching
alignment      inexact matching

Algorithm for Pairwise Alignment
To find the best alignment (the one with the highest score), we can use
- Brute-force
- Dynamic programming

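Brute-force Algorithm
Try all possible alignments of x and y. Let $F(m, n)$ be the number of alignments of two sequences of lengths m and n:
$$F(m, n) = F(m-1, n) + F(m, n-1) + F(m-1, n-1)$$
Compare this with Pascal's rule for binomial coefficients:
$$\binom{k}{l} = \binom{k-1}{l} + \binom{k-1}{l-1}, \qquad \binom{m+n}{m} = \binom{m+n-1}{m-1} + \binom{m+n-1}{m}$$
$$C(m, n) = C(m-1, n) + C(m, n-1)$$
$$F(m, n) \ge C(m, n) = \binom{m+n}{m}, \qquad \binom{2n}{n} \approx \frac{2^{2n}}{\sqrt{\pi n}}.$$
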
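To get a feel for how quickly the number of alignments grows, the recurrence can be evaluated directly. This is a minimal sketch; the base case (exactly one alignment when either sequence is empty) and the function name `count_alignments` are assumptions not stated on the slide.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_alignments(m, n):
    """Number of alignments of sequences of lengths m and n."""
    if m == 0 or n == 0:
        return 1  # assumed base case: everything is aligned with spaces
    return (count_alignments(m - 1, n)
            + count_alignments(m, n - 1)
            + count_alignments(m - 1, n - 1))


# Compare with the binomial lower bound C(n, n) = (2n choose n).
for n in (1, 2, 3, 10, 20):
    print(n, count_alignments(n, n), comb(2 * n, n))
```

Even for n = 20 the count is far beyond what can be enumerated, which is the motivation for dynamic programming.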
Dynamic Programming Approach
$F(i, j)$: the score of the best alignment between $x_1 \ldots x_i$ and $y_1 \ldots y_j$.
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + 1, & x_i = y_j \text{ (match)} \\ F(i-1, j-1) - 1, & x_i \ne y_j \text{ (substitution)} \\ F(i-1, j) - 2, & \text{align } x_i \text{ with a gap} \\ F(i, j-1) - 2, & \text{align } y_j \text{ with a gap} \end{cases}$$

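The three ways the last column of an alignment of $x_1 \ldots x_i$ with $y_1 \ldots y_j$ can look:
1. $x_i$ aligned with $y_j$: score $F(i-1, j-1) + s(x_i, y_j)$
   x_1 x_2 ... x_{i-1}  x_i
   y_1 y_2 ... y_{j-1}  y_j
2. $x_i$ aligned with a gap: score $F(i-1, j) - d$
   x_1 x_2 ... x_{i-1}  x_i
   y_1 y_2 ... y_j       -
3. $y_j$ aligned with a gap: score $F(i, j-1) - d$
   x_1 x_2 ... x_i       -
   y_1 y_2 ... y_{j-1}  y_j
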
Alignment Graph
$F(i, j)$ is reached from three neighboring cells:
- from $F(i-1, j-1)$ with $+s(x_i, y_j)$
- from $F(i-1, j)$ with $-d$
- from $F(i, j-1)$ with $-d$
Initial values: $F(0, 0) = 0$, $F(0, j) = -jd$, $F(i, 0) = -id$.

Example
      -    a    c    c    t    g    a
 -    0   -2   -4   -6   -8  -10  -12
 a   -2    1   -1   -3   -5   -7   -9
 g   -4   -1    0   -2   -4   -4   -6
 c   -6   -3    0    1   -1   -3   -5
 t   -8   -5   -2   -1    2    0   -2
 a  -10   -7   -4   -3    0    1    1

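Example (backtrace)
Starting from the bottom-right cell of the matrix (value 1), repeatedly step to the neighbor that produced the current value until the top-left cell (value 0) is reached. Writing cells as (row, column), the path is
(5, 6) → (4, 5) → (4, 4) → (3, 3) → (2, 2) → (1, 1) → (0, 0).
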
The backtrace yields the optimal alignment:
a c c t g a
a g c t - a

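The fill and backtrace steps of the example can be sketched as follows. This is one possible implementation, not the author's code: the function name `needleman_wunsch`, the layout (x across the columns, y down the rows, as in the example matrix) and the tie-breaking order in the backtrace (diagonal first, then left, then up) are assumptions.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment with linear gap penalty d; x indexes columns, y rows."""
    rows, cols = len(y) + 1, len(x) + 1
    F = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        F[0][j] = -j * d
    for i in range(1, rows):
        F[i][0] = -i * d

    def s(a, b):
        return match if a == b else mismatch

    # Fill the matrix row by row using the three-way recurrence.
    for i in range(1, rows):
        for j in range(1, cols):
            F[i][j] = max(F[i - 1][j - 1] + s(x[j - 1], y[i - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)

    # Backtrace from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[j - 1], y[i - 1]):
            top.append(x[j - 1]); bottom.append(y[i - 1]); i -= 1; j -= 1
        elif j > 0 and F[i][j] == F[i][j - 1] - d:
            top.append(x[j - 1]); bottom.append("-"); j -= 1
        else:
            top.append("-"); bottom.append(y[i - 1]); i -= 1
    return F, "".join(reversed(top)), "".join(reversed(bottom))


F, top, bottom = needleman_wunsch("acctga", "agcta")
print(F[-1][-1])
print(top)
print(bottom)
```

For acctga and agcta this reproduces the matrix of the example, the score 1, and the alignment acctga / agct-a. Other tie-breaking orders may return a different alignment with the same score.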
Complexity
1. time = $O(mn)$
2. space = $O(mn)$ if we need to recover the optimal alignment itself
Space becomes the more serious problem when m and n are very large.

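Linear-space Alignment Algorithm
$B(i, j)$: the best alignment score of the suffixes $x_{m-i+1} \ldots x_m$ and $y_{n-j+1} \ldots y_n$
$F(i, j)$: forward matrix, $B(i, j)$: backward matrix
Then
$$F(m, n) = \max_{0 \le k \le n} \left\{ F\left(\tfrac{m}{2}, k\right) + B\left(\tfrac{m}{2}, n-k\right) \right\}.$$
[Figure: the alignment matrix split into two halves of $\frac{m}{2}$ rows each, meeting at column k, with $n - k$ columns remaining.]
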
Algorithm
1. Compute F while saving the $\frac{m}{2}$-th row.
2. Compute B while saving the $\frac{m}{2}$-th row.
3. Find the column k such that $F\left(\frac{m}{2}, k\right) + B\left(\frac{m}{2}, n-k\right) = F(m, n)$.
4. Recursively partition the problem into two sub-problems:
   (a) Find the path from $(0, 0)$ to $\left(\frac{m}{2}, k\right)$.
   (b) Find the path from $\left(\frac{m}{2}, k\right)$ to $(m, n)$.

Example
      -    a    c    c    t    g    a
 -    0   -2   -4   -6   -8  -10  -12
 a   -2    1   -1   -3   -5   -7   -9
 g   -4   -1    0   -2   -4   -4   -6
 c   -6   -3    0    1   -1   -3   -5
 t   -8   -5   -2   -1    2    0   -2
 a  -10   -7   -4   -3    0    1    1
($F(i, j)$ matrix)

      -    a    g    t    c    c    a
 -    0   -2   -4   -6   -8  -10  -12
 a   -2    1   -1   -3   -5   -7   -9
 t   -4   -1    0    0   -2   -4   -6
 c   -6   -3   -2   -1    1   -1   -3
 g   -8   -5   -2   -3   -1    0   -2
 a  -10   -7   -4   -3   -3   -2    1
($B(i, j)$ matrix, computed on the reversed sequences)

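Row 2 of F (row g):  -4  -1   0  -2  -4  -4  -6
Row 3 of B (row c):  -6  -3  -2  -1   1  -1  -3
$$F\left(\tfrac{m}{2}, k\right) + B\left(\tfrac{m}{2}, n-k\right) = F(m, n).$$
In this case, $F(m, n) = 1$ and $k = 2$: the forward entry at column 2 is 0 and the backward entry at column $6 - 2 = 4$ is 1. Hence, the best alignment of (acctga, agcta) is the concatenation of the best alignments of (ac, ag) and (ctga, cta).
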
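The search for the split column can be sketched with two score-only passes that each keep a single row in memory. This shows only one level of the recursion, not the full linear-space algorithm; the names `last_row` and `split_point`, and the choice of the integer-division middle row, are assumptions made for illustration.

```python
def last_row(rows_seq, cols_seq, match=1, mismatch=-1, d=2):
    """Last row of the global alignment matrix, keeping one row at a time."""
    prev = [-j * d for j in range(len(cols_seq) + 1)]
    for i, r in enumerate(rows_seq, start=1):
        cur = [-i * d]
        for j, c in enumerate(cols_seq, start=1):
            s = match if r == c else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev


def split_point(rows_seq, cols_seq):
    """Return (middle row, best column k, total score) for one split."""
    mid = len(rows_seq) // 2
    n = len(cols_seq)
    forward = last_row(rows_seq[:mid], cols_seq)
    # The backward pass runs on the reversed remaining rows and reversed columns.
    backward = last_row(rows_seq[mid:][::-1], cols_seq[::-1])
    k = max(range(n + 1), key=lambda c: forward[c] + backward[n - c])
    return mid, k, forward[k] + backward[n - k]


print(split_point("agcta", "acctga"))
```

With agcta down the rows and acctga across the columns, as in the example matrices, the split lands at row 2 and column 2 with total score 1. The two sub-problems (ag, ac) and (cta, ctga) would then be solved recursively in the same way.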
Analysis of Complexity
Clearly, the required space is $O(\min(m, n))$.
For the time complexity, let $T(m, n)$ be the time bound of the algorithm. Then
$$T(m, n) = T\left(\tfrac{m}{2}, k\right) + T\left(\tfrac{m}{2}, n-k\right) + O(mn)$$
for some k.

$$T(m, n) = T\left(\tfrac{m}{2}, k\right) + T\left(\tfrac{m}{2}, n-k\right) + cmn$$
for some k. Suppose $T(m, n) = \alpha mn$; then the right-hand side becomes
$$\alpha \frac{m}{2} k + \alpha \frac{m}{2} (n-k) + cmn = \frac{\alpha mn}{2} + cmn.$$
Let $\alpha = 2c$; then it equals the left-hand side. Hence $T(m, n) = 2cmn = O(mn)$.

For more information on linear-space algorithms in pairwise alignment, see Chao, K. M., Hardison, R. C., and Miller, W. 1994. Recent developments in linear-space alignment methods: a survey. Journal of Computational Biology, 1:271–291.

Revisiting Dynamic Programming
- Principle of optimality
- Recurrence
- Bottom up

Substitution matrices
Suppose we have two models:
1. random model
2. match model
Consider any two aligned sequences
$x = x_1 x_2 \ldots x_n$
$y = y_1 y_2 \ldots y_n$
where $x_i$ is aligned with $y_i$.

In the random model R, we suppose each letter a occurs independently with some frequency $q_a$. Hence,
$$\Pr(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}.$$
In the match model M, letters a and b are aligned with joint probability $p_{ab}$, on the supposition that residues a and b have been derived independently from some unknown residue c. Hence,
$$\Pr(x, y \mid M) = \prod_i p_{x_i y_i}.$$

Define the odds ratio as
$$\frac{\Pr(x, y \mid M)}{\Pr(x, y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_j q_{y_j}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}.$$
The log-odds ratio:
$$S = \sum_i s(x_i, y_i), \quad \text{where } s(a, b) = \log\left(\frac{p_{ab}}{q_a q_b}\right).$$
$S > 0$ means that x, y are more likely to be an instance of the match model (maximum likelihood).
Substitution matrices of this kind for proteins: BLOSUM & PAM.

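A small numerical illustration of the log-odds score may help. The two-letter alphabet and all probabilities below are made up for the example (they are not taken from PAM, BLOSUM, or any real data); the joint probabilities `p` were chosen so that their marginals equal the background frequencies `q`.

```python
import math

# Hypothetical background frequencies q_a (assumed values).
q = {"A": 0.6, "B": 0.4}
# Hypothetical joint probabilities p_ab under the match model (assumed values).
p = {("A", "A"): 0.5, ("A", "B"): 0.1,
     ("B", "A"): 0.1, ("B", "B"): 0.3}


def s(a, b):
    """Log-odds substitution score log(p_ab / (q_a * q_b))."""
    return math.log(p[(a, b)] / (q[a] * q[b]))


def log_odds_score(x, y):
    """S = sum of s(x_i, y_i) over the aligned, ungapped positions."""
    return sum(s(a, b) for a, b in zip(x, y))


for pair in p:
    print(pair, round(s(*pair), 3))
print(log_odds_score("AABB", "AABB"), log_odds_score("AABB", "BBAA"))
```

Identical pairs get positive scores and mismatched pairs negative ones, so an alignment of a sequence with itself scores above zero while the fully mismatched alignment scores below zero.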
PAM matrices
1. Dayhoff, Schwartz, Orcutt (1978)
2. The most widely used matrix is PAM250.

BLOSUM Matrices
1. Henikoff & Henikoff (1992)
2. Derived from a set of aligned, ungapped regions from protein families called the BLOCKS database.
3. BLOSUM62 is the standard for ungapped matching.
4. BLOSUM50 is better for alignment with gaps.

BLOSUM50
[Figure: the BLOSUM50 substitution matrix]

Pairwise Alignment Problems
1. Global alignment (Needleman & Wunsch, 1970)
2. Local alignment (Smith & Waterman, 1981)
3. End-space free alignment
4. Gap penalty
The version in current use is due to Gotoh (1982).

Global Alignment
Given two sequences x and y, what is the maximum similarity between them?
Find a best alignment.

Local Alignment
Given two sequences x and y, what is the maximum similarity between a subsequence (contiguous substring) of x and a subsequence of y?
Find the most similar subsequences.

End-space Free Alignment
[Figure: two alternative alignment configurations]

Global Alignment
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j), \\ F(i-1, j) - d, \\ F(i, j-1) - d, \end{cases}$$
with initial values $F(0, 0) = 0$, $F(0, j) = -jd$, $F(i, 0) = -id$.
$F(m, n)$ is the score.

Example
[Figure: global alignment example]

Local Alignment
Motivation:
- Ignore stretches of non-coding DNA.
- Protein domains

Local Alignment
$$F(i, j) = \max \begin{cases} 0, \\ F(i-1, j-1) + s(x_i, y_j), \\ F(i-1, j) - d, \\ F(i, j-1) - d, \end{cases}$$
with initial values $F(0, 0) = F(0, j) = F(i, 0) = 0$.
The highest value of $F(i, j)$ over the whole matrix is the score.

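A minimal sketch of the local alignment recurrence follows. The function name `smith_waterman`, the match/mismatch scores (+1/-1 with d = 2, reused from the earlier global example) and the backtrace tie-breaking order are assumptions made for illustration.

```python
def smith_waterman(x, y, match=1, mismatch=-1, d=2):
    """Local alignment score and one best local alignment (x: columns, y: rows)."""
    rows, cols = len(y) + 1, len(x) + 1
    F = [[0] * cols for _ in range(rows)]   # first row and column stay 0
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[j - 1] == y[i - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_cell = F[i][j], (i, j)

    # Backtrace from the highest-scoring cell until a 0 entry is reached.
    top, bottom = [], []
    i, j = best_cell
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[j - 1] == y[i - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            top.append(x[j - 1]); bottom.append(y[i - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i][j - 1] - d:
            top.append(x[j - 1]); bottom.append("-"); j -= 1
        else:
            top.append("-"); bottom.append(y[i - 1]); i -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("acctga", "agcta"))
```

For acctga and agcta under these scores, the best local alignment is the shared substring ct with score 2. With a gap penalty this large relative to the match score, local alignments tend to be short and ungapped.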
Example
[Figure: local alignment example]

Ends-free Alignment
Motivation: shotgun sequence assembly

Ends-free Alignment
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j), \\ F(i-1, j) - d, \\ F(i, j-1) - d, \end{cases}$$
with initial values $F(0, 0) = F(0, j) = F(i, 0) = 0$.
The highest value of $F(i, j)$ in the last column $F(i, n)$ or the last row $F(m, j)$ is the score.

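The ends-free variant differs from global alignment only in its initial values and in where the score is read off. A score-only sketch, with rows indexed by x and columns by y as in the recurrence; the function name `ends_free_score`, the +1/-1 scores with d = 2, and the test sequences are assumptions for illustration.

```python
def ends_free_score(x, y, match=1, mismatch=-1, d=2):
    """Best ends-free score and the cell (i, j) where it is attained."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]   # leading spaces are free
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
    # Trailing spaces are free: take the best entry of the last row or column.
    candidates = [(F[m][j], (m, j)) for j in range(n + 1)]
    candidates += [(F[i][n], (i, n)) for i in range(m + 1)]
    return max(candidates)


# The suffix acg of the first fragment overlaps the prefix acg of the second.
print(ends_free_score("ttacg", "acgcc"))
```

Here the best score is 3, found in the last row at column 3: the unmatched tt at the start of the first fragment and cc at the end of the second cost nothing, which is the behavior wanted when detecting overlaps in shotgun assembly.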
Example
[Figure: ends-free alignment example]

Complexity
All of the above algorithms can be implemented in time $O(mn)$ and in space $O(m + n)$.

Gap Penalty
A gap is any maximal consecutive run of spaces in an alignment.
The length of a gap is the number of indel operations in it.
a t t c - - g a - t g g a c c
a - - c g t g a t t - - - c c
This alignment has 4 gaps containing 8 spaces in total.

Motivation:
- Insertion or deletion of an entire subsequence often occurs as a single mutation event.
- Two protein sequences might be relatively similar over several intervals.
- cDNA: the complement of mRNA

Gap Penalty Models
1. constant gap penalty model: $W_g \times \#\text{gaps}$
2. affine gap penalty model ($y = ax + b$): $W_g \times \#\text{gaps} + W_s \times \#\text{spaces}$
3. convex gap penalty model: $W_g + \log(q)$, where q is the length of the gap
4. arbitrary gap penalty model
$W_g$: gap-open penalty, $W_s$: gap-extension penalty

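The constant and affine models can be illustrated by counting gaps and spaces in a given alignment. The helper names and the penalty weights used below ($W_g = 5$, $W_s = 1$) are arbitrary assumptions; the alignment is the attc--ga-tggacc example from the gap penalty slide.

```python
import re


def gaps_and_spaces(*rows):
    """Count gaps (maximal runs of '-') and spaces over all aligned rows."""
    runs = [run for row in rows for run in re.findall(r"-+", row)]
    return len(runs), sum(len(run) for run in runs)


def constant_penalty(rows, w_g):
    n_gaps, _ = gaps_and_spaces(*rows)
    return w_g * n_gaps


def affine_penalty(rows, w_g, w_s):
    n_gaps, n_spaces = gaps_and_spaces(*rows)
    return w_g * n_gaps + w_s * n_spaces


alignment = ("attc--ga-tggacc", "a--cgtgatt---cc")
print(gaps_and_spaces(*alignment))
print(constant_penalty(alignment, w_g=5))
print(affine_penalty(alignment, w_g=5, w_s=1))
```

The alignment has 4 gaps and 8 spaces, so with these assumed weights the constant model charges 20 and the affine model charges 28. Under the affine model a single long gap is cheaper than several short gaps with the same total number of spaces.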
Complexity
1. constant gap penalty model: time = $O(mn)$
2. affine gap penalty model: time = $O(mn)$
3. convex gap penalty model: time = $O(mn \lg(m + n))$
4. arbitrary gap penalty model: time = $O(mn(m + n))$

Conclusion