Pairwise Alignment Guan-Shieng Huang shieng@ncnu.edu.tw Dept. of CSIE, NCNU Pairwise Alignment p.1/55

Approach
1. Problem definition
2. Computational method (algorithms)
3. Complexity and performance

Motivations
- Reconstructing long sequences of DNA from overlapping sequence fragments
- Determining physical and genetic maps from probe data under various experimental protocols
- Database searching
- Comparing two or more sequences for similarities
- Protein structure prediction (building profiles)
- Comparing the same gene sequenced by two different labs

Similarity & Difference
1. Common Ancestor Assumption
2. Mutation:
   (a) substitution (transition, transversion)
   (b) deletion
   (c) insertion
We use "indel" to refer to a deletion or an insertion.

What is the difference between acctga and agcta?
acctga
agctga   (substitution: c to g)
agct-a   (deletion: g)

Key Issues
1. the notion of similarity/difference
2. the scoring system used to rank alignments
3. the algorithm used to find an optimal-scoring alignment
4. the statistical method used to evaluate the significance of an alignment score

Measure similarity by
1. substitution: -1
2. indel: -2
3. match: +1

Edit Distance

 a  c  c  t  g  a
 a  g  c  t  -  a
+1 -1 +1 +1 -2 +1  =  1

 a  c  c  t  g  a
 a  -  g  c  t  a
+1 -2 -1 -1 -1 +1  = -3

 a  c  c  t  g  a
 -  a  g  c  t  a
-2 -1 -1 -1 -1 +1  = -5
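The column-by-column scoring above can be checked with a few lines of Python. This is only a sketch: the function name `score_alignment` and its parameter names are illustrative and not part of the slides, and the scores (+1, -1, -2) are the ones listed above.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, indel=-2):
    """Sum the column scores of two equal-length gapped rows ('-' is a space)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two spaces")
        if a == "-" or b == "-":
            total += indel        # insertion or deletion
        elif a == b:
            total += match
        else:
            total += mismatch     # substitution
    return total


for bottom in ("agct-a", "a-gcta", "-agcta"):
    print(bottom, score_alignment("acctga", bottom))
```

The three calls reproduce the three alignments of acctga and agcta scored above.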

x: $x_1 x_2 x_3 \ldots x_m$
y: $y_1 y_2 y_3 \ldots y_n$
Alphabet:
$\Sigma = \{A, G, C, T\}$ for DNA sequences
$\Sigma = \{A, G, C, U\}$ for RNA sequences
$\Sigma = \{A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y\}$ for proteins

$s(a, b)$: the score for substituting $a$ by $b$
$s(a, -)$: the score for deleting $a$
$s(-, b)$: the score for inserting $b$

Nomenclature
BIOLOGY         COMPUTER SCIENCE
sequence        string, word
subsequence     substring (contiguous)
N/A             subsequence
N/A             exact matching
alignment       inexact matching

Algorithm for Pairwise Alignment
To find the best alignment (the one with the highest score) through
- Brute force
- Dynamic programming

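Brute-force Algorithm
Try all possible alignments of x and y. The number $F(m, n)$ of alignments of sequences of lengths $m$ and $n$ satisfies
$F(m, n) = F(m-1, n) + F(m, n-1) + F(m-1, n-1)$.
By Pascal's rule,
$\binom{k}{l} = \binom{k-1}{l} + \binom{k-1}{l-1}$, so $\binom{m+n}{m} = \binom{m+n-1}{m} + \binom{m+n-1}{m-1}$.
Hence $C(m, n) = \binom{m+n}{m}$ satisfies
$C(m, n) = C(m-1, n) + C(m, n-1)$,
and therefore
$F(m, n) \ge C(m, n) = \binom{m+n}{m}$, with $\binom{2n}{n} \approx \frac{2^{2n}}{\sqrt{\pi n}}$.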
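To get a feel for how quickly the brute-force search space grows, the recurrence can be evaluated directly. The sketch below assumes the boundary values F(0, n) = F(m, 0) = 1 (a single all-gap alignment), which the slide does not state; `count_alignments` is an illustrative name.

```python
from functools import lru_cache
from math import comb, pi, sqrt


@lru_cache(maxsize=None)
def count_alignments(m, n):
    """Number of alignments F(m, n); boundary F(0, n) = F(m, 0) = 1 is assumed."""
    if m == 0 or n == 0:
        return 1
    return (count_alignments(m - 1, n)
            + count_alignments(m, n - 1)
            + count_alignments(m - 1, n - 1))


for n in (1, 2, 3, 10, 20):
    # F(n, n), the lower bound C(n, n) = binom(2n, n), and its approximation
    print(n, count_alignments(n, n), comb(2 * n, n), 4 ** n / sqrt(pi * n))
```

Even for two sequences of length 20 the count is far beyond what exhaustive enumeration can handle, which motivates dynamic programming.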

Dynamic Programming Approach
$F(i, j)$: the score of the best alignment between $x_1 \ldots x_i$ and $y_1 \ldots y_j$.
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + 1, & x_i = y_j \text{ (match)} \\ F(i-1, j-1) - 1, & x_i \ne y_j \text{ (substitution)} \\ F(i-1, j) - 2, & \text{align } x_i \text{ with a gap} \\ F(i, j-1) - 2, & \text{align } y_j \text{ with a gap} \end{cases}$$

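The last column of an alignment of $x_1 \ldots x_i$ and $y_1 \ldots y_j$ takes one of three forms:
1. $x_i$ aligned with $y_j$:
   $x_1 x_2 \ldots x_{i-1} \; x_i$
   $y_1 y_2 \ldots y_{j-1} \; y_j$
   score: $F(i-1, j-1) + s(x_i, y_j)$
2. $x_i$ aligned with a gap:
   $x_1 x_2 \ldots x_{i-1} \; x_i$
   $y_1 y_2 \ldots y_j \; -$
   score: $F(i-1, j) - d$
3. $y_j$ aligned with a gap:
   $x_1 x_2 \ldots x_i \; -$
   $y_1 y_2 \ldots y_{j-1} \; y_j$
   score: $F(i, j-1) - d$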

Alignment Graph
Each node $F(i, j)$ has three incoming edges:
- from $F(i-1, j-1)$ with weight $+s(x_i, y_j)$
- from $F(i-1, j)$ with weight $-d$
- from $F(i, j-1)$ with weight $-d$
Initial values: $F(0, 0) = 0$, $F(0, j) = -jd$, $F(i, 0) = -id$.
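The recurrence and initial values translate almost directly into a table-filling loop. The following is a minimal sketch, assuming the simple scores used earlier (match +1, substitution -1, d = 2); `nw_matrix` is an illustrative name, and `x` indexes the rows.

```python
def nw_matrix(x, y, match=1, mismatch=-1, d=2):
    """Fill the global-alignment matrix F; rows follow x, columns follow y."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -i * d                      # x_1..x_i against gaps only
    for j in range(1, n + 1):
        F[0][j] = -j * d                      # y_1..y_j against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # x_i aligned with y_j
                          F[i - 1][j] - d,       # x_i aligned with a gap
                          F[i][j - 1] - d)       # y_j aligned with a gap
    return F


for row in nw_matrix("agcta", "acctga"):
    print(" ".join(f"{v:4d}" for v in row))
```

Each cell depends only on its three neighbours, so the table can be filled row by row in O(mn) time.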

Example

       -    a    c    c    t    g    a
  -    0   -2   -4   -6   -8  -10  -12
  a   -2    1   -1   -3   -5   -7   -9
  g   -4   -1    0   -2   -4   -4   -6
  c   -6   -3    0    1   -1   -3   -5
  t   -8   -5   -2   -1    2    0   -2
  a  -10   -7   -4   -3    0    1    1

Backtrace: starting from the bottom-right cell (score 1) and following, at each cell, the predecessor that produced its value leads back to the top-left cell and yields the alignment

a c c t g a
a g c t - a
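The backtrace can be sketched as follows. This is an illustrative implementation, assuming the same scores as before (match +1, substitution -1, d = 2); when several predecessors tie it prefers the diagonal, then the vertical move, which is one arbitrary choice among equally good alignments. The name `global_align` is not from the slides.

```python
def global_align(x, y, match=1, mismatch=-1, d=2):
    """Return (score, aligned_x, aligned_y) for a global alignment of x and y."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -i * d
    for j in range(1, n + 1):
        F[0][j] = -j * d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)

    # Walk back from (m, n) to (0, 0), re-deriving which case produced each cell.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")      # x_i against a gap
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])      # y_j against a gap
            j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


print(global_align("acctga", "agcta"))
```

Storing the whole matrix is what makes the backtrace possible, and it is also the source of the O(mn) space cost.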

Complexity
1. time = O(mn)
2. space = O(mn) if we need to recover the optimal alignment
Space becomes the more serious bottleneck when m and n are very large.
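If only the score is needed, and not the alignment itself, two rows of the matrix suffice. A small sketch under the same assumed scores (match +1, substitution -1, d = 2); `nw_score` is an illustrative name, and the swap relies on the scoring being symmetric in the two sequences.

```python
def nw_score(x, y, match=1, mismatch=-1, d=2):
    """Global alignment score using O(min(m, n)) space (no traceback)."""
    if len(y) > len(x):
        x, y = y, x                        # keep rows as short as possible
    prev = [-j * d for j in range(len(y) + 1)]
    for i, a in enumerate(x, 1):
        cur = [-i * d]
        for j, b in enumerate(y, 1):
            s = match if a == b else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        prev = cur                         # the older row is no longer needed
    return prev[-1]


print(nw_score("acctga", "agcta"))
```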

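Linear-space Alignment Algorithm
$B(i, j)$: the best alignment score of the suffixes $x_{m-i+1} \ldots x_m$ and $y_{n-j+1} \ldots y_n$.
$F(i, j)$: forward matrix; $B(i, j)$: backward matrix.
Then
$F(m, n) = \max_{0 \le k \le n} \{ F(\lfloor m/2 \rfloor, k) + B(\lceil m/2 \rceil, n-k) \}$.
That is, an optimal path crosses row $\lfloor m/2 \rfloor$ at some column $k$; the forward part covers the first $k$ columns and the backward part the remaining $n-k$.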

Algorithm
1. Compute F while saving only the $\lfloor m/2 \rfloor$-th row.
2. Compute B while saving only the $\lceil m/2 \rceil$-th row.
3. Find the column $k^*$ such that $F(\lfloor m/2 \rfloor, k^*) + B(\lceil m/2 \rceil, n-k^*) = F(m, n)$.
4. Recursively partition the problem into two sub-problems:
   (a) Find the path from $(0, 0)$ to $(\lfloor m/2 \rfloor, k^*)$.
   (b) Find the path from $(\lfloor m/2 \rfloor, k^*)$ to $(m, n)$.

Example

       -    a    c    c    t    g    a
  -    0   -2   -4   -6   -8  -10  -12
  a   -2    1   -1   -3   -5   -7   -9
  g   -4   -1    0   -2   -4   -4   -6
  c   -6   -3    0    1   -1   -3   -5
  t   -8   -5   -2   -1    2    0   -2
  a  -10   -7   -4   -3    0    1    1

($F(i, j)$ matrix)

       -    a    g    t    c    c    a
  -    0   -2   -4   -6   -8  -10  -12
  a   -2    1   -1   -3   -5   -7   -9
  t   -4   -1    0    0   -2   -4   -6
  c   -6   -3   -2   -1    1   -1   -3
  g   -8   -5   -2   -3   -1    0   -2
  a  -10   -7   -4   -3   -3   -2    1

($B(i, j)$ matrix, computed on the reversed sequences)

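Row $\lfloor m/2 \rfloor = 2$ of F:   -4   -1    0   -2   -4   -4   -6
Row $\lceil m/2 \rceil = 3$ of B:   -6   -3   -2   -1    1   -1   -3

$F(\lfloor m/2 \rfloor, k^*) + B(\lceil m/2 \rceil, n-k^*) = F(m, n)$.
In this case, $F(m, n) = 1$ and $k^* = 2$, since $F(2, 2) + B(3, 4) = 0 + 1 = 1$. Hence, the best alignment of (acctga, agcta) is the concatenation of the best alignments of (ac, ag) and (ctga, cta).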
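The divide-and-conquer scheme (often credited to Hirschberg) can be sketched as below. This is an illustrative version under the simple scores used in the example (match +1, substitution -1, d = 2); the helper names `last_row` and `hirschberg`, and the direct handling of the single-character base case, are assumptions of this sketch rather than details given on the slides.

```python
def last_row(x, y, match=1, mismatch=-1, d=2):
    """Last row of the global-alignment matrix of x vs y, in O(len(y)) space."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i, a in enumerate(x, 1):
        cur = [-i * d]
        for j, b in enumerate(y, 1):
            s = match if a == b else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev


def hirschberg(x, y, match=1, mismatch=-1, d=2):
    """Return an optimal global alignment (aligned_x, aligned_y)."""
    if not x:
        return "-" * len(y), y
    if not y:
        return x, "-" * len(x)
    if len(x) == 1:
        # Pair x with its best partner in y, unless two gaps are cheaper.
        def s(j):
            return match if y[j] == x else mismatch
        k = max(range(len(y)), key=s)
        if s(k) >= -2 * d:
            return "-" * k + x + "-" * (len(y) - k - 1), y
        return x + "-" * len(y), "-" + y
    mid, n = len(x) // 2, len(y)
    f = last_row(x[:mid], y, match, mismatch, d)               # forward row
    b = last_row(x[mid:][::-1], y[::-1], match, mismatch, d)   # backward row
    k = max(range(n + 1), key=lambda c: f[c] + b[n - c])       # crossing column
    left = hirschberg(x[:mid], y[:k], match, mismatch, d)
    right = hirschberg(x[mid:], y[k:], match, mismatch, d)
    return left[0] + right[0], left[1] + right[1]


print(hirschberg("agcta", "acctga"))
```

On the example it splits at row 2 and column 2, exactly as in the worked case above, and never holds more than two rows of either matrix at a time.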

Analysis of Complexity
Clearly, the required space is O(min(m, n)).
For the time complexity, let $T(m, n)$ be the time bound of the algorithm. Then, for some $k$ and some constant $c$,
$T(m, n) = T(m/2, k) + T(m/2, n-k) + cmn$.
Suppose $T(m, n) = \alpha mn$. The right-hand side becomes
$\alpha \frac{m}{2} k + \alpha \frac{m}{2} (n-k) + cmn = \frac{\alpha mn}{2} + cmn$.
Letting $\alpha = 2c$ makes this equal to the left-hand side, so $T(m, n) = 2cmn = O(mn)$.

For more information on linear-space algorithms in pairwise alignment, see
Chao, K. M., Hardison, R. C., and Miller, W. 1994. Recent developments in linear-space alignment methods: a survey. Journal of Computational Biology, 1:271-291.

Revisiting Dynamic Programming
- Principle of optimality
- Recurrence
- Bottom-up computation

Substitution matrices
Suppose we have two models:
1. random model
2. match model
Consider any two aligned sequences
$x = x_1 x_2 \ldots x_n$
$y = y_1 y_2 \ldots y_n$
where $x_i$ is aligned with $y_i$.

In the random model $R$, we suppose each letter $a$ occurs independently with some frequency $q_a$. Hence,
$\Pr(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$.
In the match model $M$, letters $a$ and $b$ are aligned with joint probability $p_{ab}$. Suppose residues $a$ and $b$ have been derived independently from some unknown residue $c$. Hence,
$\Pr(x, y \mid M) = \prod_i p_{x_i y_i}$.

Define the odds ratio as
$\frac{\Pr(x, y \mid M)}{\Pr(x, y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_j q_{y_j}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$.
The log-odds ratio:
$S = \sum_i s(x_i, y_i)$, where $s(a, b) = \log\left(\frac{p_{ab}}{q_a q_b}\right)$.
$S > 0$ means that $x, y$ are more likely to be an instance of the match model (maximum likelihood).
BLOSUM & PAM matrices for proteins
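A toy calculation may make the log-odds score concrete. The two-letter alphabet and all probabilities below are invented purely for illustration; they are not taken from PAM, BLOSUM, or the slides.

```python
import math

# Hypothetical background frequencies q_a and joint probabilities p_ab.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}


def s(a, b):
    """Log-odds score of aligning a with b: log(p_ab / (q_a * q_b))."""
    return math.log(p[(a, b)] / (q[a] * q[b]))


def log_odds(x, y):
    """S = sum_i s(x_i, y_i) for two ungapped, equal-length sequences."""
    return sum(s(a, b) for a, b in zip(x, y))


print(s("A", "A"), s("A", "B"))   # positive for the likely pair, negative otherwise
print(log_odds("AAB", "AAB"), log_odds("AAB", "BBA"))
```

Pairs that co-occur more often than chance get positive scores, so summing them rewards alignments the match model explains better than the random model.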

PAM matrices
1. Dayhoff, Schwartz, Orcutt (1978)
2. The most widely used matrix is PAM250.


BLOSUM Matrices
1. Henikoff & Henikoff (1992)
2. Derived from a set of aligned, ungapped regions from protein families, called the BLOCKS database.
3. BLOSUM62 is the standard for ungapped matching.
4. BLOSUM50 is better for alignment with gaps.

BLOSUM50
[Figure: BLOSUM50 substitution matrix]

Pairwise Alignment Problems
1. Global alignment (Needleman & Wunsch, 1970)
2. Local alignment (Smith & Waterman, 1981)
3. End-space free alignment
4. Gap penalty
The version in current use is due to Gotoh (1982).

Global Alignment
Given two sequences x and y, what is the maximum similarity between them?
Find a best alignment.

Local Alignment
Given two sequences x and y, what is the maximum similarity between a subsequence of x and a subsequence of y?
Find the most similar subsequences.

End-space Free Alignment
[Figure: two alternative overlap configurations]

Global Alignment
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j), \\ F(i-1, j) - d, \\ F(i, j-1) - d, \end{cases}$$
with initial values $F(0, 0) = 0$, $F(0, j) = -jd$, $F(i, 0) = -id$.
$F(m, n)$ is the score.

Example
[Figure: global alignment example]

Local Alignment
Motivation:
- Ignore stretches of non-coding DNA.
- Protein domains

Local Alignment
$$F(i, j) = \max \begin{cases} 0, \\ F(i-1, j-1) + s(x_i, y_j), \\ F(i-1, j) - d, \\ F(i, j-1) - d, \end{cases}$$
with initial values $F(0, 0) = F(0, j) = F(i, 0) = 0$.
The highest value of $F(i, j)$ over the whole matrix is the score.
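A compact sketch of this local-alignment recurrence follows. The scores (match +1, substitution -1, d = 2) are carried over from the earlier examples as an assumption, and `local_align` is an illustrative name. The traceback starts at the maximum cell and stops at the first cell whose value is 0.

```python
def local_align(x, y, match=1, mismatch=-1, d=2):
    """Return (best_score, aligned_piece_of_x, aligned_piece_of_y)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j      # remember the best end point

    ax, ay = [], []
    i, j = bi, bj
    while F[i][j] > 0:                            # a zero cell ends the alignment
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(local_align("ttacgtt", "ggacgcc"))
```

The extra 0 in the maximum lets an alignment restart anywhere, which is what frees it from the dissimilar flanking regions.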

Example
[Figure: local alignment example]

Ends-free Alignment
Motivation: shotgun sequence assembly

Ends-free Alignment
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j), \\ F(i-1, j) - d, \\ F(i, j-1) - d, \end{cases}$$
with initial values $F(0, 0) = F(0, j) = F(i, 0) = 0$.
The highest value of $F(i, j)$ in the last column $F(i, n)$ or the last row $F(m, j)$ is the score.
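The ends-free variant differs from the global one only in its boundary values and in where the score is read off. A minimal score-only sketch, again assuming match +1, substitution -1 and d = 2; `ends_free_score` is an illustrative name.

```python
def ends_free_score(x, y, match=1, mismatch=-1, d=2):
    """Best score when leading and trailing end spaces are not penalised."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]     # first row and column stay 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
    last_row = max(F[m])                          # x fully consumed
    last_col = max(row[n] for row in F)           # y fully consumed
    return max(last_row, last_col)


# The suffix "cgt" of the first read overlaps the prefix "cgt" of the second.
print(ends_free_score("aaacgt", "cgtttt"))
```

This is the shape of the overlap test in shotgun assembly: unmatched overhangs at either end cost nothing.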

Example
[Figure: ends-free alignment example]

Complexity
All of the above algorithms can be implemented in time O(mn) and in space O(m + n).

Gap Penalty
A gap is any maximal consecutive run of spaces in an alignment.
The length of a gap is the number of spaces (indel operations) in it.

a t t c - - g a - t g g a c c
a - - c g t g a t t - - - c c

This alignment contains 4 gaps and 8 spaces in total.
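The distinction between gaps and spaces is easy to check mechanically. The sketch below counts both for the example alignment; `gaps_and_spaces` is an illustrative name.

```python
import re


def gaps_and_spaces(top, bottom):
    """Return (#gaps, #spaces): gaps are maximal runs of '-', spaces are single '-'."""
    runs = re.findall(r"-+", top) + re.findall(r"-+", bottom)
    return len(runs), sum(len(run) for run in runs)


print(gaps_and_spaces("attc--ga-tggacc", "a--cgtgatt---cc"))
```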

Motivation:
- Insertion or deletion of an entire subsequence often occurs as a single mutation event.
- Two protein sequences might be relatively similar over several intervals.
- cDNA: the complement of mRNA

Gap Penalty Models
1. constant gap penalty model: $W_g \times \#\text{gaps}$
2. affine gap penalty model ($y = ax + b$): $W_g \times \#\text{gaps} + W_s \times \#\text{spaces}$
3. convex gap penalty model: $W_g + \log(q)$, where $q$ is the length of the gap
4. arbitrary gap penalty model
$W_g$: gap-open penalty, $W_s$: gap-extension penalty
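To compare the first three models, the penalty of a single gap of length q can be written as a small function and summed over the gaps of an alignment. The weights W_g = 5 and W_s = 1 are arbitrary illustrative values, and the gap lengths (2, 1, 2, 3) are those of the attc--ga-tggacc / a--cgtgatt---cc example.

```python
import math


def gap_penalty(q, model, w_g=5.0, w_s=1.0):
    """Penalty of one gap of length q (w_g, w_s are illustrative weights)."""
    if model == "constant":
        return w_g                    # length does not matter
    if model == "affine":
        return w_g + w_s * q          # open once, pay per space
    if model == "convex":
        return w_g + math.log(q)      # each extra space costs less
    raise ValueError(f"unknown model: {model}")


gap_lengths = [2, 1, 2, 3]            # 4 gaps, 8 spaces
for model in ("constant", "affine", "convex"):
    print(model, sum(gap_penalty(q, model) for q in gap_lengths))
```

Summed over an alignment, the constant model gives W_g times the number of gaps and the affine model adds W_s times the number of spaces, matching the formulas above.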

Complexity
1. constant gap penalty model: time = O(mn)
2. affine gap penalty model: time = O(mn)
3. convex gap penalty model: time = O(mn lg(m + n))
4. arbitrary gap penalty model: time = O(mn(m + n))
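One common way to reach O(mn) for affine gaps is to keep three matrices, one per type of last column. The sketch below is a score-only version in that style; the three-matrix layout, the names `M`, `X`, `Y` and `affine_score`, and the default weights are assumptions of this example, with a gap of length q costing W_g + q W_s as in the affine model above.

```python
NEG_INF = float("-inf")


def affine_score(x, y, match=1, mismatch=-1, w_g=2, w_s=1):
    """Global alignment score with affine gaps; a gap of length q costs w_g + q*w_s."""
    m, n = len(x), len(y)
    # M: x_i aligned with y_j; X: x_i aligned with a gap; Y: y_j aligned with a gap.
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    X = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Y = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        X[i][0] = -(w_g + i * w_s)
    for j in range(1, n + 1):
        Y[0][j] = -(w_g + j * w_s)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            X[i][j] = max(M[i - 1][j] - w_g - w_s,    # open a new gap
                          X[i - 1][j] - w_s,          # extend the current gap
                          Y[i - 1][j] - w_g - w_s)
            Y[i][j] = max(M[i][j - 1] - w_g - w_s,
                          Y[i][j - 1] - w_s,
                          X[i][j - 1] - w_g - w_s)
    return max(M[m][n], X[m][n], Y[m][n])


print(affine_score("acgt", "at"))                      # one gap of length 2
print(affine_score("acctga", "agcta", w_g=0, w_s=2))   # reduces to the linear model
```

Each cell still looks at a constant number of neighbours, so the running time stays O(mn); with W_g = 0 the result coincides with the linear-gap score computed earlier.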

Conclusion