Multiple Sequence Alignment (MSA)

Multiple Sequence Alignment (MSA): Group-to-group alignments
Steven Adriaensen & Ken Tanaka

References
- Osamu Gotoh (1993). Optimal alignment between groups of sequences and its application to multiple sequence alignment.
- Osamu Gotoh (1994). Further improvements in methods of group-to-group sequence alignment with generalized profile operations.

Context: Why do Multiple Sequence Alignment?
- Reducing uncertainties
- Better for identifying similarities

Context: How to do Multiple Sequence Alignment
- Generalization of standard DP algorithms
- Meta-heuristic optimizations
- Progressive alignment approach (e.g. the Clustal system)

Context: Progressive Alignment, Step 1 (discussed in the first lecture)
Calculate a distance matrix between all sequence pairs.
- distance between a pair of sequences: based on their alignment score
- aligning of two sequences:
  - global alignment: Needleman-Wunsch
  - local alignment: Smith-Waterman
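As a rough illustration of Step 1, the sketch below fills a distance matrix from pairwise global alignments. It assumes a Needleman-Wunsch-style dynamic program written in its cost (distance) form with a linear gap cost; the function names `global_alignment_cost` and `distance_matrix` and the unit mismatch and gap costs are illustrative choices, not taken from the slides.

```python
from itertools import combinations


def global_alignment_cost(s, t, mismatch=1, gap=1):
    """Global alignment cost of s and t (assumed unit costs, linear gaps)."""
    # prev[j] holds the cost of aligning the processed prefix of s with t[:j].
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * gap]
        for j in range(1, len(t) + 1):
            sub = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch)
            cur.append(min(sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def distance_matrix(seqs):
    """Symmetric matrix of pairwise alignment costs, zero on the diagonal."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for a, b in combinations(range(n), 2):
        dist[a][b] = dist[b][a] = global_alignment_cost(seqs[a], seqs[b])
    return dist


if __name__ == "__main__":
    for row in distance_matrix(["ACGT", "AGT", "ACGA"]):
        print(row)
```

Only two rows of the table are kept at a time, since the matrix needs the final cost and not the alignment itself.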

Context: Progressive Alignment, Step 2 (discussed in the second lecture)
Constructing a guide tree (phylogenetic tree): a tree describing the relationships between sequences.
- Fitch-Margoliash
- length of branches: corresponds to the distance between sequences

Context: Progressive Alignment, Step 3
We now have a guide tree, what's next?
GroupAlign = group-to-group alignment
- the 4 algorithms by Osamu Gotoh

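Group-to-group alignments
$$A = \begin{pmatrix} a_{11} & \cdots & a_{1I} \\ \vdots & & \vdots \\ a_{M1} & \cdots & a_{MI} \end{pmatrix}, \qquad B = \begin{pmatrix} b_{11} & \cdots & b_{1J} \\ \vdots & & \vdots \\ b_{N1} & \cdots & b_{NJ} \end{pmatrix}$$
e.g. $N = \begin{pmatrix} a_1 & a_2 & - \\ - & b_1 & b_J \end{pmatrix}$ with path: 1 3 2,
where 3 is a match and 1, 2 place a gap column in group B, A resp.
In order to find such an alignment we need 2 things:
1. A measure of how good an alignment is.
2. An algorithm that creates an alignment that is (approximately) optimal with respect to this measure.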

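How good is an alignment?
We have seen a score for an alignment of 2 sequences; we now extend it to multiple sequences.
Affine gap penalty: gap penalty = $u \cdot \text{gap length} + v$
$$d(a_{mi}, a_{ki}) = \begin{cases} 0 & (a_{mi} = a_{ki} = -) \\ u & (\text{exactly one of } a_{mi}, a_{ki} \text{ is } -) \\ d(a_{mi}, a_{ki}) & (\text{else}) \end{cases}$$
Sum of Pairs:
$$SP = \sum_{m=2}^{M} \sum_{k=1}^{m-1} S_{mk}, \qquad S_{mk} = \sum_{i=1}^{I} d(a_{mi}, a_{ki}) + v\, g_{mk}$$
where $g_{mk}$ is the number of gaps in the pair of sequences (m, k).
In the context of group-to-group alignments we have
$$SP_N = SP_A + SP_B + SP_{A.B}, \qquad SP_{A.B} = \sum_{m=1}^{M} \sum_{n=1}^{N} S_{mn}$$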
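The sum-of-pairs measure can be sketched in a few lines. This is a minimal illustration under stated assumptions: aligned rows are equal-length strings with "-" as the gap symbol, a gap is a maximal run of gap symbols in one row after columns in which both rows have a gap are dropped, and the substitution cost `d` is supplied by the caller. The names `count_gaps`, `pair_score` and `sum_of_pairs` are hypothetical.

```python
from itertools import combinations

GAP = "-"


def count_gaps(row_a, row_b):
    """g for one pair: number of gap runs, ignoring columns where both rows are gaps."""
    gaps = 0
    state = None  # row that currently carries an open gap: "a", "b" or None
    for x, y in zip(row_a, row_b):
        if x == GAP and y == GAP:
            continue  # matching gaps were not introduced by aligning this pair
        if x == GAP:
            now = "a"
        elif y == GAP:
            now = "b"
        else:
            now = None
        if now is not None and now != state:
            gaps += 1
        state = now
    return gaps


def pair_score(row_a, row_b, d, u, v):
    """S_mk: column costs plus v times the number of gaps in the pair."""
    cost = 0
    for x, y in zip(row_a, row_b):
        if x == GAP and y == GAP:
            continue
        if x == GAP or y == GAP:
            cost += u
        else:
            cost += d(x, y)
    return cost + v * count_gaps(row_a, row_b)


def sum_of_pairs(rows, d, u, v):
    """SP: sum of S_mk over all unordered pairs of rows."""
    return sum(pair_score(a, b, d, u, v) for a, b in combinations(rows, 2))


if __name__ == "__main__":
    mismatch = lambda x, y: 0 if x == y else 1
    print(sum_of_pairs(["AC-GT", "A--GT", "ACCGT"], mismatch, u=1, v=2))
```

Tracking which row holds the open gap means that a gap in one row immediately followed by a gap in the other row is counted as two gaps.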

Mind the gap(s)
If $v \neq 0$ we must be able to compute $g_{mk}$. What is g for this pair of sequences?
[Figure: example pair of aligned sequences containing gaps]
Removing matching gaps from a pair of sequences should not affect SP and, by consequence, g, as these were not introduced in aligning these sequences.
Answer: 5

Algorithms: Notations & remarks
- $D^\alpha_{ij}$: the score obtained by the algorithm for the alignment of $\{a_1, \ldots, a_i\}$ and $\{b_1, \ldots, b_j\}$, given that the last segment of the alignment path was $\alpha$. In algorithms A-C these are the candidates retained for a sub-alignment $\alpha_{ij}$. In algorithm D the semantics of the superscript are extended to denote any candidate retained for a sub-alignment ij.
- The score returned by the algorithms is Score(A.B). Only for C and D we have Score = SP.
- $a^\alpha_{mij}$ is the last symbol in the m-th row of group A in a sub-alignment $\alpha_{ij}$. It is equal to $a_{mi}$ if $\alpha = 1, 3$ and to $-$ if $\alpha = 2$.
- $b^\alpha_{nij}$ is the last symbol in the n-th row of group B in a sub-alignment $\alpha_{ij}$. It is equal to $b_{nj}$ if $\alpha = 2, 3$ and to $-$ if $\alpha = 1$.
e.g. for path: 1 3 2,
$$N = \begin{pmatrix} a_1 & a_2 & - \\ - & b_1 & b_J \end{pmatrix} = \begin{pmatrix} a^1_{10} & a^3_{21} & a^2_{IJ} \\ b^1_{10} & b^3_{21} & b^2_{IJ} \end{pmatrix}$$
All algorithms obtain the same result if neither A nor B contains gaps or if $v = 0$!

Algorithm A
- Backtrack: keep a pointer to the predecessor for each candidate retained.
- Reduce memory: replace records that are no longer required (only keep the J + 2 last computed records). Let A be the group with the longest sequences (to minimize J).
  - Backtrack: keep the path for every candidate retained.

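Algorithm A
$$D^1_{ij} = \min\left(D^3_{i-1,j} + V,\; D^1_{i-1,j}\right) + d(a_i, -)$$
$$D^2_{ij} = \min\left(D^3_{i,j-1} + V,\; D^2_{i,j-1}\right) + d(-, b_j)$$
$$D^3_{ij} = \min_{\alpha = 1,2,3} D^\alpha_{i-1,j-1} + d(a_i, b_j)$$
$$V = MNv$$
$$d(a^\alpha_{ij}, b^\alpha_{ij}) = \sum_{m=1}^{M} \sum_{n=1}^{N} d(a^\alpha_{mij}, b^\alpha_{nij})$$
$$\text{Record}_A = \{D^1_{ij}, D^2_{ij}, D^3_{ij}\}$$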
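The recurrences of algorithm A can be sketched as a three-state dynamic program. The code below is an illustration under assumptions, not the authors' implementation: groups are lists of equal-length strings, internal gaps ("-") are treated as ordinary symbols for gap-opening purposes, the substitution cost is 0 for a match and 1 for a mismatch, and the names `pair_cost`, `column_cost` and `algorithm_a_score` are hypothetical. It returns only the score and keeps the full table rather than the reduced set of records.

```python
import math

GAP = "-"


def pair_cost(x, y, u):
    """Assumed symbol cost: 0 for gap/gap, u for symbol/gap, 0/1 for match/mismatch."""
    if x == GAP and y == GAP:
        return 0
    if x == GAP or y == GAP:
        return u
    return 0 if x == y else 1


def column_cost(col_a, col_b, u):
    """d(a_i, b_j): sum of symbol costs over all M*N row pairs (no profiles)."""
    return sum(pair_cost(x, y, u) for x in col_a for y in col_b)


def algorithm_a_score(group_a, group_b, u=1, v=2):
    M, N = len(group_a), len(group_b)
    I, J = len(group_a[0]), len(group_b[0])
    cols_a, cols_b = list(zip(*group_a)), list(zip(*group_b))
    gap_a, gap_b = (GAP,) * M, (GAP,) * N
    V = M * N * v  # every 3->1 or 3->2 change is charged as M*N gap openings
    # D[i][j] = [D1, D2, D3]; index 0, 1, 2 stands for last segment 1, 2, 3.
    D = [[[math.inf] * 3 for _ in range(J + 1)] for _ in range(I + 1)]
    D[0][0][2] = 0
    for i in range(I + 1):
        for j in range(J + 1):
            if i > 0:
                p = D[i - 1][j]
                D[i][j][0] = min(p[2] + V, p[0]) + column_cost(cols_a[i - 1], gap_b, u)
            if j > 0:
                p = D[i][j - 1]
                D[i][j][1] = min(p[2] + V, p[1]) + column_cost(gap_a, cols_b[j - 1], u)
            if i > 0 and j > 0:
                D[i][j][2] = min(D[i - 1][j - 1]) + column_cost(cols_a[i - 1], cols_b[j - 1], u)
    return min(D[I][J])


if __name__ == "__main__":
    print(algorithm_a_score(["ACGT", "ACGT"], ["AGT", "AGT"]))
```

Treating the empty alignment as ending in segment 3 means a leading gap is charged an opening penalty like any other.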

Gap openings and Algorithm A
Algorithm A treats internal gaps in groups A, B as an ordinary symbol. From a gap-opening perspective, there are no internal gaps.
Count-rule: a change in path 3 → 1 or 3 → 2 is assumed to open a gap in every pairwise comparison ($V = MNv$).
- e.g. path [3,1]: counts NM = 2x2 = 4, actual gaps 4
- Overestimate, e.g. path [3,1]: counts NM = 2x2 = 4, actual gaps 1
- Underestimate, e.g. path [3,3]: counts 0, actual gaps 2
[Figure: the three example alignments of two 2-row groups, each with a table of gap counts per pair m\n]

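Complexity of Algorithm A
Naively $O(NMIJ) = O(NML^2)$; this becomes $O(L^2)$ assuming you use profiles:
$$d(a^\alpha_{ij}, b^\alpha_{ij}) = \sum_{x \in X} f^{A\alpha}_{x,i}\; p^{B\alpha}_{x,j}$$
$$f^{A\alpha}_{x,i} = \sum_{m=1}^{M} (a^\alpha_{mij} = x), \qquad p^{B\alpha}_{x,j} = \sum_{y \in X} d(x, y)\, f^{B\alpha}_{y,j}$$
with $f^{A\alpha}_i = f^A_i$ if $a^\alpha_{ij} = a_i$ and $f^{-}$ if $a^\alpha_{ij} = -$, and $p^{B\alpha}_j = p^B_j$ if $b^\alpha_{ij} = b_j$ and $p^{-}$ if $b^\alpha_{ij} = -$.
$f^A_i$, $f^B_j$, $p^A_i$, $p^B_j$ can be precomputed (outside the loops).
Essentially we are rewriting a sum with a fixed number of different terms as a dot product, e.g. 2+2+2+2+2+5+5+5+5 = 5*2 + 4*5: $p^B_{x,j}$ gives the possible values for the terms (2, 5) and $f^A_{x,i}$ their frequencies (5, 4).
Beware of the large constant $|X|$!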
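The profile rewrite can be checked on a single pair of columns. The sketch below compares the naive double sum over all row pairs with the dot product of a frequency profile and a precomputed cost profile. The DNA alphabet, the 0/1 substitution cost, the gap cost `u` and all function names are assumptions made for illustration.

```python
from collections import Counter

GAP = "-"
ALPHABET = "ACGT" + GAP  # assumed alphabet X, with the gap as an extra symbol


def pair_cost(x, y, u=1):
    """Assumed symbol cost: 0 for gap/gap, u for symbol/gap, 0/1 for match/mismatch."""
    if x == GAP and y == GAP:
        return 0
    if x == GAP or y == GAP:
        return u
    return 0 if x == y else 1


def naive_column_cost(col_a, col_b, u=1):
    """Sum over all M*N row pairs: O(MN) per cell of the DP table."""
    return sum(pair_cost(x, y, u) for x in col_a for y in col_b)


def frequency_profile(col):
    """f_x: how often each symbol x of the alphabet occurs in the column."""
    counts = Counter(col)
    return [counts[x] for x in ALPHABET]


def cost_profile(col, u=1):
    """p_x = sum over y of d(x, y) * f_y: cost of symbol x against the whole column."""
    f = frequency_profile(col)
    return [sum(pair_cost(x, y, u) * f[k] for k, y in enumerate(ALPHABET))
            for x in ALPHABET]


def profile_column_cost(f_a, p_b):
    """Dot product over the alphabet: O(|X|) per cell, independent of M and N."""
    return sum(f * p for f, p in zip(f_a, p_b))


if __name__ == "__main__":
    col_a, col_b = ("A", "A", "C", "-"), ("A", "G", "-")
    f_a, p_b = frequency_profile(col_a), cost_profile(col_b)  # precomputed once per column
    print(naive_column_cost(col_a, col_b), profile_column_cost(f_a, p_b))
```

Both profiles depend on one column only, so they can be built once per column before the main loops.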

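Algorithm B
$$D^1_{ij} = \min_{\beta = 1,3} \left[ D^\beta_{i-1,j} + g^{\beta 1}_{ij}\, v \right] + d(a_i, -)$$
$$D^2_{ij} = \min_{\beta = 2,3} \left[ D^\beta_{i,j-1} + g^{\beta 2}_{ij}\, v \right] + d(-, b_j)$$
$$D^3_{ij} = \min_{\beta = 1,2,3} \left[ D^\beta_{i-1,j-1} + g^{\beta 3}_{ij}\, v \right] + d(a_i, b_j)$$
$$\text{Record}_B = \{D^1_{ij}, D^2_{ij}, D^3_{ij}\}$$
$g^{\beta\alpha}_{ij}$ is the estimated number of new gaps introduced given that the last 2 path segments are $\beta\alpha$.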

Gap openings in algorithm B
Gap openings: algorithm B estimates the new gaps based on whether or not the symbols at the 2 last positions, for every pair of sequences, are gaps in the evaluated candidate alignment.
[Figure: example alignment in which the symbols before the last 2 positions are unknown (marked "?")]

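$g^{\beta\alpha}_{ij}$ in algorithm B
$$g^{11}_{ij} = \sum_{m=1}^{M} \sum_{n=1}^{N} \gamma^{11} q_{m,i-1} (1 - q_{m,i})$$
$$g^{13}_{ij} = \sum_{m=1}^{M} \sum_{n=1}^{N} \left[ \gamma^{13} q_{m,i-1} (1 - q_{m,i})\, r_{n,j} + (1 - q_{m,i-1} + \gamma^{13} q_{m,i-1})(1 - r_{n,j})\, q_{m,i} \right]$$
$$g^{22}_{ij} = \sum_{m=1}^{M} \sum_{n=1}^{N} \gamma^{22} r_{n,j-1} (1 - r_{n,j})$$
$$g^{23}_{ij} = \sum_{m=1}^{M} \sum_{n=1}^{N} \left[ (1 - r_{n,j-1} + \gamma^{23} r_{n,j-1})(1 - q_{m,i})\, r_{n,j} + \gamma^{23} r_{n,j-1} (1 - r_{n,j})\, q_{m,i} \right]$$
$$g^{31}_{ij} = \sum_{m=1}^{M} \sum_{n=1}^{N} (1 - r_{n,j-1} + \gamma^{31} q_{m,i-1} r_{n,j-1})(1 - q_{m,i})$$
$$g^{32}_{ij} = \sum_{m=1}^{M} \sum_{n=1}^{N} (1 - q_{m,i-1} + \gamma^{32} r_{n,j-1} q_{m,i-1})(1 - r_{n,j})$$
$$g^{33}_{ij} = \sum_{m=1}^{M} \sum_{n=1}^{N} \left[ (1 - r_{n,j-1} + \gamma^{33} q_{m,i-1} r_{n,j-1})(1 - q_{m,i})\, r_{n,j} + (1 - q_{m,i-1} + \gamma^{33} r_{n,j-1} q_{m,i-1})(1 - r_{n,j})\, q_{m,i} \right]$$
where $q_{m,i} = (a_{m,i} = -)$, $r_{n,j} = (b_{n,j} = -)$ and $0 \le \gamma^{\beta\alpha} \le 1$.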

Complexity of algorithm B
Naïve: $O(MNL^2)$.
Using profiles, both $d(a^\alpha_{ij}, b^\alpha_{ij})$ and $g^{\beta\alpha}_{ij}$ can be computed in a fixed number of steps, resulting in $O(L^2)$.

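Algorithm C
$$D^1_{ij} = \min_{\beta = 1,3} \left[ D^\beta_{i-1,j} + g^{\beta 1}_{ij}\, v \right] + d(a_i, -)$$
$$D^2_{ij} = \min_{\beta = 2,3} \left[ D^\beta_{i,j-1} + g^{\beta 2}_{ij}\, v \right] + d(-, b_j)$$
$$D^3_{ij} = \min_{\beta = 1,2,3} \left[ D^\beta_{i-1,j-1} + g^{\beta 3}_{ij}\, v \right] + d(a_i, b_j)$$
$g^{\beta\alpha}_{ij}$ is the actual number of new gaps introduced given that the last 2 path segments are $\beta\alpha$.
$Q^\alpha_{ij}$, $R^\alpha_{ij}$ give, for every sequence in sub-alignment $\alpha_{ij}$ of A, B, the number of consecutive gaps it has at the end.
$$\text{Record}_C = \{D^\alpha_{ij}, Q^\alpha_{ij}, R^\alpha_{ij}\} \quad (\alpha = 1, 2, 3)$$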

Gap openings and algorithm C
[Figure: example alignment annotated with gap counts]
We count a new gap in a pairwise sequence comparison if we have non-matching gaps and we are not extending a gap we already counted. As a result, Score(A,B) = SP(A.B).
Note however that some gaps only get counted when they are closed. This extra cost, present in the full alignment but not in the sub-alignment, is called the retarded gap penalty.
Because of the retarded gap penalty, a sub-alignment ij of the optimal alignment IJ is not guaranteed to be a retained alignment for ij.

The 55 sub-alignment of the optimal alignment is not the optimal 55 sub-alignment. However, as the better alternative has a different last path segment, the sub-alignment of the optimal alignment is still retained.

The 66 sub-alignment of the optimal alignment is not the optimal 66 sub-alignment. The better alternative now has the same last path segment and is retained instead.

We do not find the optimal group-to-group alignment.

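$g^{\beta\alpha}_{ij}$ in algorithm C
Let $Q'$, $R'$ be the values for Q, R of the predecessor sub-alignment $\beta$. We then have
$$g^{\beta 1}_{ij} = \sum_{m=1}^{M} \sum_{n=1}^{N} (1 - q_{m,i}) (Q'_m \ge R'_n)$$
$$g^{\beta 2}_{ij} = \sum_{m=1}^{M} \sum_{n=1}^{N} (1 - r_{n,j}) (Q'_m \le R'_n)$$
$$g^{\beta 3}_{ij} = \sum_{m=1}^{M} \sum_{n=1}^{N} \left[ (1 - q_{m,i})\, r_{n,j}\, (Q'_m \ge R'_n) + (1 - r_{n,j})\, q_{m,i}\, (Q'_m \le R'_n) \right]$$
Let $Q'$, $R'$ be the values for Q, R of the predecessor sub-alignment of the retained candidate. $Q^\alpha_{ij}$, $R^\alpha_{ij}$ are then computed as follows:
$$Q^\alpha_{m,ij} = \begin{cases} Q'_m + 1 & (\text{if } q_{m,i}) \\ 0 & (\text{else}) \end{cases} \qquad R^\alpha_{n,ij} = \begin{cases} R'_n + 1 & (\text{if } r_{n,j}) \\ 0 & (\text{else}) \end{cases}$$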
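The gap bookkeeping of algorithm C can be sketched column by column. The code below assumes that a pair (m, n) opens a new gap in row n when row m has at least as many trailing gaps as row n (and symmetrically for a gap in row m); this comparison is a reconstruction of the rule "non-matching gaps, not extending a gap already counted" and should be checked against Gotoh (1993). The names `new_gaps`, `update_runs` and `cross_group_gaps` are hypothetical, and both groups are given as rows of one finished alignment.

```python
GAP = "-"


def new_gaps(col_a, col_b, q_prev, r_prev):
    """Number of pairs (m, n) in which the new column opens a gap.

    q_prev[m] and r_prev[n] are the trailing gap run lengths of the rows
    of A and B in the predecessor sub-alignment.
    """
    g = 0
    for m, x in enumerate(col_a):
        for n, y in enumerate(col_b):
            if x != GAP and y == GAP and q_prev[m] >= r_prev[n]:
                g += 1  # gap opens in row n of B
            elif x == GAP and y != GAP and q_prev[m] <= r_prev[n]:
                g += 1  # gap opens in row m of A
    return g


def update_runs(col, runs_prev):
    """Trailing gap run per row: extend it on a gap, reset it to 0 otherwise."""
    return [r + 1 if s == GAP else 0 for s, r in zip(col, runs_prev)]


def cross_group_gaps(rows_a, rows_b):
    """Total number of gaps over all pairs (m, n), accumulated column by column."""
    q, r = [0] * len(rows_a), [0] * len(rows_b)
    total = 0
    for col_a, col_b in zip(zip(*rows_a), zip(*rows_b)):
        total += new_gaps(col_a, col_b, q, r)
        q, r = update_runs(col_a, q), update_runs(col_b, r)
    return total


if __name__ == "__main__":
    print(cross_group_gaps(["AC-GT-", "ACCGT-"], ["A--G-A", "AC-GTA"]))
```

Because only the run lengths of the predecessor are needed, each record can carry Q and R alongside D, as in the record of algorithm C.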

Complexity of algorithm C
Naïve: $O(MNL^2)$.
Problem: it is not straightforward to use profiles to compute $g^{\beta\alpha}_{ij}$, so we are left with $O(MNL^2)$.
Gotoh (1994) introduces the concept of generalized profiles, achieving a complexity that is not directly dependent on M, N.

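Algorithm D
D retains a dynamic set of good candidates (candidate list paradigm), using a series of tests $\tau_1$, $\tau_2$, $\tau_3$ and $\tau_4$.
Let T be the filter based on $\tau_1$, $\tau_2$, $\tau_3$ and $\tau_4$. Let us call $EV_{ij}$ the set of candidates evaluated and $RE_{ij}$ the set of candidates retained.
$$EV^1_{ij} = \left\{ D^\beta_{i-1,j} + g^{\beta 1}_{ij}\, v + d(a_i, -) \;\middle|\; \beta \in RE_{i-1,j} \right\}$$
$$EV^2_{ij} = \left\{ D^\beta_{i,j-1} + g^{\beta 2}_{ij}\, v + d(-, b_j) \;\middle|\; \beta \in RE_{i,j-1} \right\}$$
$$EV^3_{ij} = \left\{ D^\beta_{i-1,j-1} + g^{\beta 3}_{ij}\, v + d(a_i, b_j) \;\middle|\; \beta \in RE_{i-1,j-1} \right\}$$
$$EV_{ij} = EV^1_{ij} \cup EV^2_{ij} \cup EV^3_{ij}, \qquad RE_{ij} = T(EV_{ij})$$
$$\text{Record}_D = \{D^\alpha_{ij}, Q^\alpha_{ij}, R^\alpha_{ij}\} \quad (\alpha \in RE_{ij})$$
where $g^{\beta\alpha}_{ij}$ is calculated as in algorithm C.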

Candidate tests in algorithm D
The tests are 4 necessary conditions for being a sub-alignment of an optimal alignment. Algorithm D therefore returns the optimal alignment for two groups.
Call $C_{ij}$ the set of candidates that haven't failed any prior test. Each test compares a candidate $\alpha$ with a reference candidate:
- $\tau_1$: reference $l = \arg\min_{C_{ij}} D^\alpha_{ij}$
- $\tau_2$: reference $m = \arg\min_{C_{ij}} E^\alpha_{ij}$, where $E^\alpha_{ij} = D^\alpha_{ij} + v \cdot (\ldots)$
- $\tau_3$: reference $n = \arg\min_{C_{ij}} F^\alpha_{ij}$, where $F^\alpha_{ij} = D^\alpha_{ij} + v \cdot (\ldots)$
- $\tau_4$: reference $\arg\min_{C_{ij}} G^\alpha_{ij}$, where $G^\alpha_{ij} = D^\alpha_{ij} + v \cdot (\ldots)$
[Equations: each $\tau$ is a condition over all pairs (m, n), $m = 1..M$, $n = 1..N$, comparing $Q^\alpha_{m,ij}$ and $R^\alpha_{n,ij}$ of the candidate with those of the reference candidate]
Test order: $\tau_4\ \tau_1\ \tau_2\ \tau_3 \ldots$ (until none are removed or $|C_{ij}| = 1$).
Note that the reference candidates always automatically pass their corresponding test.

Complexity of algorithm D
- No worst-case complexity known (obvious upper bound: retains all).
- Average case: $O(MNL^2)$.
- Evaluates/retains fewer candidates than algorithms A-C in practice.
- Evaluation of candidates takes more time.
Using the profiles for C allows us to compute $D^\alpha_{ij}$ and $g^{\beta\alpha}_{ij}$ in a time not directly dependent on MN. However, the tests in D require MN time:
- Calculating the test values $E_{ij}$, $F_{ij}$, $G_{ij}$: $O(|C_{ij}| \cdot MN)$
- Performing the tests $\tau_1$, $\tau_2$, $\tau_3$: $O(|C_{ij}| \cdot MN)$
Gotoh (1994) rewrites these computations and tests using generalized profiles to remove these direct dependencies between computation time and MN.

Experiments

Results: General
Tradeoff: speed vs. accuracy
- Speed: A > B > C > D (more procedural complexity)
- Accuracy: D > C > B > A (more accuracy in gap penalty estimation)

Results & Considerations
- relatedness vs. accuracy: for distantly related sequences C, D are better
- (generalized) profiles vs. no profiles: all algorithms have variants using profiles, but the additional complexity only pays off when M.N is high

Conclusions
Which algorithm to use?
- can depend on the groups to be aligned, e.g.
  - No internal gaps: A = B = C = D
  - Distantly related: D > C > B > A
  - M.N is high: (D > C) with generalized profiles > ...
- can depend on preferences, e.g.
  - Has to be simple (procedurally): A > B > C > D
  - Has to be efficient (time/memory-wise): A > B > ...
  - Has to be accurate: D > C > ...