InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Similar documents
THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Effects of Gap Open and Gap Extension Penalties

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Sequence Analysis '17- lecture 8. Multiple sequence alignment

Tools and Algorithms in Bioinformatics

Pairwise & Multiple sequence alignments

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Large Grain Size Stochastic Optimization Alignment

Single alignment: Substitution Matrix. 16 march 2017

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Sequence Bioinformatics. Multiple Sequence Alignment Waqas Nasir

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Algorithms in Bioinformatics

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Multiple sequence alignment

Thanks to Paul Lewis, Jeff Thorne, and Joe Felsenstein for the use of slides

In-Depth Assessment of Local Sequence Alignment

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Quantifying sequence similarity

Motivating the need for optimal sequence alignments...

Dr. Amira A. AL-Hosary

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Moreover, the circular logic

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sequence analysis and Genomics

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Bioinformatics. Dept. of Computational Biology & Bioinformatics

MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE

Sequence analysis and comparison

Alignment & BLAST. By: Hadi Mozafari KUMS

Overview Multiple Sequence Alignment

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

Comparison of Cost Functions in Sequence Alignment. Ryan Healey

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Collected Works of Charles Dickens

Phylogenetic hypotheses and the utility of multiple sequence alignment

Sequence Alignment (chapter 6)

Copyright 2000 N. AYDIN. All rights reserved. 1

Computational Biology

Bioinformatics and BLAST

Concepts and Methods in Molecular Divergence Time Estimation

Phylogenetic Tree Reconstruction

Sequence Alignment Techniques and Their Uses

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

MegAlign Pro Pairwise Alignment Tutorials

Mul$ple Sequence Alignment Methods. Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Phylogenetic inference

Multiple Sequence Alignment: A Critical Comparison of Four Popular Programs

Lecture 5,6 Local sequence alignment

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

BMI/CS 776 Lecture #20 Alignment of whole genomes. Colin Dewey (with slides adapted from those by Mark Craven)

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Molecular Evolution & Phylogenetics

SUPPLEMENTARY INFORMATION

Multiple Sequence Alignment

Pairwise sequence alignment

1.5 Sequence alignment

Week 10: Homology Modelling (II) - HHpred

Evolutionary Tree Analysis. Overview

A greedy, graph-based algorithm for the alignment of multiple homologous gene lists

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment

EVOLUTIONARY DISTANCES

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on:

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

Phylogenetic analysis. Characters

Phylogenetics. BIOL 7711 Computational Bioscience

Introduction to Evolutionary Concepts

NUMB3RS Activity: DNA Sequence Alignment. Episode: Guns and Roses

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence

Introduction to Bioinformatics Introduction to Bioinformatics

Analysis and Design of Algorithms Dynamic Programming

Phylogeny: building the tree of life


MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

CS 394C Algorithms for Computational Biology. Tandy Warnow Spring 2012

Unsupervised Learning in Spectral Genome Analysis

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004,

Today s Lecture: HMMs

An Introduction to Sequence Similarity ( Homology ) Searching

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Similarity or Identity? When are molecules similar?

Protein Threading. BMI/CS 776 Colin Dewey Spring 2015

Large-Scale Genomic Surveys

Pairwise sequence alignments

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Multiple Sequence Alignment. Sequences

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Phylogeny Tree Algorithms

Transcription:

Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic characters. For nucleotide data, because there are only five (if we include gaps) possible character states for each character and the states are common across all the characters, establishing character homologies can be very challenging. Furthermore, there is no way to apply some of the criteria that are useful in positing homologies for morphological characters (transitional forms, developmental similarity). Thus, we re forced to rely on positional similarity as determined in our alignments to provide homologous characters. II. The ideal approach was discovered 40 years ago. It involves simultaneous alignment and tree estimation (Sankoff et al., 1973; Nature New Biology, 245:232). This is often called direct optimization and is actually compatible with the ideas of testing morphological hypotheses of homology with phylogenetic analyses we discussed last week. Alignment is essentially an evolutionary inference of insertion/deletion events (indels), and is cannot be attempted logically in a non-evolutionary framework. Note that not all computer scientists adopt this view. Let s think for a minute why alignment depends on the phylogeny (we ll use the example from Chapter 29 in the text). 1 ACCGAAT--ATTAGGCTC 2 AC---AT--AGTGGGATC 3 AA---AGGCATTAGGATC 4 GA---AGGCATTAGCATC 5 CACGAAGGCATTGGGCTC So, it looks like there have be two insertion/deletion events, one at positions 3-5 and one at positions 8-9, but there s no tree on which both can evolve only one time. 2 InDel 3-5 5 InDel 3-5 4 5 2 InDel 8-9 4 InDel 8-9 1 InDel 8-9 3 1 InDel 3-5 3

So really, figuring out where insertions and deletions have occurred requires a historical framework (i.e., a phylogeny), which (at least in this class) is the motivation for collecting sequence data in the first place. We can think about alternative alignments in terms of an overall score, D. This includes a score for matches/mismatches and a cost for gaps: A simple score might look like this: D = s + wg. s is the score for a matches, g is a gap cost and w is a weighting factor for gaps. We ll worry about the details of these values in few minutes, but the ideal goal would be to find that topology that has the best D, across all possible topologies and all possible alignments (i.e., find the globally optimal combination of topology + alignment; Sankoff et al. 1973). However, given that phylogenetic estimation from a single alignment is a problem that is so computationally difficult as to require heuristic methods (short cuts), simultaneous alignment is really not feasible. III. The most widely use heuristic is called progressive alignment, and the most commonly used program is Clustal (now in its Ω version). A. Overview. Alignment occurs in three steps. 1. A pair-wise alignment is generated for all (n 2 -n)/2 pairs of sequences. A pair-wise distance is estimated between each of the pairs and this is entered into a distance matrix. 2. That distance matrix is subjected to an algorithmic tree building procedure (usually NJ or UPGMA) to build a guide tree. 3. This guide tree is used to align progressively more distant pairs of sequences, until all sequences are in the alignment. B. Pair-wise alignments are conducted using Needleman-Wunsch (1970) algorithm. Take two sequences: G A A T T C A G T T A (#1) G G A T C G A (#2) So L 1 = 11 and L 2 = 7

To illustrate, we can assume a simple scoring scheme (i.e., set of alignment parameters). D = s + wg S i,j = 1 if the nt at position i of sequence #1 is the same as that at position j of sequence #2 S i,j = 0 (mismatch score) g = 0 (gap penalty) w = 1 (gap weight) The idea behind Needleman-Wunsch is to generate a matrix with the two sequences to be aligned along the two axes, so in this case we have an 11 x 7 matrix. We initialize it with a row and column of zeroes (gap cost). G A A T T C A G T T A 0 0 0 0 0 0 0 0 0 0 0 0 A 0 T 0 C 0 A 0 The matrix is filled by finding the maximal score M i,j for each cell, where M i,j = Maximum value of: 1 is the maximum value, So: M i-1, j-1 + S i,j (match/mismatch in the diagonal), 0 + 1 = 1 M i,j-1 + wg (gap in sequence #1; vertical arrow), 0 + 0 = 0 M i-1,j + wg (gap in sequence #2; horizontal arrow)] 0 + 0 = 0 G A A T T C A G T T A 0 0 0 0 0 0 0 0 0 0 0 0 1 A 0 T 0 C 0 A 0

We can therefore fill in the matrix: G A A T T C A G T T A 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 A 0 1 2 2 2 2 2 2 2 2 2 3 T 0 1 2 2 3 3 3 3 3 3 3 3 C 0 1 2 2 3 3 4 4 4 4 4 4 1 2 2 3 3 4 4 5 5 5 5 A 0 1 2 3 3 3 3 4 5 5 5 6 Now, we trace back from the bottom right to the top left, following the path with the highest score. G A A T T C A G T T A 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 A 0 1 2 2 2 2 2 2 2 2 2 3 T 0 1 2 2 3 3 3 3 3 3 3 3 C 0 1 2 2 3 3 4 4 4 4 4 4 1 2 2 3 3 4 4 5 5 5 5 A 0 1 2 3 3 3 3 4 5 5 5 6 This gives the alignment: - G A A T T C A G T T A G G - A T - C - G - - A Which has a score of 6 (there are six matches). All paths with a score of 6 represent optimal pairwise alignments, with the alignment parameters that we assumed. So in a multiple alignment, this is done for all (n 2 -n)/2 pairwise comparisons of sequences. These pairwise alignments are used to estimate pairwise genetic distances, which are entered into a pairwise matrix (like the one you saw Tuesday). C. This matrix is then subjected to an algorithmic method of generating a tree.

The Neighbor Joining method is usually used in Clustal. This is a star decomposition approach (that we'll learn soon), so it s fast. D. This NJ tree is used to build a multiple alignment progressively, first aligning the closest sequences on the guide tree, and progressively aligning sequences that are more distant. Guide Tree 1 ACCGAATATTAGGCTC 2 ACATAGTGGGATC 3 AAAGGCATTAGGATC 4 GAAGGCATTAGCATC 5 CACGAAGGCATTGGGCTC Following the Guide Tree, 1 & 2 are aligned. 1 ACCGAATATTAGGCTC 2 AC---ATAGTGGGATC Again, following the guide tree, we align sequence 3 to the fixed 1/2 alignment. 1 ACCGAAT--ATTAGGCTC 2 AC---AT--AGTGGGATC 3 AA---AGGCATTAGGATC We continue to follow the guide tree building a multiple alignment by progressive pairwise alignments. 1 ACCGAAT--ATTAGGCTC 2 AC---AT--AGTGGGATC 3 AA---AGGCATTAGGATC 4 GA---AGGCATTAGCATC 5 CACGAAGGCATTGGGCTC Now we have erected a series of hypothesized positional homologies and we can begin phylogeny inference. E. Elaborations: - A substitution matrix is used to calculate the scores for substitutions. This allows for biochemically similar amino acid substitutions to cost less than radical ones (for protein alignments) or transitional substitutions to cost less than transversional substitutions (for DNA alignments).

- Gap costs can be elaborated so that there is a separate cost associated with opening a new gap (GOP) vs, extending a gap (GEP). - Each sequence can be weighted according to how different it is from the other sequences. This accounts for the case where one specific subfamily is over represented in the data set. - Position-specific gap-open penalties are modified according to residue type, using empirical observations in a set of alignments based on 3D structures. In general, hydrophobic residues have higher gap penalties than hydrophilic, since they are more likely to be in the hydrophobic core, where gaps should not occur. - The largest portion of time is invested in conducting all the pair-wise N-W alignments to generate an initial distance matrix. There are several important shortcuts. - MUSCLE Uses k-mer words (usually k = 6) to derive a distance (just like the Edwards et al. method). This is much faster than doing a bunch of pair-wise NW alignments, and works just as well. As you might expect, the more similar sequences are, the more hexamers they ll have in common. Figure 2. This diagram summarizes the ow of the MUSCLE algorithm. There are three main stages: Stage 1 (draft progressive), Stage 2 (improved progressive) and Stage 3 (re nement). A multiple alignment is available at the completion of each stage, at which point the algorithm may terminate.

There are lots of issues to address, and we ll do so in lab. F. Models in Alignment One problem is that the final alignment is dependent on the parameters, and the optimum parameter values depend on unknown processes of evolution (i.e., insertion rate, deletion rate, and rates of various substitution types). The Catch-22 is that we need a good tree to estimate these. This is because, as discussed in the beginning of this lecture, alignment and historical information are not independent of each other. A small but growing body of work attempts to model this so that we can simultaneously optimize parameters and alignments. These are extremely computationally intensive. There are two approaches: 1. SATé (& SATé II: Liu et al. 2012. Syst. Biol. 61:90) is an iterative (successive approximations) approach that bundles MAFFT, MUSCLE & RAxML (which we ll discuss later). It uses models of nucleotide substitution, including rate variation among sites (which we ll also talk about) and ML estimation of intermediate trees. However, its improvements in alignment are mostly derived by the subtree pruning (like we just saw); this is pretty extensive and permits a maximum of 32 taxa per subtree. 2. Fully parametric Bayesian approaches have been developed (Redelings and Suchard 2005; Suchard and Redelings 2006)) that conduct simultaneous alignment and phylogeny estimation, and these have the advantage that they integrate across uncertainty in alignments. explores a vast state space: ω = (Y,A,τ,T,Θ,Λ). Y = Unaligned sequences A = Alignment t = The tree T = Its branch lengths Θ = The substitution model Λ = The model of insertion/deletion

At this point, though, they re only applicable to pretty moderate data sets. It actually provides an explicit evaluation of the uncertainty in alignment sites. Finally, computational methods have a difficult time producing alignments that maintain secondary structure appropriately, especially for rrna sequences. We ll read papers that address this. A very abbreviated list of relevant references. Edgar, R.C. 2004. Local homology recognition and distance measures in linear time using compressed amino acid alphabets. Nucleic Acids Res., 32:380-385. Edgar, R. C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. 32:1792-1797. Feng, D.-F. and R. F. Doolittle. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25: 351-360. Hein, J. J., L. Jensen, & C. N. S. Pedersen. 2003. Recursions for statistical multiple alignment. PNAS 100:14950-14965. Metzler, D., R. Fleißner, A. Wakolbinger, & A. von Haeseler. 2001. Assessing variability by joint sampling of alignments and mutation rates. J. Mol. Evol. 53:660-669. Needleman, S. B. & C. D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acids sequence of two proteins. J. Mol. Biol. 48:443-453. Redelings, B. D. and M. A. Suchard. 2005. Joint Bayesian estimation of elignment and phylogeny. Systematic Biology, 54:401-418. Smith, T. F. & M. S. Waterman. 1982. Identification of common molecular subsequences. J. Mol.. Biol. 147:195-197.

Thompson, J.D., Higgins, D.G. and Gibson, T.J. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-80. Wheeler, W. 2001. Homology and the optimization of DNA sequence data. Cladistics,17:S3- S11. http://bimas.dcrt.nih.gov/clustalw/clustalw.html http://www.sbc.su.se/~per/molbioinfo2001/seqali-dyn.html