Evaluation Measures of Multiple Sequence Alignments. Gaston H. Gonnet, *Chantal Korostensky and Steve Benner. Institute for Scientic Computing

Size: px
Start display at page:

Download "Evaluation Measures of Multiple Sequence Alignments. Gaston H. Gonnet, *Chantal Korostensky and Steve Benner. Institute for Scientic Computing"

Transcription

1 Evaluation Measures of Multiple Sequence Alignments Gaston H. Gonnet, *Chantal Korostensky and Steve Benner Institute for Scientic Computing ETH Zurich, 8092 Zuerich, Switzerland phone: fgonnet,korosteng@inf.ethz.ch

2 Abstract Multiple sequence alignments (msas) are frequently used in the study of families of protein sequences or DNA/RNA sequences. They are a fundamental tool for the understanding of structure, functionality and ultimately evolution of proteins. A new algorithm, the Circular Sum (CS) method, is presented for formally evaluating the quality of a multiple sequence alignment (msa). It is based on the use of a heuristic that approximates a solution to the Travelling Salesman Problem, which identies a Eulerian tour through an evolutionary tree connecting the sequences in a protein family. The algorithm gives an upper bound, the best score that can possibly be achieved by any msa for a given set of protein sequences. Alternatively, if presented with a specic msa, the algorithm provides a formal score for the msa, which serves as an absolute measure of the quality of the msa. The CS measure can be calculated without the need to construct a corresponding evolutionary tree. It does, however, yield a direct connection between a msa and the associated evolutionary tree, which allows us to use the measure to construct and evaluate evolutionary trees and as a tool for evaluating dierent methods for producing msas. A brief example of the last application is provided. Because it weights all evolutionary events on a tree identically, the CS algorithm has advantages over the frequently used sum-of-pairs measures for scoring msas, which weight some evolutionary events more strongly than others. Keywords: multiple sequence alignment, phylogenetic tree, scoring function, TSP, evolution 2

3 1 Introduction One of the most important applications of the data being generated from genome sequencing projects is to reconstruct models for the evolutionary histories of proteins, nucleic acids, and the organisms that carry them. A model for an evolutionary history typically consists of three parts: (a) an evolutionary tree, which shows the familial relationships of evolutionary objects (proteins, for example), (b) a multiple sequence alignment, which shows the evolutionary relationships between parts of these objects (in this case, individual amino acids in the protein sequence), and (c) reconstructed ancestral sequences, models for objects that were intermediates in the evolution of the family. Evolutionary histories are important in deducing biological function in biomolecules, predicting the folded conformation of protein sequences, and reconstructing the history of life on Earth. Constructing trees, multiple sequence alignments, and ancestral sequences encounters three sorts of problems. First, all are dependent on a model for evolutionary processes. At the level of the protein (which will be our exclusive concern here), modelling the evolutionary history of a family must begin with assumptions about the frequencies of amino acid mutations, insertions, and deletions. These frequencies are now available from empirical studies of protein sequences (Gonnet et al., 1992). In this work, a simple model, derived originally from work of Margaret Dayho (Dayho et al., 1978) and subsequently amplied, will be used. Second, constructing models for evolutionary histories encounters problems of mathematical complexity. The multiple sequence alignment (msa) problem and the construction of evolutionary trees is NP complete, and becomes computationally expensive for many 3

4 real protein families encountered in a contemporary database. This requires that virtually all msas and evolutionary trees that will be used in the post-genomic era will be constructed using approximate heuristics. The use of heuristics in constructing multiple sequence alignments creates a third problem, one centering on evaluation. Before using a msa or tree built by a heuristic, one would like to know approximately how closely the heuristic has approximated an optimum msa or tree. Even today, it is common for biochemists to evaluate by eye (and adjust by hand) the output of multiple sequence alignment tools. This is a clearly inadequate approach for any systematic reconstruction of natural history using genomic sequence data. A formal method for judging the quality of a multiple sequence alignment is needed. Accordingly, a variety of groups have proposed or used scoring functions (Altschul, 1989; Thompson et al., 1994; Higgins and Sharp, 1989; Lawrence et al., 1993) that assess the quality of a msa, following a simple approach that examines every pair of proteins in the family, generates a score for each pairwise alignment using a Dayho matrix (Dayho et al., 1978), and creates a score for the msa by summing each of the scores of the pairwise alignments. We shall call these \sum of pairs" (SP) methods. A B C Figure 1: A sample evolutionary tree (phylogenetic tree) 4

5 Although simple to implement, SP methods are obviously decient from an evolutionary perspective. Consider a simple tree (Figure 1) constructed for a family containing four proteins. The score of a pairwise alignment of sequence 1 and 2 (as generated by a standard SP scoring method) evaluates the likelihood of evolutionary events on edges B!1 and B!2 of the tree, the edges that represent the evolutionary distance from sequence 1 to sequence 2. Likewise, the score of a pairwise alignment of sequences 3 and 4 evaluates the likelihood of evolutionary events on edges C!3 and C!4 of the tree. By the same logic, scoring a pairwise alignment of sequences 1 and 3 evaluates the likelihood of evolutionary events on edges A!B, B!1, A!C, and C!3 of the tree. A B C Figure 2: Traversal of a phylogenetic tree using the SP measure (1-2, 1-3, 1-4, 2-3, 2-4, 3-4) By adding \ticks" to the evolutionary tree (Figure 2), it is readily seen that if the SP methods are completely implemented, dierent episodes of the evolutionary history of the protein family are counted dierent numbers of times. In this example, episodes A!B and A!C are each counted four times by the SP method, while episodes B!1, B!2, C!3, and C!4 are each counted only three times. There is no theoretical justication to suggest that these earlier evolutionary episodes are more important than later ones. Thus, SP methods are intrinsically problematic from an evolutionarily perspective for scoring 5

6 msas. Further, it is clear that as the number of sequences in the protein families in general increases as genome projects are completed, the over-weighting of more ancient events in preference to more recent events increases as well. What is needed is a scoring tool that counts each evolutionary episode equally in evaluating a multiple sequence alignment. We report here a \circular sum" (or CS) method for evaluating the quality of a multiple sequence alignment. The method exploits a heuristic that approximates a solution to the Travelling Salesman Problem, which identies a Eulerian tour through an evolutionary tree connecting the sequences in a protein family. The algorithm gives an upper bound, the best score that can possibly be achieved by any msa for a given set of protein sequences. The bound can be derived without an explicit knowledge of the correct evolutionary tree; thus the method can be applied without need to address the mathematical issues involved in tree construction. Last, it gives us an absolute score of multiple sequence alignments which is important in designing and verifying multiple sequence alignment heuristics. 6

7 2 Methods The task of generating an evolutionary unbiased score for a multiple sequence alignment can be rephrased as a simple problem: How can a tree be traversed, going from leaf to leaf, such that all branches (each representing its own episode of evolutionary divergence) are counted the same number of times? Assume that a tree contains 4 leaves, numbered from 1 to 4. The problem can be solved by walking through the tree in a circular order (known as Eulerian tour in a undirected graph), that is, from leaf 1 to 2, from 2 to 3, from 3 to 4, and then back from 4 to leaf 1 (Figure 3). A B C Figure 3: Traversal of a phylogenetic tree in circular order (1-2, 1-3, 1-4, 2-3, 2-4, 3-4) The Eulerian tour traverses all branches and visits each leaf exactly twice. Thus, all branches of the evolutionary tree are weighted equally. It is easily veried that many good circular routes exist, e.g , but not (Figure 4). In the second example, branches A!B and A!C are counted four times, while branches B!1, B!2, C!3, and C!4 at the leaves are counted only twice. 7

8 Circular order A Non-circular order A B C B C Figure 4: Traversal of a phylogenetic tree in a circular ( ) and non-circular order ( ) 2.1 Probabilities and Scores Scores of Pairwise alignments To obtain a score from a Eulerian path traversing an evolutionary tree, we must evaluate a probability of the evolutionary events that are represented by each segment of the tree (or graph) traversed. For example, in Figure 3), the branch B!1 is associated with the probability that the common ancestor of 1 evolved from ancestor B, while the branch B!2 is associated with the probability that the same ancestor evolved to give protein 2. The probability that protein 1 and protein 2 are related is therefore the product of the two probabilities associated with segments B!1 and B!2. To actually calculate these probabilities, one applies a Markovian model for sequence evolution. This begins with an alignment of the two sequences, e.g. VNRLQQNIVSL EVDHKVANYKPQVEPFGHGPIFMATALVPGLYLLPL VNRLQQSIVSLRDAFNDGTKLLEELDHRVLNYKPQANPFGNGPIFMVTAIVPGLHLLPI The gaps arise from insertions (or their counterpart deletions) during divergent evolution. The alignment is normally done by a dynamic programming (DP) algorithm using Dayho 8

9 matrices (Needleman and Wunsch, 1970), which nds the alignment that maximizes the probability that the two sequences evolved from an ancestral sequence as opposed to being random sequences. More precisely, we are comparing two possibilities a) that the two sequences arose independently of each other (implying that the alignment is entirely arbitrary, with amino acid i in one protein being aligned to amino acid j in the other is occurring no more frequently than expected by chance, which is equal to the product of the individual frequencies with which amino acids i and j occur in the database) P rfi and j are independentg = f i f j (1) b) that the two sequences have evolved from some common ancestral sequence after t units of evolution where t is measured in PAM units, the number of point accepted mutations per 100 amino acids (Gonnet, 1994b). x (unknown) ancestor sequence i j present day sequences P rfi and j descended from xg = X x = X x = X x f x P rfx! igp rfx! jg (2) f x (M t ) ix (M t ) jx (3) f j (M t ) ix (M t ) xj (4) = f j (M 2t ) ij = f i (M 2t ) ji (5) 9

10 The entries of the Dayho matrix are the logarithm of the quotient of these two probabilities. D ij = 10 log 10 Scores are dened as follows:! P rfi and j descended from some xg P rfi and j are independentg (6) Score = 10 log 10! P rfeventg P rfnull eventg (7) The constant 10 and the logarithm base 10 are used for historical/practical purposes, so that scores are comparable with standard sequence alignments. Scores are preferred over probabilities because they can be added and have a more reasonable range. Probability of an evolutionary tree conguration To obtain the probability for an entire evolutionary tree conguration, one must sum the logarithms of the probabilities of each event represented by the tree, which yields an overall score that is equivalent to the logarithm of the product of the probabilities of each event. For the example tree in Figure 3, the total evolutionary history has six associated probabilities: A!B, A!C, B!1, B!2, C!3 and C!4. If we knew the sequences of the ancestral proteins at the internal nodes of the tree, the probability of the entire tree is: P rftree g 2g = P rf A!B gp rf A!C gp rf B!1 gp rf B!2 gp rf C!3 gp rf C!4 g (8) 10

11 where P rf X!Y g is interpreted as the probability of mutating from the amino acid in the node X to the amino acid in node Y according to the distance of the branch X-Y. The individual probabilities would be obtained by aligning the protein sequences at the beginning and end of each episode of evolution and scoring them using a Dayho matrix (for example) as described above. Normally, of course, the sequences for the ancestor proteins A, B and C are not known, as the organisms that contained them have long since died. In (Gonnet and Benner, 1996), a formula for the probability of an entire evolutionary conguration (the ensemble of the phylogenetic tree and the corresponding msa) is derived. It corresponds exactly to the notion of computing the probability of traversing each edge of the tree. We can compute the probability of the tree conguration by adding the probabilities of all episodes represented by each of the lines. P rftree g 2g = X A f A X X B C P rf A!B g (9) where P X denotes a sum where X takes all 20 possible amino acid values. The probability associated with the entire evoutionary conguration is: P rfevol.confg = my X X X Y f I1 [k] ::: M da De [k] a ;An [k] a k=1 I 1 [k] I 2 [k] I n?1[k] a (10) Where k runs from 1 to m (each amino acid), I j is the unknown sequence at the internal node j, and hence I j [k] runs through the 20 amino acids, Q a runs through all the branches of the phylogenetic tree, and De [k] a, An [k] a are the descendent/ancestor amino acids at position k of the branch a, and d a is the length in PAM units of the branch a. 11

12 2.2 Travelling Salesman Problem Application The probability (exponential of the score) derived from pairwise alignments are now the key to identifying an Eulerian Tour. For a set of protein sequences it is computationally simple to obtain a set of (n)(n + 1)=2 pairwise scores by aligning each sequence with every other sequence using a DP algorithm to obtain the Optimal Pairwise Alignment. We shall refer to these as OPA scores, to distinguish (see below) these from a pairwise alignment inferred from a msa. These scores can be obtained without either a tree or a msa. To compute probabilities of composed paths, we have to multiply the probabilities. By using logarithms of probabilities (scores) we need to add them. It is readily seen that an Eulerian tour is the shortest tour through a tree. Any other route would evaluate at least one branch or evolutionary event more than twice which would result in a lower probability (and in a longer path). Hence, using such an ordering also implies maximizing the probability of the evolution for a particular phylogenetic tree. This means that we would be using an underlying phylogenetic tree which is the one with the highest probability, or a maximum likelihood tree. To nd such an Eulerian tour using OPA scores, one applies a heuristic for addressing the Travelling Salesman Problem, [ref] a well known task that seeks the shortest route between a set of points that visits each point once 1. Since the CS score using OPA scores is the maximum possible score for a given set of sequences, we refer to it as CS max. 1 We understand that the TSP may be very dicult to solve, in the sense that it may have superpolynomial complexity. For real computations, this is not a problem. First, we only need an approximate solution, as all our distances are estimates and have an inherent random noise. 12

13 To calculate the score, let be the output permutation of the TSP, then the maximum score is max(cs) = 1 2 nx i=1 OP AScore(S i ; S i+1 ) (11) with n+1 = 1 and OPAScore being the score of aligning two sequences as dened above. The maximal score CS max serves as an upper bound. This is an immediate conclusion from the observation that the score for each pair of sequences obtained using dynamic programming is the maximum score of any possible alignment (OPA score). Whenever we score a msa, regardless of the algorithm used to generate it, the score of this alignment must be less than or equal to the upper bound. The upper bound can be used to evaluate a given msa. The closer the score of a calculated msa is to the upper bound, the better (or more likely) the msa is. We have now a way to determine a circular order and thus an upper bound on the score of a msa and its associated evolutionary tree for a set of sequences. This score has been obtained without requiring the calculation of an evolutionary tree. Only the sequences at the leaves and their OPA scores with each other are needed. Nevertheless the CS max score corresponds to the probability of an evolutionary tree conguration. Circular vs non-circular paths To gain an intuition for the dierence between the scores of circular versus non-circular ordering, a simulation was run. Evolution was simulated starting with 100 random sequences, each of which was evolved using a Markovian model of evolution. At the end of the simulation, each ancestor had generated 16 descendants. Each of the 100 trees was scored in three dierent ways. First, the maximum score 13

14 was calculated using the TSP ordering that is calculated from the OPA scores (rst row). Second, real scores have inherent error, implying that CS scores need not be identical if calculated by dierent circular paths, even though all are Eulerian 2. Therefore, the average score of eight other circular orders were calculated (second row). The variance of the circular path lengths is an estimate on the quality of the data 3. Last, the average was calculated for the scores of eight non-circular orderings (third row)(figure 5). Order Score Max score? Score TSP circular non-circular Figure 5: Comparing dierent orders against the TSP order. The second column is the absolute score, the third column is the dierence of the score from the TSP ordering minus the score from the circular or non-circular orders 2 we expect, of course, these scores to be similar 3 Note that the dierence between any circular and any non-circular path is at least twice the length of the shortest branch in the tree, because any non-circular order counts at least one branch more than twice 14

15 3 Scoring of Multiple Sequence Alignments without evolutionary trees The CS algorithm provides a method to evaluate a specic msa or to compare dierent algorithms for consructing msas. When a msa is scored using the CS measure, the score is the probability of such an alignment happening due to common acestry compared to random sequences. To do so, we abandon scores obtained pairwise from OPAs. Instead, the pairwise relationship between two sequences in the protein family are extracted from the multiple sequence alignment itself. These are scored using a Dayho matrix without dynamic programming re-alignment to give an optimal alignment. These are msa-derived pairwise alignments (MPA). MPA scores must be equal to or less than the scores obtained from the OPA pairwise alignments. The CS method can be applied now again without the need for an explitly dened evolutionary tree by repeating the procedure outlined above, but using MPA scores instead of OPA scores as input for a TSP algorithm. The CS score of the msa is then the sum of MPA scores in the circular order divided by two: Score = 1 2 mx nx k=1 i=1 (S i [k]; S i+1 [k]) (12) 6 (a; b) = D ab (13) ( ; b) = (a; ) = Deletion penalty (the whole gap is scored) (14) ( ; ) = 0 (15) (6 b;?) = (?; b) = 0 (16) (17) 15

16 where k runs through each amino acid in a sequence, m is the length of the msa, n is the number of sequences, a and b are amino acids, ' ' is an internal gap (deletion), '6 b' is a gap at the beginning or end of the sequence, and '?' represents any symbol. Note that if there is a deletion in both sequences, there is no penalty (score is zero) because that deletion happened in some ancestor, and its penalty has been counted already. 3.1 Connection between Trees and Msas The formula for deriving the CS score for a msa and for its associated evolutionary tree are identical. When we score a msa with the CS method, we immediately also derive the score of the probability of the entire evolutionary conguration represented by the tree that corresponds to the msa. Since the score was derived from an Eulerian tour, it is maximal and hence it is the score of the maximum likelihood tree for that particular multiple sequence alignment. Similarly, the maximal score CS max based on the OPA scores is both an upper bound on the probability for the maximum likelihood tree and it is also an upper bound on the probability of an msa for the underlying sequences. 16

17 4 Simulation of Evolution To illustrate how the scoring function can be used, a variety of tools for generating msas were challenged with a set of protein families simulated following a Markovian model of evolution, and the outputs of each evaluated using the CS tool. This provides, of course, only an approximate assessment of the msa tools themselves. A better assessment must come with actual experimental sequence data. Random trees with a given structure and branch lengths and a random sequence at the root were generated. From this, sequence mutations, insertions and deletions were introduced according to the length of the branches of the tree. At each internal node a new sequence ws thus generated. At the end of the simulation, only the sequences at the leaves are retained. Since both the places of insertions and deletions, as well as the \real" tree are known, the correct msa is known as well. The retained sequences at the leaves can be given to dierent algorithms (MSA(Gupta et al., 1996; Lipman et al., 1989), MAP(Huang, 1994), ClustalW(Higgins and Sharp, 1989; Thompson et al., 1994) and the Probabilistic model (PAS) (Gonnet and Benner, 1996; Gonnet, 1994a)), and the score of the calculated msas can be compared to the score of the \real" (generated) msa using the CS measure. The results for 3 combinations of trees and sizes are shown. These are representative of all other results (Figures 6-8). For every tree the upper bound (CS max ) was calculated and compared to each of the scores (third column). Higher scores or smaller dierences mean better alignments. The rows are sorted into ascending order in this column. Figure 6 shows the result for balanced binary trees with 16 leaves. The length of the 17

18 Method CS Score Upper bound? CS score Upper bound: MSA: PAS Real: ClustalW: MAP: Dummy: Figure 6: Comparison of dierent msa methods: the CS score (second column) is calculated using a TSP ordering. The upper bound is the CS score based on the optimal pairwise alignment. The smaller the dierence between the upper bound and the CS score is (third column) the better the alignment sequences is 300 amino acids and the average branch distance is 30 PAM (so the maximum PAM distance between two sequences is about 240 PAM). For this tree, MSA scored the best followed by PAS, ClustalW and MAP. The alignments of MSA and PAS score even slightly better than the \real" alignment. The reason for this is that the real msas are not necessarily optimal. As a comparison the score of a bad alignment (all the sequences aligned without deletions) was calculated in the last row (Dummy). The tree on the right is an example of a generated phylogenetic tree. The next trees (Figure 7) are unbalanced with 30 sequences of length 300 and the average branch distance is 30 PAM (so the maximum PAM distance is about 300). Only 18

19 Method CS Score Upper bound? CS score Upper bound: Real: PAS ClustalW: MAP Dummy: Figure 7: Comparison of dierent msa methods: the CS score (second column) is calculated using a TSP ordering. The upper bound is the CS score based on the optimal pairwise alignment. The smaller the dierence between the upper bound and the CS score is (third column) the better the alignment PAS, MAP and ClustalW were able to compute the alignments (in a reasonable time). In this case PAS did slightly better than ClustalW. The trees of the last table shown here (Figure 8) had 50 sequences with length 300 and an average branch length of 15 PAM. In the simulation the PAS method was the best followed by ClustalW. No other methods were able to calculate a msa in a reasonable time. These experiments show how the CS measure can be used. Obviously the ultimate evaluation of tools must be done with real rather than simulated data. 19

20 Method CS Score Upper bound? CS score Upper bound: Real: PAS: ClustalW: Dummy: Figure 8: Comparison of dierent msa methods: the CS score (second column) is calculated using a TSP ordering. The upper bound is the CS score based on the optimal pairwise alignment. The smaller the dierence between the upper bound and the CS score is (third column) the better the alignment 5 Discussion We have dened a new scoring function, called CS measure, for the evaluation of msas that is based on a Markovian model of evolution. The frequently used SP measure, which calculates the sum of the scores of all pairwise alignments, ignores the structure of any associated phylogenetic tree and thus weights some evolutionary events more than others. This can be avoided by traversing the tree in a circular order, where all branches are traversed exactly twice (Eulerian tour). Any other non-circular tour results in a longer path, because some branches are traversed more than twice, and hence some evolutionary events are counted more often, which corresponds to a lower probability. Therefore the shortest path through the tree, wich is an Eulerian tour, yields the highest CS score. Such a tour can be calculated with any TSP algorithm. 20

21 The TSP application allows us to avoid the calculation of an evolutionary tree altogether, and only the sequences at the leaves are needed with their OPA or MPA scores. In addition, the CS measure yields a direct connection between a msa and the associated evolutionary tree, because the formula for the calculation of the score is exactly the same. This allows us to use the CS measure to construct and evaluate evolutionary trees. The CS measure based on the OPA scores gives an upper bound on the score of the optimal msa and evolutionary tree. To illustrate the new scoring tool we simulated evolution by generating random trees according to a Markovian model with the corresponding msa. The CS score of the generated alignment was then compared to the score of alignments calculated by dierent methods (MSA, ClustalW, PAS, and MAP). For less than twenty sequences MSA and PAS gave the best score, whereas for larger samples PAS and ClustalW calculated the best alignments. A better assessment must come with actual experimental sequence data though. With the new CS measure we have now the possibility of improving current algorithms and nding new heuristics by maximizing the scoring function. The description of the new methodologies exceeds the scope of this paper and will be available in (Gonnet and Korostensky, 1998). Our scoring function and msa algorithms can be used freely through the CBRG server at 21

22 References Altschul, S. F. (1989). Gap costs for multiple sequence alignment. J. theoretical Biology, 138:297{309. Dayho, M. O., Schwartz, R. M., and Orcutt, B. C. (1978). A model for evolutionary change in proteins. In Dayho, M. O., editor, Atlas of Protein Sequence and Structure, volume 5, pages 345{352. National Biochemical Research Foundation, Washington DC. Gonnet, G. H. (1994a). New algorithms for the computation of evolutionary phylogenetic trees. In Suhai, S., editor, Computational Methods In Genome Research, pages 153{ 161. Plenum Press, New York. Gonnet, G. H. (1994b). A tutorial introduction to computational biochemistry using Darwin. Gonnet, G. H. and Benner, S. A. (1996). Probabilistic ancestral sequences and multiple alignments. Fifth Scandinavian Workshop on Algorithm Theory, Reykjevik July Gonnet, G. H., Cohen, M. A., and Benner, S. A. (1992). Exhaustive matching of the entire protein sequence database. Science, 256:1443{1445. Gonnet, G. H. and Korostensky, C. (1998). New heuristics for multiple sequence alignments. In preparation. 22

23 Gupta, S. K., Kececioglu, J., and Schaer, A. A. (1996). Improving the practical space and time eciency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. J. Computational Biology. To appear. Higgins, D. and Sharp, P. (1989). Fast abd sensitive multiple sequence alignments on a microcomputer. CABIOS, 5:151{153. Huang, X. (1994). On global sequence alignment. CABIOS, 10(3):227{235. Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F., and Wootton, J. C. (1993). Detecting subtle sequence signals: A gibbs sampling strategy for multiple alignment. Science, 262:208{214. Lipman, D. J., Altschul, S. F., and Kececioglu, J. D. (1989). A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. USA, 86:4412{4415. Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48:443{453. Thompson, J., Higgins, D., and Gibson, T. (1994). Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positionsspecic gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673{

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

More information

Single alignment: Substitution Matrix. 16 march 2017

Single alignment: Substitution Matrix. 16 march 2017 Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block

More information

Effects of Gap Open and Gap Extension Penalties

Effects of Gap Open and Gap Extension Penalties Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See

More information

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

More information

Quantifying sequence similarity

Quantifying sequence similarity Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

Local Alignment Statistics

Local Alignment Statistics Local Alignment Statistics Stephen Altschul National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, MD Central Issues in Biological Sequence Comparison

More information

Pairwise & Multiple sequence alignments

Pairwise & Multiple sequence alignments Pairwise & Multiple sequence alignments Urmila Kulkarni-Kale Bioinformatics Centre 411 007 urmila@bioinfo.ernet.in Basis for Sequence comparison Theory of evolution: gene sequences have evolved/derived

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). 1 Bioinformatics: In-depth PROBABILITY & STATISTICS Spring Semester 2011 University of Zürich and ETH Zürich Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). Dr. Stefanie Muff

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

RECOVERING NORMAL NETWORKS FROM SHORTEST INTER-TAXA DISTANCE INFORMATION

RECOVERING NORMAL NETWORKS FROM SHORTEST INTER-TAXA DISTANCE INFORMATION RECOVERING NORMAL NETWORKS FROM SHORTEST INTER-TAXA DISTANCE INFORMATION MAGNUS BORDEWICH, KATHARINA T. HUBER, VINCENT MOULTON, AND CHARLES SEMPLE Abstract. Phylogenetic networks are a type of leaf-labelled,

More information

Pairwise sequence alignment

Pairwise sequence alignment Department of Evolutionary Biology Example Alignment between very similar human alpha- and beta globins: GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

More information

Substitution matrices

Substitution matrices Introduction to Bioinformatics Substitution matrices Jacques van Helden Jacques.van-Helden@univ-amu.fr Université d Aix-Marseille, France Lab. Technological Advances for Genomics and Clinics (TAGC, INSERM

More information

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches Int. J. Bioinformatics Research and Applications, Vol. x, No. x, xxxx Phylogenies Scores for Exhaustive Maximum Likelihood and s Searches Hyrum D. Carroll, Perry G. Ridge, Mark J. Clement, Quinn O. Snell

More information

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre Bioinformatics Scoring Matrices David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Learning Objectives To explain the requirement

More information

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline Page Evolutionary Trees Russ. ltman MI S 7 Outline. Why build evolutionary trees?. istance-based vs. character-based methods. istance-based: Ultrametric Trees dditive Trees. haracter-based: Perfect phylogeny

More information

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT 3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

More information

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University Sequence Alignment: Scoring Schemes COMP 571 Luay Nakhleh, Rice University Scoring Schemes Recall that an alignment score is aimed at providing a scale to measure the degree of similarity (or difference)

More information

Copyright 2000 N. AYDIN. All rights reserved. 1

Copyright 2000 N. AYDIN. All rights reserved. 1 Introduction to Bioinformatics Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr Multiple Sequence Alignment Outline Multiple sequence alignment introduction to msa methods of msa progressive global alignment

More information

Moreover, the circular logic

Moreover, the circular logic Moreover, the circular logic How do we know what is the right distance without a good alignment? And how do we construct a good alignment without knowing what substitutions were made previously? ATGCGT--GCAAGT

More information

Sequence Alignment Techniques and Their Uses

Sequence Alignment Techniques and Their Uses Sequence Alignment Techniques and Their Uses Sarah Fiorentino Since rapid sequencing technology and whole genomes sequencing, the amount of sequence information has grown exponentially. With all of this

More information

Evolutionary Tree Analysis. Overview

Evolutionary Tree Analysis. Overview CSI/BINF 5330 Evolutionary Tree Analysis Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Distance-Based Evolutionary Tree Reconstruction Character-Based

More information

A New Similarity Measure among Protein Sequences

A New Similarity Measure among Protein Sequences A New Similarity Measure among Protein Sequences Kuen-Pin Wu, Hsin-Nan Lin, Ting-Yi Sung and Wen-Lian Hsu * Institute of Information Science Academia Sinica, Taipei 115, Taiwan Abstract Protein sequence

More information

BLAST: Target frequencies and information content Dannie Durand

BLAST: Target frequencies and information content Dannie Durand Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013 Sequence Alignments Dynamic programming approaches, scoring, and significance Lucy Skrabanek ICB, WMC January 31, 213 Sequence alignment Compare two (or more) sequences to: Find regions of conservation

More information

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Biochemistry 324 Bioinformatics. Pairwise sequence alignment Biochemistry 324 Bioinformatics Pairwise sequence alignment How do we compare genes/proteins? When we have sequenced a genome, we try and identify the function of unknown genes by finding a similar gene

More information

Gibbs Sampling Methods for Multiple Sequence Alignment

Gibbs Sampling Methods for Multiple Sequence Alignment Gibbs Sampling Methods for Multiple Sequence Alignment Scott C. Schmidler 1 Jun S. Liu 2 1 Section on Medical Informatics and 2 Department of Statistics Stanford University 11/17/99 1 Outline Statistical

More information

Sequence Alignment (chapter 6)

Sequence Alignment (chapter 6) Sequence lignment (chapter 6) he biological problem lobal alignment Local alignment Multiple alignment Introduction to bioinformatics, utumn 6 Background: comparative genomics Basic question in biology:

More information

Pairwise sequence alignments

Pairwise sequence alignments Pairwise sequence alignments Volker Flegel VI, October 2003 Page 1 Outline Introduction Definitions Biological context of pairwise alignments Computing of pairwise alignments Some programs VI, October

More information

Sequence Database Search Techniques I: Blast and PatternHunter tools

Sequence Database Search Techniques I: Blast and PatternHunter tools Sequence Database Search Techniques I: Blast and PatternHunter tools Zhang Louxin National University of Singapore Outline. Database search 2. BLAST (and filtration technique) 3. PatternHunter (empowered

More information

BINF6201/8201. Molecular phylogenetic methods

BINF6201/8201. Molecular phylogenetic methods BINF60/80 Molecular phylogenetic methods 0-7-06 Phylogenetics Ø According to the evolutionary theory, all life forms on this planet are related to one another by descent. Ø Traditionally, phylogenetics

More information

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Homology Modeling. Roberto Lins EPFL - summer semester 2005 Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche The molecular structure of a protein can be broken down hierarchically. The primary structure of a protein is simply its

More information

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT 5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT.03.239 03.10.2012 ALIGNMENT Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity. Homology:

More information

Sequence analysis and Genomics

Sequence analysis and Genomics Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

More information

Lecture 15 - NP Completeness 1

Lecture 15 - NP Completeness 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanford.edu) February 29, 2018 Lecture 15 - NP Completeness 1 In the last lecture we discussed how to provide

More information

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17: Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:50 5001 5 Multiple Sequence Alignment The first part of this exposition is based on the following sources, which are recommended reading:

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Non-independence in Statistical Tests for Discrete Cross-species Data

Non-independence in Statistical Tests for Discrete Cross-species Data J. theor. Biol. (1997) 188, 507514 Non-independence in Statistical Tests for Discrete Cross-species Data ALAN GRAFEN* AND MARK RIDLEY * St. John s College, Oxford OX1 3JP, and the Department of Zoology,

More information

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment C E N T R E F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U Introduction to bioinformatics 2007 Lecture 5 Pair-wise Sequence Alignment Bioinformatics Nothing in Biology makes sense except in

More information

Practical considerations of working with sequencing data

Practical considerations of working with sequencing data Practical considerations of working with sequencing data File Types Fastq ->aligner -> reference(genome) coordinates Coordinate files SAM/BAM most complete, contains all of the info in fastq and more!

More information

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55 Pairwise Alignment Guan-Shieng Huang shieng@ncnu.edu.tw Dept. of CSIE, NCNU Pairwise Alignment p.1/55 Approach 1. Problem definition 2. Computational method (algorithms) 3. Complexity and performance Pairwise

More information

Sequence Analysis '17- lecture 8. Multiple sequence alignment

Sequence Analysis '17- lecture 8. Multiple sequence alignment Sequence Analysis '17- lecture 8 Multiple sequence alignment Ex5 explanation How many random database search scores have e-values 10? (Answer: 10!) Why? e-value of x = m*p(s x), where m is the database

More information

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) Alignment principles and homology searching using (PSI-)BLAST Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) http://ibivu.cs.vu.nl Bioinformatics Nothing in Biology makes sense except in

More information

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment Sequence Analysis 17: lecture 5 Substitution matrices Multiple sequence alignment Substitution matrices Used to score aligned positions, usually of amino acids. Expressed as the log-likelihood ratio of

More information

Bio nformatics. Lecture 3. Saad Mneimneh

Bio nformatics. Lecture 3. Saad Mneimneh Bio nformatics Lecture 3 Sequencing As before, DNA is cut into small ( 0.4KB) fragments and a clone library is formed. Biological experiments allow to read a certain number of these short fragments per

More information

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences 140.638 where do sequences come from? DNA is not hard to extract (getting DNA from a

More information

Lecture 3: Markov chains.

Lecture 3: Markov chains. 1 BIOINFORMATIK II PROBABILITY & STATISTICS Summer semester 2008 The University of Zürich and ETH Zürich Lecture 3: Markov chains. Prof. Andrew Barbour Dr. Nicolas Pétrélis Adapted from a course by Dr.

More information

Research Proposal. Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family.

Research Proposal. Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family. Research Proposal Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family. Name: Minjal Pancholi Howard University Washington, DC. June 19, 2009 Research

More information

July 18, Approximation Algorithms (Travelling Salesman Problem)

July 18, Approximation Algorithms (Travelling Salesman Problem) Approximation Algorithms (Travelling Salesman Problem) July 18, 2014 The travelling-salesman problem Problem: given complete, undirected graph G = (V, E) with non-negative integer cost c(u, v) for each

More information

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz Phylogenetic Trees What They Are Why We Do It & How To Do It Presented by Amy Harris Dr Brad Morantz Overview What is a phylogenetic tree Why do we do it How do we do it Methods and programs Parallels

More information

Biologically significant sequence alignments using Boltzmann probabilities

Biologically significant sequence alignments using Boltzmann probabilities Biologically significant sequence alignments using Boltzmann probabilities P. Clote Department of Biology, Boston College Gasson Hall 416, Chestnut Hill MA 02467 clote@bc.edu May 7, 2003 Abstract In this

More information

Reading for Lecture 13 Release v10

Reading for Lecture 13 Release v10 Reading for Lecture 13 Release v10 Christopher Lee November 15, 2011 Contents 1 Evolutionary Trees i 1.1 Evolution as a Markov Process...................................... ii 1.2 Rooted vs. Unrooted Trees........................................

More information

Computational Biology

Computational Biology Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

More information

Fast computation of maximum likelihood trees by numerical approximation of amino acid replacement probabilities

Fast computation of maximum likelihood trees by numerical approximation of amino acid replacement probabilities Computational Statistics & Data Analysis 40 (2002) 285 291 www.elsevier.com/locate/csda Fast computation of maximum likelihood trees by numerical approximation of amino acid replacement probabilities T.

More information

2 Notation and Preliminaries

2 Notation and Preliminaries On Asymmetric TSP: Transformation to Symmetric TSP and Performance Bound Ratnesh Kumar Haomin Li epartment of Electrical Engineering University of Kentucky Lexington, KY 40506-0046 Abstract We show that

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

More information

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel )

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel ) Pairwise sequence alignments Vassilios Ioannidis (From Volker Flegel ) Outline Introduction Definitions Biological context of pairwise alignments Computing of pairwise alignments Some programs Importance

More information

Exercise 5. Sequence Profiles & BLAST

Exercise 5. Sequence Profiles & BLAST Exercise 5 Sequence Profiles & BLAST 1 Substitution Matrix (BLOSUM62) Likelihood to substitute one amino acid with another Figure taken from https://en.wikipedia.org/wiki/blosum 2 Substitution Matrix (BLOSUM62)

More information

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

More information

7.1 Basis for Boltzmann machine. 7. Boltzmann machines

7.1 Basis for Boltzmann machine. 7. Boltzmann machines 7. Boltzmann machines this section we will become acquainted with classical Boltzmann machines which can be seen obsolete being rarely applied in neurocomputing. It is interesting, after all, because is

More information

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

More information

A greedy, graph-based algorithm for the alignment of multiple homologous gene lists

A greedy, graph-based algorithm for the alignment of multiple homologous gene lists A greedy, graph-based algorithm for the alignment of multiple homologous gene lists Jan Fostier, Sebastian Proost, Bart Dhoedt, Yvan Saeys, Piet Demeester, Yves Van de Peer, and Klaas Vandepoele Bioinformatics

More information

O 3 O 4 O 5. q 3. q 4. Transition

O 3 O 4 O 5. q 3. q 4. Transition Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

More information

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky MOLECULAR PHYLOGENY "Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky EVOLUTION - theory that groups of organisms change over time so that descendeants differ structurally

More information

CSE 549: Computational Biology. Substitution Matrices

CSE 549: Computational Biology. Substitution Matrices CSE 9: Computational Biology Substitution Matrices How should we score alignments So far, we ve looked at arbitrary schemes for scoring mutations. How can we assign scores in a more meaningful way? Are

More information

Lecture Notes: Markov chains

Lecture Notes: Markov chains Computational Genomics and Molecular Biology, Fall 5 Lecture Notes: Markov chains Dannie Durand At the beginning of the semester, we introduced two simple scoring functions for pairwise alignments: a similarity

More information

Graph Alignment and Biological Networks

Graph Alignment and Biological Networks Graph Alignment and Biological Networks Johannes Berg http://www.uni-koeln.de/ berg Institute for Theoretical Physics University of Cologne Germany p.1/12 Networks in molecular biology New large-scale

More information

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

More information

Multiple Sequence Alignment

Multiple Sequence Alignment Multiple Sequence Alignment Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than two? And what for? A faint similarity between two sequences

More information

Phylogeny and Evolution. Gina Cannarozzi ETH Zurich Institute of Computational Science

Phylogeny and Evolution. Gina Cannarozzi ETH Zurich Institute of Computational Science Phylogeny and Evolution Gina Cannarozzi ETH Zurich Institute of Computational Science History Aristotle (384-322 BC) classified animals. He found that dolphins do not belong to the fish but to the mammals.

More information

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Introduction to Bioinformatics online course : IBT Jonathan Kayondo Learning Objectives Understand

More information

An ant colony algorithm for multiple sequence alignment in bioinformatics

An ant colony algorithm for multiple sequence alignment in bioinformatics An ant colony algorithm for multiple sequence alignment in bioinformatics Jonathan Moss and Colin G. Johnson Computing Laboratory University of Kent at Canterbury Canterbury, Kent, CT2 7NF, England. C.G.Johnson@ukc.ac.uk

More information

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6) Sequence lignment (chapter ) he biological problem lobal alignment Local alignment Multiple alignment Background: comparative genomics Basic question in biology: what properties are shared among organisms?

More information

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence Page Hidden Markov models and multiple sequence alignment Russ B Altman BMI 4 CS 74 Some slides borrowed from Scott C Schmidler (BMI graduate student) References Bioinformatics Classic: Krogh et al (994)

More information

A profile-based protein sequence alignment algorithm for a domain clustering database

A profile-based protein sequence alignment algorithm for a domain clustering database A profile-based protein sequence alignment algorithm for a domain clustering database Lin Xu,2 Fa Zhang and Zhiyong Liu 3, Key Laboratory of Computer System and architecture, the Institute of Computing

More information

Alignment & BLAST. By: Hadi Mozafari KUMS

Alignment & BLAST. By: Hadi Mozafari KUMS Alignment & BLAST By: Hadi Mozafari KUMS SIMILARITY - ALIGNMENT Comparison of primary DNA or protein sequences to other primary or secondary sequences Expecting that the function of the similar sequence

More information

BIOINFORMATICS: An Introduction

BIOINFORMATICS: An Introduction BIOINFORMATICS: An Introduction What is Bioinformatics? The term was first coined in 1988 by Dr. Hwa Lim The original definition was : a collective term for data compilation, organisation, analysis and

More information

... and searches for related sequences probably make up the vast bulk of bioinformatics activities.

... and searches for related sequences probably make up the vast bulk of bioinformatics activities. 1 2 ... and searches for related sequences probably make up the vast bulk of bioinformatics activities. 3 The terms homology and similarity are often confused and used incorrectly. Homology is a quality.

More information

Concepts and Methods in Molecular Divergence Time Estimation

Concepts and Methods in Molecular Divergence Time Estimation Concepts and Methods in Molecular Divergence Time Estimation 26 November 2012 Prashant P. Sharma American Museum of Natural History Overview 1. Why do we date trees? 2. The molecular clock 3. Local clocks

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression

More information

1.5 Sequence alignment

1.5 Sequence alignment 1.5 Sequence alignment The dramatic increase in the number of sequenced genomes and proteomes has lead to development of various bioinformatic methods and algorithms for extracting information (data mining)

More information

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018 CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of

More information

Statistical Distributions of Optimal Global Alignment Scores of Random Protein Sequences

Statistical Distributions of Optimal Global Alignment Scores of Random Protein Sequences BMC Bioinformatics This Provisional PDF corresponds to the article as it appeared upon acceptance. The fully-formatted PDF version will become available shortly after the date of publication, from the

More information

Protein Threading. BMI/CS 776 Colin Dewey Spring 2015

Protein Threading. BMI/CS 776  Colin Dewey Spring 2015 Protein Threading BMI/CS 776 www.biostat.wisc.edu/bmi776/ Colin Dewey cdewey@biostat.wisc.edu Spring 2015 Goals for Lecture the key concepts to understand are the following the threading prediction task

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Multiple Sequence Alignment

Multiple Sequence Alignment Multiple Sequence Alignment Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences.! What about more than two? And what for?! A faint similarity between two

More information

Sequence Analysis '17 -- lecture 7

Sequence Analysis '17 -- lecture 7 Sequence Analysis '17 -- lecture 7 Significance E-values How significant is that? Please give me a number for......how likely the data would not have been the result of chance,......as opposed to......a

More information

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

CHAPTER 3 FUNDAMENTALS OF COMPUTATIONAL COMPLEXITY. E. Amaldi Foundations of Operations Research Politecnico di Milano 1

CHAPTER 3 FUNDAMENTALS OF COMPUTATIONAL COMPLEXITY. E. Amaldi Foundations of Operations Research Politecnico di Milano 1 CHAPTER 3 FUNDAMENTALS OF COMPUTATIONAL COMPLEXITY E. Amaldi Foundations of Operations Research Politecnico di Milano 1 Goal: Evaluate the computational requirements (this course s focus: time) to solve

More information

General context Anchor-based method Evaluation Discussion. CoCoGen meeting. Accuracy of the anchor-based strategy for genome alignment.

General context Anchor-based method Evaluation Discussion. CoCoGen meeting. Accuracy of the anchor-based strategy for genome alignment. CoCoGen meeting Accuracy of the anchor-based strategy for genome alignment Raluca Uricaru LIRMM, CNRS Université de Montpellier 2 3 octobre 2008 1 / 31 Summary 1 General context 2 Global alignment : anchor-based

More information

Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics Tools and Algorithms in Bioinformatics GCBA815, Fall 2015 Week-4 BLAST Algorithm Continued Multiple Sequence Alignment Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and

More information