Evaluation Measures of Multiple Sequence Alignments. Gaston H. Gonnet, *Chantal Korostensky and Steve Benner. Institute for Scientific Computing
ETH Zurich, 8092 Zuerich, Switzerland. E-mail: {gonnet,korosten}@inf.ethz.ch
Abstract

Multiple sequence alignments (msas) are frequently used in the study of families of protein sequences or DNA/RNA sequences. They are a fundamental tool for understanding the structure, function and ultimately the evolution of proteins. A new algorithm, the Circular Sum (CS) method, is presented for formally evaluating the quality of a multiple sequence alignment (msa). It is based on a heuristic that approximates a solution to the Travelling Salesman Problem, which identifies an Eulerian tour through an evolutionary tree connecting the sequences in a protein family. The algorithm gives an upper bound, the best score that can possibly be achieved by any msa for a given set of protein sequences. Alternatively, if presented with a specific msa, the algorithm provides a formal score for the msa, which serves as an absolute measure of its quality.

The CS measure can be calculated without the need to construct a corresponding evolutionary tree. It does, however, yield a direct connection between a msa and the associated evolutionary tree, which allows us to use the measure to construct and evaluate evolutionary trees and as a tool for evaluating different methods for producing msas. A brief example of the last application is provided. Because it weights all evolutionary events on a tree identically, the CS algorithm has advantages over the frequently used sum-of-pairs measures for scoring msas, which weight some evolutionary events more strongly than others.

Keywords: multiple sequence alignment, phylogenetic tree, scoring function, TSP, evolution
1 Introduction

One of the most important applications of the data being generated from genome sequencing projects is to reconstruct models for the evolutionary histories of proteins, nucleic acids, and the organisms that carry them. A model for an evolutionary history typically consists of three parts: (a) an evolutionary tree, which shows the familial relationships of evolutionary objects (proteins, for example), (b) a multiple sequence alignment, which shows the evolutionary relationships between parts of these objects (in this case, individual amino acids in the protein sequence), and (c) reconstructed ancestral sequences, models for objects that were intermediates in the evolution of the family. Evolutionary histories are important in deducing biological function in biomolecules, predicting the folded conformation of protein sequences, and reconstructing the history of life on Earth.

Constructing trees, multiple sequence alignments, and ancestral sequences encounters three sorts of problems. First, all are dependent on a model for evolutionary processes. At the level of the protein (which will be our exclusive concern here), modelling the evolutionary history of a family must begin with assumptions about the frequencies of amino acid mutations, insertions, and deletions. These frequencies are now available from empirical studies of protein sequences (Gonnet et al., 1992). In this work, a simple model, derived originally from the work of Margaret Dayhoff (Dayhoff et al., 1978) and subsequently amplified, will be used.

Second, constructing models for evolutionary histories encounters problems of mathematical complexity. The multiple sequence alignment (msa) problem and the construction of evolutionary trees are NP-complete, and become computationally expensive for many
real protein families encountered in a contemporary database. As a result, virtually all msas and evolutionary trees used in the post-genomic era will be constructed using approximate heuristics.

The use of heuristics in constructing multiple sequence alignments creates a third problem, one centering on evaluation. Before using a msa or tree built by a heuristic, one would like to know approximately how closely the heuristic has approximated an optimum msa or tree. Even today, it is common for biochemists to evaluate by eye (and adjust by hand) the output of multiple sequence alignment tools. This is clearly an inadequate approach for any systematic reconstruction of natural history using genomic sequence data. A formal method for judging the quality of a multiple sequence alignment is needed.

Accordingly, a variety of groups have proposed or used scoring functions (Altschul, 1989; Thompson et al., 1994; Higgins and Sharp, 1989; Lawrence et al., 1993) that assess the quality of a msa, following a simple approach that examines every pair of proteins in the family, generates a score for each pairwise alignment using a Dayhoff matrix (Dayhoff et al., 1978), and creates a score for the msa by summing the scores of the pairwise alignments. We shall call these "sum of pairs" (SP) methods.

Figure 1: A sample evolutionary tree (phylogenetic tree)
Although simple to implement, SP methods are obviously deficient from an evolutionary perspective. Consider a simple tree (Figure 1) constructed for a family containing four proteins. The score of a pairwise alignment of sequences 1 and 2 (as generated by a standard SP scoring method) evaluates the likelihood of evolutionary events on edges B→1 and B→2 of the tree, the edges that represent the evolutionary distance from sequence 1 to sequence 2. Likewise, the score of a pairwise alignment of sequences 3 and 4 evaluates the likelihood of evolutionary events on edges C→3 and C→4 of the tree. By the same logic, scoring a pairwise alignment of sequences 1 and 3 evaluates the likelihood of evolutionary events on edges A→B, B→1, A→C, and C→3 of the tree.

Figure 2: Traversal of a phylogenetic tree using the SP measure (1-2, 1-3, 1-4, 2-3, 2-4, 3-4)

By adding "ticks" to the evolutionary tree (Figure 2), it is readily seen that if the SP methods are completely implemented, different episodes of the evolutionary history of the protein family are counted different numbers of times. In this example, episodes A→B and A→C are each counted four times by the SP method, while episodes B→1, B→2, C→3, and C→4 are each counted only three times. There is no theoretical justification to suggest that these earlier evolutionary episodes are more important than later ones. Thus, SP methods are intrinsically problematic from an evolutionary perspective for scoring
msas. Further, as genome projects are completed and the number of sequences in protein families grows, the over-weighting of more ancient events relative to more recent events increases as well. What is needed is a scoring tool that counts each evolutionary episode equally in evaluating a multiple sequence alignment.

We report here a "circular sum" (or CS) method for evaluating the quality of a multiple sequence alignment. The method exploits a heuristic that approximates a solution to the Travelling Salesman Problem, which identifies an Eulerian tour through an evolutionary tree connecting the sequences in a protein family. The algorithm gives an upper bound, the best score that can possibly be achieved by any msa for a given set of protein sequences. The bound can be derived without explicit knowledge of the correct evolutionary tree; thus the method can be applied without the need to address the mathematical issues involved in tree construction. Last, it gives us an absolute score for multiple sequence alignments, which is important in designing and verifying multiple sequence alignment heuristics.
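The edge counts quoted for the SP measure can be checked with a short script. The following is a minimal sketch, assuming the four-leaf tree of Figure 1 (root A, internal nodes B and C, leaves 1 to 4); the dictionary layout and function names are illustrative and not part of the original method.

```python
from collections import Counter
from itertools import combinations

# Four-leaf tree of Figure 1, stored as child -> parent (illustrative encoding).
PARENT = {"B": "A", "C": "A", "1": "B", "2": "B", "3": "C", "4": "C"}

def edges_to_root(node):
    """Return the set of edges (child, parent) between node and the root."""
    edges = set()
    while node in PARENT:
        edges.add((node, PARENT[node]))
        node = PARENT[node]
    return edges

def path_edges(u, v):
    """Edges on the unique path between u and v (shared ancestors cancel)."""
    return edges_to_root(u) ^ edges_to_root(v)

def edge_counts(pairs):
    """Count how often each edge is evaluated by the given leaf pairs."""
    counts = Counter()
    for u, v in pairs:
        counts.update(path_edges(u, v))
    return counts

# The SP measure scores every pair of leaves.
sp_counts = edge_counts(combinations("1234", 2))
for (child, parent), count in sorted(sp_counts.items()):
    print(f"{parent}->{child}: {count}")
```

With all six pairs, the two internal edges are each used four times and the four leaf edges three times, which is the imbalance described above.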
2 Methods

The task of generating an evolutionarily unbiased score for a multiple sequence alignment can be rephrased as a simple problem: How can a tree be traversed, going from leaf to leaf, such that all branches (each representing its own episode of evolutionary divergence) are counted the same number of times?

Assume that a tree contains 4 leaves, numbered from 1 to 4. The problem can be solved by walking through the tree in a circular order (known as an Eulerian tour in an undirected graph), that is, from leaf 1 to 2, from 2 to 3, from 3 to 4, and then back from 4 to leaf 1 (Figure 3).

Figure 3: Traversal of a phylogenetic tree in circular order (1-2, 2-3, 3-4, 4-1)

The Eulerian tour traverses every branch exactly twice, and each leaf is the endpoint of exactly two of the pairwise paths. Thus, all branches of the evolutionary tree are weighted equally. It is easily verified that many good circular routes exist, but that not every ordering of the leaves is circular (Figure 4). In the non-circular example of Figure 4, branches A→B and A→C are counted four times, while branches B→1, B→2, C→3, and C→4 at the leaves are counted only twice.
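The difference between a circular and a non-circular order can be made concrete by counting branch traversals. This is a sketch under the same assumed four-leaf tree; the non-circular order 1, 3, 2, 4 is chosen here purely for illustration and is not taken from the original figure.

```python
from collections import Counter

# Same illustrative tree encoding as before: child -> parent.
PARENT = {"B": "A", "C": "A", "1": "B", "2": "B", "3": "C", "4": "C"}

def edges_to_root(node):
    """Return the set of edges (child, parent) between node and the root."""
    edges = set()
    while node in PARENT:
        edges.add((node, PARENT[node]))
        node = PARENT[node]
    return edges

def tour_edge_counts(order):
    """Count branch traversals when visiting the leaves in the given cyclic order."""
    counts = Counter()
    n = len(order)
    for i in range(n):
        u, v = order[i], order[(i + 1) % n]  # wrap around to close the tour
        counts.update(edges_to_root(u) ^ edges_to_root(v))
    return counts

circular = tour_edge_counts(["1", "2", "3", "4"])
non_circular = tour_edge_counts(["1", "3", "2", "4"])  # hypothetical example order

print("circular:    ", dict(circular))
print("non-circular:", dict(non_circular))
```

In the circular order every branch is traversed twice. In the chosen non-circular order every step crosses the root, so the two internal branches are traversed four times while the leaf branches are still traversed twice.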
Figure 4: Traversal of a phylogenetic tree in a circular and in a non-circular order (panels: "Circular order", "Non-circular order")

2.1 Probabilities and Scores

Scores of pairwise alignments

To obtain a score from an Eulerian path traversing an evolutionary tree, we must evaluate the probability of the evolutionary events that are represented by each segment of the tree (or graph) traversed. For example, in Figure 3, the branch B→1 is associated with the probability that protein 1 evolved from the common ancestor B, while the branch B→2 is associated with the probability that the same ancestor evolved to give protein 2. The probability that protein 1 and protein 2 are related is therefore the product of the two probabilities associated with segments B→1 and B→2.

To actually calculate these probabilities, one applies a Markovian model for sequence evolution. This begins with an alignment of the two sequences, e.g.

VNRLQQNIVSL------------EVDHKVANYKPQVEPFGHGPIFMATALVPGLYLLPL
VNRLQQSIVSLRDAFNDGTKLLEELDHRVLNYKPQANPFGNGPIFMVTAIVPGLHLLPI

The gaps arise from insertions (or their counterpart, deletions) during divergent evolution. The alignment is normally done by a dynamic programming (DP) algorithm using Dayhoff
matrices (Needleman and Wunsch, 1970), which finds the alignment that maximizes the probability that the two sequences evolved from an ancestral sequence as opposed to being random sequences. More precisely, we are comparing two possibilities:

a) that the two sequences arose independently of each other. This implies that the alignment is entirely arbitrary: amino acid i in one protein is aligned to amino acid j in the other no more frequently than expected by chance, which is the product of the individual frequencies with which amino acids i and j occur in the database:

$$\Pr\{i \text{ and } j \text{ are independent}\} = f_i f_j \tag{1}$$

b) that the two sequences have evolved from some common ancestral sequence after t units of evolution, where t is measured in PAM units, the number of point accepted mutations per 100 amino acids (Gonnet, 1994b). Here x denotes an amino acid of the (unknown) ancestor sequence, and i and j the corresponding amino acids of the present-day sequences:

$$\begin{aligned}
\Pr\{i \text{ and } j \text{ descended from } x\} &= \sum_x f_x \Pr\{x \to i\} \Pr\{x \to j\} && (2)\\
&= \sum_x f_x (M^t)_{ix} (M^t)_{jx} && (3)\\
&= \sum_x f_j (M^t)_{ix} (M^t)_{xj} && (4)\\
&= f_j (M^{2t})_{ij} = f_i (M^{2t})_{ji} && (5)
\end{aligned}$$
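The chain of equalities (2) to (5) relies on the mutation matrix being reversible, that is, f_x (M^t)_{jx} = f_j (M^t)_{xj}. The following sketch checks the identity numerically on a toy three-letter alphabet. The frequencies, the parameter `a`, and the matrix construction are assumptions made only for illustration; they are not Dayhoff data.

```python
import math

# Toy 3-letter alphabet with hypothetical frequencies (not real amino acid data).
f = [0.5, 0.3, 0.2]
a = 0.1  # hypothetical probability that a site is redrawn from f in one step
n = len(f)

# M[i][x] = Pr{x -> i}. Columns sum to 1 and f[x] * M[i][x] is symmetric,
# so this toy model is reversible.
M = [[(1 - a) * (i == x) + a * f[i] for x in range(n)] for i in range(n)]

def matmul(A, B):
    """Plain matrix product for small square matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

M2 = matmul(M, M)  # plays the role of M^{2t} when M plays the role of M^t

def pr_common_ancestor(i, j):
    """Equation (3): sum over the unknown ancestral letter x."""
    return sum(f[x] * M[i][x] * M[j][x] for x in range(n))

def log_odds(i, j):
    """10 log10 of the ratio of probability (b) to probability (a)."""
    return 10 * math.log10(pr_common_ancestor(i, j) / (f[i] * f[j]))

for i in range(n):
    for j in range(n):
        lhs = pr_common_ancestor(i, j)
        assert math.isclose(lhs, f[j] * M2[i][j])  # equation (5), first form
        assert math.isclose(lhs, f[i] * M2[j][i])  # equation (5), second form

print([[round(log_odds(i, j), 2) for j in range(n)] for i in range(n)])
```

In this toy model the log-odds values are symmetric, positive on the diagonal (identical letters are more likely under common ancestry than by chance) and negative off the diagonal.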
The entries of the Dayhoff matrix are the logarithm of the quotient of these two probabilities:

$$D_{ij} = 10 \log_{10} \left( \frac{\Pr\{i \text{ and } j \text{ descended from some } x\}}{\Pr\{i \text{ and } j \text{ are independent}\}} \right) \tag{6}$$

Scores are defined as follows:

$$\text{Score} = 10 \log_{10} \left( \frac{\Pr\{\text{event}\}}{\Pr\{\text{null event}\}} \right) \tag{7}$$

The constant 10 and the logarithm base 10 are used for historical and practical reasons, so that scores are comparable with standard sequence alignments. Scores are preferred over probabilities because they can be added and have a more reasonable range.

Probability of an evolutionary tree configuration

To obtain the probability for an entire evolutionary tree configuration, one must sum the logarithms of the probabilities of each event represented by the tree, which yields an overall score that is equivalent to the logarithm of the product of the probabilities of each event. For the example tree in Figure 3, the total evolutionary history has six associated probabilities: A→B, A→C, B→1, B→2, C→3 and C→4. If we knew the sequences of the ancestral proteins at the internal nodes of the tree, the probability of the entire tree would be:

$$\Pr\{\text{tree of Fig. 2}\} = \Pr\{A \to B\} \Pr\{A \to C\} \Pr\{B \to 1\} \Pr\{B \to 2\} \Pr\{C \to 3\} \Pr\{C \to 4\} \tag{8}$$
where Pr{X→Y} is interpreted as the probability of mutating from the amino acid in node X to the amino acid in node Y according to the length of the branch X-Y. The individual probabilities would be obtained by aligning the protein sequences at the beginning and end of each episode of evolution and scoring them using a Dayhoff matrix (for example) as described above.

Normally, of course, the sequences for the ancestor proteins A, B and C are not known, as the organisms that contained them have long since died. In (Gonnet and Benner, 1996), a formula for the probability of an entire evolutionary configuration (the ensemble of the phylogenetic tree and the corresponding msa) is derived. It corresponds exactly to the notion of computing the probability of traversing each edge of the tree. We can compute the probability of the tree configuration by summing, over all possible amino acids at the unknown internal nodes, the product of the probabilities of the episodes represented by the branches:

$$\Pr\{\text{tree of Fig. 2}\} = \sum_A f_A \sum_B \sum_C \Pr\{A \to B\} \Pr\{A \to C\} \Pr\{B \to 1\} \Pr\{B \to 2\} \Pr\{C \to 3\} \Pr\{C \to 4\} \tag{9}$$

where $\sum_X$ denotes a sum in which X takes all 20 possible amino acid values. The probability associated with the entire evolutionary configuration is:

$$\Pr\{\text{evol. conf.}\} = \prod_{k=1}^{m} \; \sum_{I_1[k]} \sum_{I_2[k]} \cdots \sum_{I_{n-1}[k]} f_{I_1[k]} \prod_{a} \left(M^{d_a}\right)_{De_a^{[k]},\, An_a^{[k]}} \tag{10}$$

where k runs from 1 to m (each position of the alignment), $I_j$ is the unknown sequence at internal node j, and hence $I_j[k]$ runs through the 20 amino acids, $\prod_a$ runs through all the branches of the phylogenetic tree, $De_a^{[k]}$ and $An_a^{[k]}$ are the descendant and ancestor amino acids at position k of branch a, and $d_a$ is the length in PAM units of branch a.
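For a single alignment position, the sum over unknown ancestral amino acids in equation (9) can be written out directly. The sketch below does so for the four-leaf tree on a toy three-letter alphabet. The frequencies, branch lengths, and the simple transition model in `trans` are hypothetical stand-ins for a real PAM-based mutation matrix.

```python
import math
from itertools import product

f = [0.5, 0.3, 0.2]  # hypothetical letter frequencies (toy alphabet)
LETTERS = range(len(f))

def trans(d):
    """Toy transition matrix for a branch of length d: T[y][x] = Pr{x -> y}."""
    keep = math.exp(-d)  # probability that no replacement event occurs
    return [[keep * (y == x) + (1 - keep) * f[y] for x in LETTERS]
            for y in LETTERS]

# Hypothetical branch lengths for the tree A -> (B -> 1, 2), (C -> 3, 4).
LENGTHS = {"AB": 0.2, "AC": 0.3, "B1": 0.1, "B2": 0.4, "C3": 0.25, "C4": 0.15}
T = {name: trans(d) for name, d in LENGTHS.items()}

def tree_probability(s1, s2, s3, s4):
    """Equation (9) for one position: sum over the ancestral letters A, B, C."""
    total = 0.0
    for A, B, C in product(LETTERS, repeat=3):
        total += (f[A] * T["AB"][B][A] * T["AC"][C][A]
                  * T["B1"][s1][B] * T["B2"][s2][B]
                  * T["C3"][s3][C] * T["C4"][s4][C])
    return total

p_same = tree_probability(0, 0, 0, 0)   # all four leaves carry the same letter
p_mixed = tree_probability(0, 1, 2, 0)  # leaves disagree within both subtrees
print(p_same, p_mixed)
```

As a sanity check on the construction, the probabilities of all possible leaf assignments sum to one, and a column of identical letters is more probable than a mixed one for these short branches. Equation (10) multiplies such per-position terms over all m positions of the alignment.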
2.2 Travelling Salesman Problem Application

The probabilities (exponentials of the scores) derived from pairwise alignments are now the key to identifying an Eulerian tour. For a set of n protein sequences it is computationally simple to obtain a set of n(n − 1)/2 pairwise scores by aligning each sequence with every other sequence using a DP algorithm to obtain the Optimal Pairwise Alignment. We shall refer to these as OPA scores, to distinguish them (see below) from the score of a pairwise alignment inferred from a msa. These scores can be obtained without either a tree or a msa.

To compute probabilities of composed paths, we have to multiply the probabilities. By using logarithms of probabilities (scores) we need only add them. It is readily seen that an Eulerian tour is the shortest tour through a tree. Any other route would evaluate at least one branch or evolutionary event more than twice, which would result in a lower probability (and in a longer path). Hence, using such an ordering also implies maximizing the probability of the evolution for a particular phylogenetic tree. This means that we would be using an underlying phylogenetic tree which is the one with the highest probability, or a maximum likelihood tree.

To find such an Eulerian tour using OPA scores, one applies a heuristic for addressing the Travelling Salesman Problem [ref], a well-known task that seeks the shortest route between a set of points that visits each point once (Footnote 1). Since the CS score using OPA scores is the maximum possible score for a given set of sequences, we refer to it as CS_max.

Footnote 1: We understand that the TSP may be very difficult to solve, in the sense that it may have superpolynomial complexity. For real computations, this is not a problem. First, we only need an approximate solution, as all our distances are estimates and have an inherent random noise.
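The claim that the shortest tour through the leaves is a circular one can be illustrated on a small example. This sketch uses leaf-to-leaf path lengths on a tree with hypothetical branch lengths as a stand-in for distances derived from OPA scores, and an exhaustive search in place of a TSP heuristic. Exhaustive search is only feasible for a handful of sequences; the method described above would rely on an approximate TSP solver.

```python
from itertools import permutations

# Hypothetical branch lengths, stored as child: (parent, length).
BRANCH = {"B": ("A", 20), "C": ("A", 30), "1": ("B", 10),
          "2": ("B", 15), "3": ("C", 25), "4": ("C", 5)}
LEAVES = ["1", "2", "3", "4"]

def edges_up(node):
    """Map each edge above node (named by its child end) to its length."""
    edges = {}
    while node in BRANCH:
        parent, length = BRANCH[node]
        edges[node] = length
        node = parent
    return edges

def dist(u, v):
    """Path length between u and v: edges above exactly one of the two nodes."""
    eu, ev = edges_up(u), edges_up(v)
    return (sum(l for k, l in eu.items() if k not in ev)
            + sum(l for k, l in ev.items() if k not in eu))

def tour_length(order):
    """Total length of the closed tour visiting the leaves in the given order."""
    n = len(order)
    return sum(dist(order[i], order[(i + 1) % n]) for i in range(n))

def shortest_tour(leaves):
    """Exhaustive TSP over all orders that start at the first leaf."""
    first, rest = leaves[0], leaves[1:]
    return min(([first, *p] for p in permutations(rest)), key=tour_length)

best = shortest_tour(LEAVES)
total_branch_length = sum(length for _, length in BRANCH.values())
print(best, tour_length(best), 2 * total_branch_length)
```

The shortest tour found has length equal to twice the total branch length, as expected when every branch is traversed exactly twice, whereas the order 1, 3, 2, 4 is strictly longer.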
To calculate the score, let π be the output permutation of the TSP; then the maximum score is

$$\mathrm{CS}_{\max} = \frac{1}{2} \sum_{i=1}^{n} \mathrm{OPAScore}(S_{\pi_i}, S_{\pi_{i+1}}) \tag{11}$$

with $\pi_{n+1} = \pi_1$ and OPAScore being the score of aligning two sequences as defined above.

The maximal score CS_max serves as an upper bound. This follows immediately from the observation that the score for each pair of sequences obtained using dynamic programming is the maximum score of any possible alignment (the OPA score). Whenever we score a msa, regardless of the algorithm used to generate it, the score of this alignment must be less than or equal to the upper bound. The upper bound can be used to evaluate a given msa. The closer the score of a calculated msa is to the upper bound, the better (or more likely) the msa is.

We now have a way to determine a circular order and thus an upper bound on the score of a msa and its associated evolutionary tree for a set of sequences. This score has been obtained without requiring the calculation of an evolutionary tree. Only the sequences at the leaves and their OPA scores with each other are needed. Nevertheless, the CS_max score corresponds to the probability of an evolutionary tree configuration.

Circular vs non-circular paths

To gain an intuition for the difference between the scores of circular and non-circular orderings, a simulation was run. Evolution was simulated starting with 100 random sequences, each of which was evolved using a Markovian model of evolution. At the end of the simulation, each ancestor had generated 16 descendants. Each of the 100 trees was scored in three different ways. First, the maximum score
was calculated using the TSP ordering that is calculated from the OPA scores (first row). Second, real scores have inherent error, implying that CS scores need not be identical if calculated by different circular paths, even though all are Eulerian (Footnote 2). Therefore, the average score of eight other circular orders was calculated (second row). The variance of the circular path lengths is an estimate of the quality of the data (Footnote 3). Last, the average was calculated for the scores of eight non-circular orderings (third row) (Figure 5).

Figure 5: Comparing different orders against the TSP order. The second column is the absolute score; the third column is the score from the TSP ordering minus the score from the circular or non-circular orders. [Table columns: Order; Score; Max score − Score. Rows: TSP, circular, non-circular.]

Footnote 2: We expect, of course, these scores to be similar.

Footnote 3: Note that the difference between any circular and any non-circular path is at least twice the length of the shortest branch in the tree, because any non-circular order counts at least one branch more than twice.
3 Scoring of Multiple Sequence Alignments without evolutionary trees

The CS algorithm provides a method to evaluate a specific msa or to compare different algorithms for constructing msas. When a msa is scored using the CS measure, the score expresses the probability of such an alignment arising from common ancestry relative to the probability of it arising from random sequences. To do so, we abandon scores obtained pairwise from OPAs. Instead, the pairwise relationships between two sequences in the protein family are extracted from the multiple sequence alignment itself. These are scored using a Dayhoff matrix, without the dynamic programming re-alignment that would give an optimal alignment. These are msa-derived pairwise alignments (MPA). MPA scores must be less than or equal to the scores obtained from the OPA pairwise alignments.

The CS method can now be applied again without the need for an explicitly defined evolutionary tree by repeating the procedure outlined above, but using MPA scores instead of OPA scores as input for a TSP algorithm. The CS score of the msa is then the sum of MPA scores in the circular order divided by two. With the sequences S_1, ..., S_n taken in circular order, S_{n+1} = S_1, and σ denoting the score of a single aligned position:

$$\text{Score} = \frac{1}{2} \sum_{k=1}^{m} \sum_{i=1}^{n} \sigma(S_i[k], S_{i+1}[k]) \tag{12}$$

$$\sigma(a, b) = D_{ab} \tag{13}$$

$$\sigma(-, b) = \sigma(a, -) = \text{deletion penalty (the whole gap is scored)} \tag{14}$$

$$\sigma(-, -) = 0 \tag{15}$$

$$\sigma(\not b, ?) = \sigma(?, \not b) = 0 \tag{16}$$
where k runs through the positions of the alignment, m is the length of the msa, n is the number of sequences, a and b are amino acids, '-' is an internal gap (deletion), '$\not b$' is a gap at the beginning or end of the sequence, and '?' represents any symbol. Note that if there is a deletion in both sequences, there is no penalty (the score is zero) because that deletion happened in some ancestor, and its penalty has been counted already.

3.1 Connection between Trees and Msas

The formulas for deriving the CS score for a msa and for its associated evolutionary tree are identical. When we score a msa with the CS method, we immediately also derive the score of the probability of the entire evolutionary configuration represented by the tree that corresponds to the msa. Since the score was derived from an Eulerian tour, it is maximal and hence it is the score of the maximum likelihood tree for that particular multiple sequence alignment. Similarly, the maximal score CS_max based on the OPA scores is both an upper bound on the probability for the maximum likelihood tree and an upper bound on the probability of a msa for the underlying sequences.
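The scoring rules of equations (12) to (16) can be sketched as follows. The substitution scores (`MATCH`, `MISMATCH`) and the flat per-gap penalty are hypothetical placeholders for a Dayhoff matrix and its deletion penalty, the circular order is assumed to be given, and the four-row alignment is invented for the example.

```python
GAP = "-"
# Hypothetical stand-ins for Dayhoff matrix entries and the deletion penalty.
MATCH, MISMATCH, GAP_PENALTY = 5, -2, -8

def core_span(row):
    """Indices [start, end) from first to last residue; outside lie end gaps."""
    idx = [k for k, c in enumerate(row) if c != GAP]
    return (idx[0], idx[-1] + 1) if idx else (0, 0)

def mpa_score(x, y):
    """Score the pairwise alignment induced by the msa, without re-aligning."""
    lo = max(core_span(x)[0], core_span(y)[0])  # end gaps score zero (eq. 16)
    hi = min(core_span(x)[1], core_span(y)[1])
    score, open_gap = 0, None
    for a, b in zip(x[lo:hi], y[lo:hi]):
        if a == GAP and b == GAP:
            continue                      # gap in both rows scores zero (eq. 15)
        if a == GAP or b == GAP:
            side = 0 if a == GAP else 1
            if open_gap != side:          # the whole gap is scored once (eq. 14)
                score += GAP_PENALTY
                open_gap = side
        else:
            score += MATCH if a == b else MISMATCH  # substitution score (eq. 13)
            open_gap = None
    return score

def cs_score(msa, order):
    """Equation (12): half the sum of MPA scores around the circular order."""
    n = len(order)
    return sum(mpa_score(msa[order[i]], msa[order[(i + 1) % n]])
               for i in range(n)) / 2

msa = {
    "1": "ACDEFGH",
    "2": "ACDE-GH",
    "3": "ACNE-GH",
    "4": "--NEFGH",
}
score = cs_score(msa, ["1", "2", "3", "4"])
print(score)
```

For this toy alignment the four pairwise scores around the tour are 22, 23, 12 and 18, giving a CS score of 37.5. Replacing the placeholder scores with real matrix entries and an affine gap model would change the numbers but not the structure of the calculation.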
4 Simulation of Evolution

To illustrate how the scoring function can be used, a variety of tools for generating msas were challenged with a set of protein families simulated following a Markovian model of evolution, and the outputs of each were evaluated using the CS tool. This provides, of course, only an approximate assessment of the msa tools themselves. A better assessment must come from actual experimental sequence data.

Random trees with a given structure and branch lengths and a random sequence at the root were generated. From this, sequence mutations, insertions and deletions were introduced according to the length of the branches of the tree. At each internal node a new sequence was thus generated. At the end of the simulation, only the sequences at the leaves are retained. Since both the places of insertions and deletions and the "real" tree are known, the correct msa is known as well. The retained sequences at the leaves can be given to different algorithms (MSA (Gupta et al., 1996; Lipman et al., 1989), MAP (Huang, 1994), ClustalW (Higgins and Sharp, 1989; Thompson et al., 1994) and the Probabilistic model (PAS) (Gonnet and Benner, 1996; Gonnet, 1994a)), and the score of the calculated msas can be compared to the score of the "real" (generated) msa using the CS measure.

The results for 3 combinations of trees and sizes are shown. These are representative of all other results (Figures 6-8). For every tree the upper bound (CS_max) was calculated and compared to each of the scores (third column). Higher scores or smaller differences mean better alignments. The rows are sorted into ascending order in this column.

Figure 6 shows the result for balanced binary trees with 16 leaves. The length of the
sequences is 300 amino acids and the average branch distance is 30 PAM (so the maximum PAM distance between two sequences is about 240 PAM). For this tree, MSA scored the best, followed by PAS, ClustalW and MAP. The alignments of MSA and PAS score even slightly better than the "real" alignment. The reason for this is that the real msas are not necessarily optimal. As a comparison, the score of a bad alignment (all the sequences aligned without deletions) was calculated in the last row (Dummy). The tree on the right is an example of a generated phylogenetic tree.

Figure 6: Comparison of different msa methods. The CS score (second column) is calculated using a TSP ordering. The upper bound is the CS score based on the optimal pairwise alignments. The smaller the difference between the upper bound and the CS score (third column), the better the alignment. [Table columns: Method; CS score; Upper bound − CS score. Rows: Upper bound, MSA, PAS, Real, ClustalW, MAP, Dummy.]

The next trees (Figure 7) are unbalanced with 30 sequences of length 300 and the average branch distance is 30 PAM (so the maximum PAM distance is about 300). Only
PAS, MAP and ClustalW were able to compute the alignments (in a reasonable time). In this case PAS did slightly better than ClustalW.

Figure 7: Comparison of different msa methods. The CS score (second column) is calculated using a TSP ordering. The upper bound is the CS score based on the optimal pairwise alignments. The smaller the difference between the upper bound and the CS score (third column), the better the alignment. [Table columns: Method; CS score; Upper bound − CS score. Rows: Upper bound, Real, PAS, ClustalW, MAP, Dummy.]

The trees of the last table shown here (Figure 8) had 50 sequences of length 300 and an average branch length of 15 PAM. In the simulation the PAS method was the best, followed by ClustalW. No other methods were able to calculate a msa in a reasonable time.

These experiments show how the CS measure can be used. Obviously the ultimate evaluation of tools must be done with real rather than simulated data.
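The simulation procedure used in this section can be sketched in a much reduced form. The version below evolves a random root sequence down a balanced binary tree using substitutions only. Insertions and deletions are omitted, residues are drawn uniformly, and the per-site redraw probability `pam / 100` is a crude assumption rather than a proper PAM mutation matrix, so this illustrates the shape of the experiment and not the model actually used.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids, drawn uniformly here

def mutate(seq, pam, rng):
    """Redraw each residue with probability pam/100 (crude stand-in for PAM)."""
    p = min(pam / 100.0, 1.0)
    return "".join(rng.choice(ALPHABET) if rng.random() < p else c for c in seq)

def simulate(depth, length, pam, rng):
    """Evolve a random root down a balanced binary tree; return the leaves."""
    level = ["".join(rng.choice(ALPHABET) for _ in range(length))]
    for _ in range(depth):
        # Every sequence on the current level produces two mutated children.
        level = [mutate(seq, pam, rng) for seq in level for _ in range(2)]
    return level

rng = random.Random(0)  # fixed seed so the sketch is reproducible
leaves = simulate(depth=4, length=300, pam=30, rng=rng)
print(len(leaves), len(leaves[0]))
```

With a depth of 4 this yields 16 leaf sequences of length 300, matching the size of the balanced trees described for Figure 6. Because no gaps are introduced, the "real" msa of these leaves is simply the sequences stacked column by column.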
Figure 8: Comparison of different msa methods. The CS score (second column) is calculated using a TSP ordering. The upper bound is the CS score based on the optimal pairwise alignments. The smaller the difference between the upper bound and the CS score (third column), the better the alignment. [Table columns: Method; CS score; Upper bound − CS score. Rows: Upper bound, Real, PAS, ClustalW, Dummy.]

5 Discussion

We have defined a new scoring function, called the CS measure, for the evaluation of msas that is based on a Markovian model of evolution. The frequently used SP measure, which calculates the sum of the scores of all pairwise alignments, ignores the structure of any associated phylogenetic tree and thus weights some evolutionary events more than others. This can be avoided by traversing the tree in a circular order, where all branches are traversed exactly twice (Eulerian tour). Any non-circular tour results in a longer path, because some branches are traversed more than twice, and hence some evolutionary events are counted more often, which corresponds to a lower probability. Therefore the shortest path through the tree, which is an Eulerian tour, yields the highest CS score. Such a tour can be calculated with any TSP algorithm.
The TSP application allows us to avoid the calculation of an evolutionary tree altogether; only the sequences at the leaves are needed, with their OPA or MPA scores. In addition, the CS measure yields a direct connection between a msa and the associated evolutionary tree, because the formula for the calculation of the score is exactly the same. This allows us to use the CS measure to construct and evaluate evolutionary trees. The CS measure based on the OPA scores gives an upper bound on the score of the optimal msa and evolutionary tree.

To illustrate the new scoring tool we simulated evolution by generating random trees according to a Markovian model with the corresponding msa. The CS score of the generated alignment was then compared to the score of alignments calculated by different methods (MSA, ClustalW, PAS, and MAP). For fewer than twenty sequences MSA and PAS gave the best scores, whereas for larger samples PAS and ClustalW calculated the best alignments. A better assessment must come from actual experimental sequence data, though.

With the new CS measure we now have the possibility of improving current algorithms and finding new heuristics by maximizing the scoring function. The description of the new methodologies exceeds the scope of this paper and will be available in (Gonnet and Korostensky, 1998).

Our scoring function and msa algorithms can be used freely through the CBRG server.
References

Altschul, S. F. (1989). Gap costs for multiple sequence alignment. J. Theoretical Biology, 138:297–309.

Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. (1978). A model for evolutionary change in proteins. In Dayhoff, M. O., editor, Atlas of Protein Sequence and Structure, volume 5, pages 345–352. National Biochemical Research Foundation, Washington DC.

Gonnet, G. H. (1994a). New algorithms for the computation of evolutionary phylogenetic trees. In Suhai, S., editor, Computational Methods in Genome Research, pages 153–161. Plenum Press, New York.

Gonnet, G. H. (1994b). A tutorial introduction to computational biochemistry using Darwin.

Gonnet, G. H. and Benner, S. A. (1996). Probabilistic ancestral sequences and multiple alignments. Fifth Scandinavian Workshop on Algorithm Theory, Reykjavik, July

Gonnet, G. H., Cohen, M. A., and Benner, S. A. (1992). Exhaustive matching of the entire protein sequence database. Science, 256:1443–1445.

Gonnet, G. H. and Korostensky, C. (1998). New heuristics for multiple sequence alignments. In preparation.
Gupta, S. K., Kececioglu, J., and Schaffer, A. A. (1996). Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. J. Computational Biology. To appear.

Higgins, D. and Sharp, P. (1989). Fast and sensitive multiple sequence alignments on a microcomputer. CABIOS, 5:151–153.

Huang, X. (1994). On global sequence alignment. CABIOS, 10(3):227–235.

Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F., and Wootton, J. C. (1993). Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science, 262:208–214.

Lipman, D. J., Altschul, S. F., and Kececioglu, J. D. (1989). A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. USA, 86:4412–4415.

Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48:443–453.

Thompson, J., Higgins, D., and Gibson, T. (1994). Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673–
More informationSubstitution matrices
Introduction to Bioinformatics Substitution matrices Jacques van Helden Jacques.van-Helden@univ-amu.fr Université d Aix-Marseille, France Lab. Technological Advances for Genomics and Clinics (TAGC, INSERM
More informationPhylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches
Int. J. Bioinformatics Research and Applications, Vol. x, No. x, xxxx Phylogenies Scores for Exhaustive Maximum Likelihood and s Searches Hyrum D. Carroll, Perry G. Ridge, Mark J. Clement, Quinn O. Snell
More informationBioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre
Bioinformatics Scoring Matrices David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Learning Objectives To explain the requirement
More informationPage 1. Evolutionary Trees. Why build evolutionary tree? Outline
Page Evolutionary Trees Russ. ltman MI S 7 Outline. Why build evolutionary trees?. istance-based vs. character-based methods. istance-based: Ultrametric Trees dditive Trees. haracter-based: Perfect phylogeny
More information3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT
3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode
More informationSequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University
Sequence Alignment: Scoring Schemes COMP 571 Luay Nakhleh, Rice University Scoring Schemes Recall that an alignment score is aimed at providing a scale to measure the degree of similarity (or difference)
More informationCopyright 2000 N. AYDIN. All rights reserved. 1
Introduction to Bioinformatics Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr Multiple Sequence Alignment Outline Multiple sequence alignment introduction to msa methods of msa progressive global alignment
More informationMoreover, the circular logic
Moreover, the circular logic How do we know what is the right distance without a good alignment? And how do we construct a good alignment without knowing what substitutions were made previously? ATGCGT--GCAAGT
More informationSequence Alignment Techniques and Their Uses
Sequence Alignment Techniques and Their Uses Sarah Fiorentino Since rapid sequencing technology and whole genomes sequencing, the amount of sequence information has grown exponentially. With all of this
More informationEvolutionary Tree Analysis. Overview
CSI/BINF 5330 Evolutionary Tree Analysis Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Distance-Based Evolutionary Tree Reconstruction Character-Based