# An Introduction to Sequence Similarity ( Homology ) Searching

Save this PDF as:

Size: px
Start display at page:

Download "An Introduction to Sequence Similarity ( Homology ) Searching"

## Transcription

1 An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same, or very similar, functions, so new sequences can be reliably assigned functions if homologous sequences with known functions can be identified. Homology is inferred based on sequence similarity, and many methods have been developed to identify sequences that have statistically significant similarity. This unit provides an overview of some of the basic issues in identifying similarity among sequences and points out other units in this chapter that describe specific programs that are useful for this task. Curr. Protoc. Bioinform. 27: C 2009 by John Wiley & Sons, Inc. Keywords: sequence similarity homology dynamic programming similarity-scoring matrices sequence alignment multiple alignment sequence evolution AN INTRODUCTION TO IDENTIFYING HOMOLOGOUS SEQUENCES Upon determining a new sequence, one common objective is to assign a function to that sequence, or perhaps multiple functions to various subsequences. Because homologous sequences, those related by common ancestry, generally have the same or very similar functions, their identification can be used to infer the function of the new sequence. The methods for identifying homologous sequences rely on finding similarities between sequences that are unlikely to happen by chance, thereby allowing the inference that the sequences have evolved from a common ancestor. The units in this Chapter all address aspects of this problem, finding significant similarity between new sequences and those that are currently in the databases of known sequences. This chapter introduction provides a primer on five fundamental issues related to similarity searches: (1) how to find the best alignment between two sequences; (2) methods for scoring an alignment; (3) speeding up database searches; (4) determining statistical significance; and (5) using information in multiple alignments to improve sensitivity. OPTIMAL SEQUENCE ALIGNMENTS Given any two sequences, there are an enormous number of ways they can be aligned if one allows for insertions and deletions (collectively referred to as indels or gaps) in the alignment. In order to assess whether the similarity score between two sequences is significant, one must obtain the highest possible score; otherwise one will underestimate the significance. Fortunately, there is a simple method, known as dynamic programming (DP), to determine the score of the best possible alignment between two sequences. Figure outlines the DP algorithm. The two sequences, A and B, are written across the top and down the side of the DP matrix. Each element of the matrix, S(i,j), represents the score of the best alignment from the beginning of each sequence up to residues a i,in sequence A, and b j, in sequence B. That element is determined by comparing just three numbers, which represent the three possible end points of those alignments: a i is aligned to b j ;a i is aligned to a gap in sequence B; b j is aligned to a gap in sequence A. The three numbers to be compared include the addition of the gap score, δ (which is always Current Protocols in Bioinformatics , September 2009 Published online September 2009 in Wiley Interscience ( DOI: / bi0301s27 Copyright C 2009 John Wiley & Sons, Inc

2 Seq B: b 1 Seq A: a 1 a 2 b 2 ai an b j b m S(i 1, j 1) S(i, j 1) S(i 1, j) S(i, j) S(i,j) max S(i 1, j 1) score(a i, b j ) S(i, j 1) S(i 1, j) Figure Dynamic programming algorithm for optimum sequence alignment. The two sequences are written across the top and along the right side of the matrix. The score of each element is determined by the simple rules shown for the enlarged section and described by the equations below it. (The top row and left column have special rules as described in the text.) The score of the best global alignment is the element S(n,m) and the alignment with that score can be obtained by backtracking through the matrix, determining the path that generated the score at each element. An Introduction to Sequence Similarity Searching negative because gaps are always penalized), to the elements directly above, S(i,j-1), and to the left, S(i-1,j), corresponding to the gaps in sequence B and A, respectively. The other number is the sum of the score of aligning a i with b j, which depends on what those residues are (see next section) and the score of the best alignment for the preceding subsequences, S(i-1,j-1). The top row and first column require special treatment, but the rule is still simple. Aligning a 1 with b 1 just gets score(a 1,b 1 ), but aligning a 1 with b j Current Protocols in Bioinformatics

3 requires j-1 gaps in sequence B, so S(1,j)=(j-1)δ + score(a 1,b j ). Once the top row and first column are initialized, the entire matrix can be filled following the simple rule shown in Figure 3.1.1, and the score of the best possible alignment of the two sequences is the element in the lower right corner of the matrix, S(n,m). The actual alignment that gives that score can be found by tracing backwards through the matrix, starting from the final element, to identify the preceding element that gave rise to each score. Keep in mind that there may be more than one alignment with the highest possible score, but most programs only show one of them. The result of the DP algorithm described above is a global alignment of the two sequences, where each residue of each sequence is included in the alignment and contributes to the score. However, often one is interested in the best local alignment, an alignment that may contain only a portion of one, or both, sequences. For example, two proteins may have a common domain that one would like to identify but the remaining parts of the sequences are unrelated. Alternatively, perhaps one is comparing a newly sequenced gene to the entire genome of another species and you only expect it to match the segment corresponding to the homologous gene. A simple modification of the basic DP algorithm can be used to identify the best local alignment, as first deduced by Smith and Waterman (1981). The simple change is that only positive values are kept in the DP matrix, which just means that zero is one of the values to be compared and the maximum taken for S(i,j). Now the score of the best alignment is not necessarily the element S(n,m), but may occur anywhere within the matrix. Once the highest score is found, the alignment that generated that score can be found by tracing backwards through the matrix, as before, but stopping when the first zero is encountered. One additional modification to the basic DP algorithm is often used. It is known from extensive studies of protein sequences that gaps of multiple residues are much more frequent than the same number of separate gaps. For example, a single deletion event might remove three residues at once, and that event is much more likely than for three separate events that each delete one residue. To capture this effect, one can use two gap penalties, one that counts the occurrence of a gap, usually referred to as the gap-opening penalty, and a different, lower one, that is proportional to the length of the gap. This can be easily incorporated into the DP algorithm with only a minor increase in the time required to find the best alignment (Fitch and Smith, 1983). The DP algorithm is guaranteed to give the highest-scoring alignment of the two sequences if the only evolutionary processes allowed are substitutions and indels. While real evolutionary processes also include duplications, transpositions, and inversions, those are not considered, as they cannot be handled by the DP approach and would drastically increase the time required to find the optimal alignment. However, it is generally seen that substitutions and indels are much more common than the other processes, so finding the best aligned considering only those processes will most frequently indicate the true optimum alignment. SCORING SEQUENCE SIMILARITY UNIT 3.5 discusses the construction of protein similarity matrices and considerations to take into account when choosing which to use for a specific search. This section briefly describes some of fundamental issues regarding similarity scores. The score of an alignment is the sum of the scores for each position in the alignment, as we saw in the DP algorithm for finding the highest-scoring alignment. For every pair of residues, which might be RNA, DNA, or protein, there is a score defined such that the more similar the sequences are, the higher the score. For RNA and DNA, the scores often just distinguish between matches, where the two residues are the same, and mismatches. However, sometimes it is useful to distinguish between mismatches that are transitions, changes from one purine or pyrimidine to the other, from those that are transversions, changes between a purine and a pyrimidine. For proteins, the situation is more complex and one needs a Current Protocols in Bioinformatics

4 score for all possible amino acid pairs and, ideally, one that reflects the probability of one amino acid being replaced by another during the course of evolution. Replacements are governed by two processes. The first is the mutation that changes one amino acid to another, and this will be much more likely for amino acids that have similar codons. For example, a single transition mutation can change a His to a Tyr (CAT to TAT), but three transversions would be necessary to change a His to a Met (CAT to ATG). The second process is that the mutation must survive selection for it to be observed, and this requires that the new amino acid can substitute for the previous one without disrupting the protein s function. This is more likely if the two amino acids have similar properties, such as size, charge, and hydrophobicity. Which properties are most important may depend on where the amino acid is within the protein sequence and what role it plays in determining the structure and function of the protein. Some positions in a protein may be highly constrained and only a few, perhaps only one, amino acids will work, whereas other positions may tolerate many different amino acids and therefore be highly variable in related protein sequences. This implies that one would really like to have different scoring matrices depending on the constraints at particular positions in the protein, but this would require many more parameters, and often one is lacking the information needed to choose the appropriate matrix. This problem can be partially addressed using multiple alignments of protein families, as described below. However, for general database searches, where one hopes to identify homologous sequences in the collection of known sequences, a general similarity-scoring matrix, which is the average over various types of constrained positions, is often utilized. Most commonly used similarity matrices are empirically based on reliable alignments of protein families. For instance, the BLOSUM matrices (UNIT 3.5) are based on the alignments in the BLOCKS protein family alignments (UNIT 2.2). These are reliable alignments of protein domains across many different protein families. The score assigned to a specific pair of residues is the log-odds of their co-occurrence in aligned positions compared to what would be expected by chance given their individual frequencies. So for two amino acids, a and b, the similarity score is: score ( ab, )= ln f( a, b) f()() a f b where f(a,b) is the observed frequency at which those amino acids are aligned and f(a) and f(b) are their frequencies in the entire database. If the two amino acids occur together at a frequency expected by chance, which is the product of the independent frequencies, then the score is 0. Amino acids that occur together more often than expected by chance get positive scores and ones that occur less frequently get negative scores. The score for aligning an amino acid with itself is always the highest score for that amino acid, as it is more likely to occur than any substitution. Positive-scoring pairs represent amino acids with similar properties, and they are often observed to substitute for one another, whereas negative scoring pairs are dissimilar amino acids that are less likely to substitute for each other. It is important to note that the expected score of a random alignment is negative. In addition to the amino acid similarity scores, one also needs to have penalties for indels in the alignment, and those values are also based empirically on many reliable alignments over many different protein families. FAST SEARCHING METHODS An Introduction to Sequence Similarity Searching The DP method described above is guaranteed to find the highest-scoring alignment between two sequences. The time it takes to complete the alignment is proportional to the number of DP matrix elements to compute, which is the product of the sequence lengths. While this is very fast for comparing any two sequences of reasonable length, it is Current Protocols in Bioinformatics

5 not practical for searching the current sequence databases, which contain many millions of sequences and many billions of residues. Therefore methods have been developed that can search entire databases much faster. While these methods do not guarantee finding the absolute best alignments, they have been finely optimized so that they have very high sensitivities and generally do obtain the optimal, or near-optimal, alignments. The most commonly used method is BLAST (Altschul et al., 1990; UNITS 3.3, 3.4 & 3.11). BLAST is convenient to use through the online service (UNITS 3.3 & 3.4), has access to the major DNA and protein databases, and provides convenient tools for displaying and analyzing the output. One can also download BLAST to a local computer for in-house use (UNIT 3.11). But most importantly BLAST has been shown to be very sensitive, giving results nearly equivalent to running the Smith/Waterman DP algorithm on the whole database but in a small fraction of the time. As a brief and approximate description of how BLAST works, consider the DP matrix of Figure Most of the time is spent determining the scores of elements that do not contribute to the final, optimum alignment. BLAST achieves its speed by eliminating most of the computations and focusing its effort on the highest-scoring paths through the matrix. It does this by creating an index of the words, short subsequences, from the query sequence, and high-scoring matches to those words can be identified very quickly in the database. Those initial matches, or hits, are extended using DP but with cutoffs employed so that when any extension becomes unlikely to lead to a significant match it is terminated. The matches that do lead to significant alignments are extended as far as possible to give a result essentially equivalent to that obtained by the full DP algorithm. Since most sequences in the database, and most of the possible alignments even to the most similar sequences, are eliminated quickly from further consideration, BLAST achieves an enormous increase in speed with only a small loss of sensitivity. THE SIGNIFICANCE OF AN ALIGNMENT SCORE Obtaining a score for an alignment does not, by itself, tell you whether it is significant. One needs to determine what is the probability of observing such a score by chance, given the scoring system used and the lengths of the sequences being compared. It is important to realize that the distribution of optimum alignment scores is not normal. If one were to consider the scores of all possible alignments, one would expect a normal distribution. However, we are only considering the highest-scoring alignments to each sequence in the database, and the distribution of those maximum, or extreme, values will be quite different from the distribution of scores for all alignments. Consider the following simple example. If one throws two dice, the score distribution follows the binomial distribution, and the probability of getting a 12 is 1/36. But if one throws two pair of dice, and records only the highest score, the distribution is quite different, with the probability of getting a 12 nearly twice as high, 71/36 2 to be precise. When searching a large database with a query sequence, one is interested only in the highest-scoring alignments, those that are mostly likely to represent homologous sequences. However, to answer the question of how high a score is required to be unlikely to have occurred by chance, one must know the distribution of highest scores expected by chance. Karlin and Altschul (1990) showed that for alignments without gaps, the distribution of highest scores follows a Gumbel distribution, a type of extreme value distribution. Furthermore, they showed how to compute the dependency of that score distribution on the scoring system that is used and the length of the query and database sequences. In addition, while their theory only holds analytically for ungapped alignments, empirical evidence indicates that it holds quite well for gapped alignments too. This allows one to compute a statistical significance for the best scoring alignments in a database search and from that infer whether the best matches are likely to be homologous sequences or could have arisen by chance. The BLAST program computes these values as part of the output, with Current Protocols in Bioinformatics

6 most people focusing on the E-value, which is the number matches with equal or greater scores that are expected by chance. When E-values are much less than one, they are essentially the same as p-values, the probability of an equal or greater score by chance, but E-values can exceed one, whereas p-values cannot. MAKING AND USING MULTIPLE SEQUENCE ALIGNMENTS The previous sections of this unit have focused on aligning pairs of sequences, or more generally searching an entire database for high-scoring alignments to a query sequence, which requires many pairwise comparisons. However, if one has several members of a protein family, an alignment of those sequences can provide additional information about the sequence constraints at different positions of the proteins. For example, some positions may be critical for the function of those proteins and perhaps only one, or a few, amino acids are allowed at those positions. If that is the case, then a general similarityscoring matrix is not appropriate and one should utilize the information available from the multiple alignments to constrain the search for homologous proteins. The two main issues are how the multiple alignments are made and how they are used to identify new members of the protein family. Several units in this chapter and the previous one address various aspects of the construction and use of multiple sequence alignments. UNIT 3.7 provides an overview of multiple sequence alignment. One can generate optimal multiple alignments by extending the DP method to multiple sequences, following the same rules as described above but expanded to multiple dimensions, one for each sequence. However, the time required to fill in the DP matrix is proportional to the product of the length of all the sequences, which becomes impractical for more than a few sequences (Lipman et al., 1989). Practical methods for multiple sequence alignment use one of two approaches. One is a progressive alignment strategy where pairs of sequences are aligned and then new sequences are added to that alignment, or alignments are aligned to each other, but all steps only involve pairwise comparisons that can be accomplished efficiently using the DP strategy described. The Pileup program from GCG (UNIT 3.6) uses that strategy, as does the ClustalW program (UNIT 2.3), which is probably the most popular method for constructing multiple alignments. Psi-BLAST also uses a similar iterative strategy within the BLAST suite of programs (UNIT 3.4) to build multiple alignments from the matches identified by BLAST. The other strategy uses an interative refinement approach, where an initial approximate alignment is built and then sequences are realigned to the resulting model until the method converges to a final multiple alignment. The method is based on an expectation maximization (EM)algorithm,similar to that used in MEME (UNIT 2.4) but allowing gaps in the alignment. The EM algorithm is at the heart of the Hidden Markov Model (HMM) approach to protein family models first developed by the Haussler group (Krogh et al., 1994; Eddy, 1996; overview in UNIT 3.7 & APPENDIX 3A). In an HMM each position of the alignment is represented by a probability distribution over all of the amino acids and there are probabilities assigned for insertions and deletions. This model captures, in a probabilistic fashion, the natural variability and constraints that are observed in a protein family. The T-Coffee approach (UNIT 3.8) is somewhat different in that it combines, through a consistency-based scoring system, pairwise alignments that can be obtained from a variety of different and independent approaches. For example, if structural information is available and useful for determining the correct alignment that can be incorporated along with other types of information. An Introduction to Sequence Similarity Searching Once a multiple alignment is built, it can be used to identify new sequences that are members of the family. Gribskov et al. (1987) first described such an approach where the multiple alignment is converted to a profile, and new sequences are scored against it using a DP algorithm and a sum-of-pairs method. This is equivalent to finding the best Current Protocols in Bioinformatics

7 alignment of the new sequence to the multiple sequence alignment, where the scoring is the average similarity score to each sequence in the alignment. This is also how ClustalW works internally, as it is building up the multiple alignment (UNIT 2.3). A new sequence can also be aligned to an HMM for a protein family using a version of the DP algorithm, the same method that is used in the iterative realignment for building the HMM. Pfam (UNIT 2.5) is a database of protein families represented as HMMs, and a new sequence can be compared to the database to determine how similar it is to each family and if it can be inferred to belong to any of the families. The InterPro resource of databases and analysis tools (UNIT 2.7) also facilitates the identification of protein families that may be homologous to a new sequence of interest. SUMMARY When one obtains a new sequence and wants to assign a function to it, identifying homologous sequences with known function is a reliable strategy. Two complementary approaches can be easily employed and are generally very useful. The first is to search the databases of known sequences, using programs like BLAST, to determine if there are any that are sufficiently similar to be homologous, and if those sequences have known functions such that information can be transferred with reasonable reliability. One can also search the databases of known protein families that include the extra information from multiple sequence alignments, such as Pfam and the InterPro databases. While those protein family databases are less comprehensive than the single sequence databases, simply because many sequences do not have known family membership, they can be more sensitive at identifying distantly related sequences that may be missed without knowledge of the specific constraints associated with specific families. Therefore, a combined approach that utilizes both types of resources can be most effective in identifying homologous sequences for a new sequence of interest. LITERATURE CITED Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J Basic local alignment search tool. J. Mol. Biol. 215: Eddy, S.R Hidden Markov models. Curr. Opin. Struct. Biol. 6: Fitch, W. and Smith, T Optimal sequence alignments. Proc. Natl. Acad. Sci. U.S.A. 80: Gribskov, M., McLachlan, A.D., and Eisenberg, D Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84: Karlin, S. and Altschul, S.F Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. U.S.A. 87: Krogh, A., Brown, M., Mian, I.S., Sjölander, K., and Haussler, D Hidden Markov models in computational biology. Applications to protein modeling. J. Mol. Biol. 235: Lipman, D.J., Altschul, S.F., and Kececioglu, J.D A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. U.S.A. 86: Smith, T.F. and Waterman, M.S Identification of common molecular subsequences. J. Mol. Biol. 147: Current Protocols in Bioinformatics

### Local Alignment Statistics

Local Alignment Statistics Stephen Altschul National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, MD Central Issues in Biological Sequence Comparison

### MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

MATHEMATICAL MODELING AND THE HUMAN GENOME Hilary S. Booth Australian National University, Australia Keywords: Human genome, DNA, bioinformatics, sequence analysis, evolution. Contents 1. Introduction:

### Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

### Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

Sequence lignment (chapter ) he biological problem lobal alignment Local alignment Multiple alignment Background: comparative genomics Basic question in biology: what properties are shared among organisms?

### Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm Alignment scoring schemes and theory: substitution matrices and gap models 1 Local sequence alignments Local sequence alignments are necessary

### Sequence analysis and comparison

The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

### O 3 O 4 O 5. q 3. q 4. Transition

Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

### Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

### Sequence analysis and Genomics

Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

### Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

### Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Introduction to Bioinformatics online course : IBT Jonathan Kayondo Learning Objectives Understand

### BLAST. Varieties of BLAST

BLAST Basic Local Alignment Search Tool (1990) Altschul, Gish, Miller, Myers, & Lipman Uses short-cuts or heuristics to improve search speed Like speed-reading, does not examine every nucleotide of database

### Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Alignment principles and homology searching using (PSI-)BLAST Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) http://ibivu.cs.vu.nl Bioinformatics Nothing in Biology makes sense except in

### Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics GCBA815, Fall 2013 Week3: Blast Algorithm, theory and practice Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and Systems Biology

### Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Sequence Analysis 17: lecture 5 Substitution matrices Multiple sequence alignment Substitution matrices Used to score aligned positions, usually of amino acids. Expressed as the log-likelihood ratio of

### Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Bioinformatics Scoring Matrices David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Learning Objectives To explain the requirement

### Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Sequence Alignment: Scoring Schemes COMP 571 Luay Nakhleh, Rice University Scoring Schemes Recall that an alignment score is aimed at providing a scale to measure the degree of similarity (or difference)

### Pairwise sequence alignment

Department of Evolutionary Biology Example Alignment between very similar human alpha- and beta globins: GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

### Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI - informal inductive proof of best alignment path onsider the last step in

### C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment

C E N T R E F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U Introduction to bioinformatics 2007 Lecture 5 Pair-wise Sequence Alignment Bioinformatics Nothing in Biology makes sense except in

### Lecture Notes: Markov chains

Computational Genomics and Molecular Biology, Fall 5 Lecture Notes: Markov chains Dannie Durand At the beginning of the semester, we introduced two simple scoring functions for pairwise alignments: a similarity

### BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

BLAST Database Searching BME 110: CompBio Tools Todd Lowe April 8, 2010 Admin Reading: Read chapter 7, and the NCBI Blast Guide and tutorial http://www.ncbi.nlm.nih.gov/blast/why.shtml Read Chapter 8 for

### HMMs and biological sequence analysis

HMMs and biological sequence analysis Hidden Markov Model A Markov chain is a sequence of random variables X 1, X 2, X 3,... That has the property that the value of the current state depends only on the

### Overview Multiple Sequence Alignment

Overview Multiple Sequence Alignment Inge Jonassen Bioinformatics group Dept. of Informatics, UoB Inge.Jonassen@ii.uib.no Definition/examples Use of alignments The alignment problem scoring alignments

### Local Alignment: Smith-Waterman algorithm

Local Alignment: Smith-Waterman algorithm Example: a shared common domain of two protein sequences; extended sections of genomic DNA sequence. Sensitive to detect similarity in highly diverged sequences.

### CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

CMPS 6630: Introduction to Computational Biology and Bioinformatics Structure Comparison Protein Structure Comparison Motivation Understand sequence and structure variability Understand Domain architecture

### Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

1 Bioinformatics: In-depth PROBABILITY & STATISTICS Spring Semester 2011 University of Zürich and ETH Zürich Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). Dr. Stefanie Muff

### Sequence Analysis '17- lecture 8. Multiple sequence alignment

Sequence Analysis '17- lecture 8 Multiple sequence alignment Ex5 explanation How many random database search scores have e-values 10? (Answer: 10!) Why? e-value of x = m*p(s x), where m is the database

### Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics GCBA815, Fall 2015 Week-4 BLAST Algorithm Continued Multiple Sequence Alignment Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and

### Bioinformatics Exercises

Bioinformatics Exercises AP Biology Teachers Workshop Susan Cates, Ph.D. Evolution of Species Phylogenetic Trees show the relatedness of organisms Common Ancestor (Root of the tree) 1 Rooted vs. Unrooted

### Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Massachusetts Institute of Technology 6.877 Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution 1. Rates of amino acid replacement The initial motivation for the neutral

### 17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on:

17 Non-collinear alignment This exposition is based on: 1. Darling, A.E., Mau, B., Perna, N.T. (2010) progressivemauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 5(6):e11147.

### Do Aligned Sequences Share the Same Fold?

J. Mol. Biol. (1997) 273, 355±368 Do Aligned Sequences Share the Same Fold? Ruben A. Abagyan* and Serge Batalov The Skirball Institute of Biomolecular Medicine Biochemistry Department NYU Medical Center

### Generalized Affine Gap Costs for Protein Sequence Alignment

PROTEINS: Structure, Function, and Genetics 32:88 96 (1998) Generalized Affine Gap Costs for Protein Sequence Alignment Stephen F. Altschul* National Center for Biotechnology Information, National Library

### Lecture 1, 31/10/2001: Introduction to sequence alignment. The Needleman-Wunsch algorithm for global sequence alignment: description and properties

Lecture 1, 31/10/2001: Introduction to sequence alignment The Needleman-Wunsch algorithm for global sequence alignment: description and properties 1 Computational sequence-analysis The major goal of computational

### SEPA: Approximate Non-Subjective Empirical p-value Estimation for Nucleotide Sequence Alignment

SEPA: Approximate Non-Subjective Empirical p-value Estimation for Nucleotide Sequence Alignment Ofer Gill and Bud Mishra Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street,

### Gibbs Sampling Methods for Multiple Sequence Alignment

Gibbs Sampling Methods for Multiple Sequence Alignment Scott C. Schmidler 1 Jun S. Liu 2 1 Section on Medical Informatics and 2 Department of Statistics Stanford University 11/17/99 1 Outline Statistical

### Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético

Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético Paulo Mologni 1, Ailton Akira Shinoda 2, Carlos Dias Maciel

### Ch. 9 Multiple Sequence Alignment (MSA)

Ch. 9 Multiple Sequence Alignment (MSA) - gather seqs. to make MSA - doing MSA with ClustalW - doing MSA with Tcoffee - comparing seqs. that cannot align Introduction - from pairwise alignment to MSA -

### 20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, 2008 4 Pairwise alignment We will discuss: 1. Strings 2. Dot matrix method for comparing sequences 3. Edit distance 4. Global and local alignment

### Administration. ndrew Torda April /04/2008 [ 1 ]

ndrew Torda April 2008 Administration 22/04/2008 [ 1 ] Sprache? zu verhandeln (Englisch, Hochdeutsch, Bayerisch) Selection of topics Proteins / DNA / RNA Two halves to course week 1-7 Prof Torda (larger

### Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche The molecular structure of a protein can be broken down hierarchically. The primary structure of a protein is simply its

### BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University Measures of Sequence Similarity Alignment with dot

### Hidden Markov Models

Hidden Markov Models Slides revised and adapted to Bioinformática 55 Engª Biomédica/IST 2005 Ana Teresa Freitas Forward Algorithm For Markov chains we calculate the probability of a sequence, P(x) How

### Optimal spaced seeds for faster approximate string matching

Optimal spaced seeds for faster approximate string matching Martin Farach-Colton Gad M. Landau S. Cenk Sahinalp Dekel Tsur Abstract Filtering is a standard technique for fast approximate string matching

### Optimal spaced seeds for faster approximate string matching

Optimal spaced seeds for faster approximate string matching Martin Farach-Colton Gad M. Landau S. Cenk Sahinalp Dekel Tsur Abstract Filtering is a standard technique for fast approximate string matching

### Hidden Markov Models and Their Applications in Biological Sequence Analysis

Hidden Markov Models and Their Applications in Biological Sequence Analysis Byung-Jun Yoon Dept. of Electrical & Computer Engineering Texas A&M University, College Station, TX 77843-3128, USA Abstract

### Sequence-specific sequence comparison using pairwise statistical significance

Graduate Theses and Dissertations Graduate College 2009 Sequence-specific sequence comparison using pairwise statistical significance Ankit Agrawal Iowa State University Follow this and additional works

### Introduction to Computation & Pairwise Alignment

Introduction to Computation & Pairwise Alignment Eunok Paek eunokpaek@hanyang.ac.kr Algorithm what you already know about programming Pan-Fried Fish with Spicy Dipping Sauce This spicy fish dish is quick

### Biosequence Alignment 徐鹰佐治亚大学生化系 吉林大学计算机学院

Biosequence Alignment 徐鹰佐治亚大学生化系 吉林大学计算机学院 Bio sequences Sequences could be DNA, protein and RNA sequences DNA sequence (consisting of 4 letters: A, C, G, T) Ccgtacgtacgtagagtgctagtctagtcgtagcgccgtagtcgatcgtgtg

### 08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

BLAST Multiple Sequence Alignments: Clustal Omega What does basic BLAST do (e.g. what is input sequence and how does BLAST look for matches?) Susan Parrish McDaniel College Multiple Sequence Alignments

### Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror

Protein structure prediction CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror 1 Outline Why predict protein structure? Can we use (pure) physics-based methods? Knowledge-based methods Two major

### Malik Hammoutène - SSC. Sequence Analysis

CONTENTS Introduction Analysis of individual sequences Secondary structure prediction Pairwise sequence comparison Database searching I: single heuristic algorithms Alignment and search statistics Multiple

### Multiple Sequence Alignment

Multiple equence lignment Four ami Khuri Dept of omputer cience an José tate University Multiple equence lignment v Progressive lignment v Guide Tree v lustalw v Toffee v Muscle v MFFT * 20 * 0 * 60 *

### Introduction to Bioinformatics Online Course: IBT

Introduction to Bioinformatics Online Course: IBT Multiple Sequence Alignment Building Multiple Sequence Alignment Lec1 Building a Multiple Sequence Alignment Learning Outcomes 1- Understanding Why multiple

### Sequence Bioinformatics. Multiple Sequence Alignment Waqas Nasir

Sequence Bioinformatics Multiple Sequence Alignment Waqas Nasir 2010-11-12 Multiple Sequence Alignment One amino acid plays coy; a pair of homologous sequences whisper; many aligned sequences shout out

### Probalign: Multiple sequence alignment using partition function posterior probabilities

Sequence Analysis Probalign: Multiple sequence alignment using partition function posterior probabilities Usman Roshan 1* and Dennis R. Livesay 2 1 Department of Computer Science, New Jersey Institute

### Biological Sequences and Hidden Markov Models CPBS7711

Biological Sequences and Hidden Markov Models CPBS7711 Sept 27, 2011 Sonia Leach, PhD Assistant Professor Center for Genes, Environment, and Health National Jewish Health sonia.leach@gmail.com Slides created

### An ant colony algorithm for multiple sequence alignment in bioinformatics

An ant colony algorithm for multiple sequence alignment in bioinformatics Jonathan Moss and Colin G. Johnson Computing Laboratory University of Kent at Canterbury Canterbury, Kent, CT2 7NF, England. C.G.Johnson@ukc.ac.uk

### DNA and protein databases. EMBL/GenBank/DDBJ database of nucleic acids

Database searches 1 DNA and protein databases EMBL/GenBank/DDBJ database of nucleic acids 2 DNA and protein databases EMBL/GenBank/DDBJ database of nucleic acids (cntd) 3 DNA and protein databases SWISS-PROT

### Homology and Information Gathering and Domain Annotation for Proteins

Homology and Information Gathering and Domain Annotation for Proteins Outline Homology Information Gathering for Proteins Domain Annotation for Proteins Examples and exercises The concept of homology The

### Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics

Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics Jianlin Cheng, PhD Department of Computer Science University of Missouri, Columbia

### Making sense of score statistics for sequence alignments

Marco Pagni is staff member at the Swiss Institute of Bioinformatics. His research interests include software development and handling of databases of protein domains. C. Victor Jongeneel is Director of

### G4120: Introduction to Computational Biology

ICB Fall 2003 G4120: Introduction to Computational Biology Oliver Jovanovic, Ph.D. Columbia University Department of Microbiology Copyright 2003 Oliver Jovanovic, All Rights Reserved. Bioinformatics and

### Predicting RNA Secondary Structure Using Profile Stochastic Context-Free Grammars and Phylogenic Analysis

Fang XY, Luo ZG, Wang ZH. Predicting RNA secondary structure using profile stochastic context-free grammars and phylogenic analysis. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 23(4): 582 589 July 2008

### Multiple Sequence Alignment (MSA) BIOL 7711 Computational Bioscience

Consortium for Comparative Genomics University of Colorado School of Medicine Multiple Sequence Alignment (MSA) BIOL 7711 Computational Bioscience Biochemistry and Molecular Genetics Computational Bioscience

### POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

### Bioinformatics. Proteins II. - Pattern, Profile, & Structure Database Searching. Robert Latek, Ph.D. Bioinformatics, Biocomputing

Bioinformatics Proteins II. - Pattern, Profile, & Structure Database Searching Robert Latek, Ph.D. Bioinformatics, Biocomputing WIBR Bioinformatics Course, Whitehead Institute, 2002 1 Proteins I.-III.

### Algorithms in Bioinformatics: A Practical Introduction. Sequence Similarity

Algorithms in Bioinformatics: A Practical Introduction Sequence Similarity Earliest Researches in Sequence Comparison Doolittle et al. (Science, July 1983) searched for platelet-derived growth factor (PDGF)

### Supporting Text 1. Comparison of GRoSS sequence alignment to HMM-HMM and GPCRDB

Structure-Based Sequence Alignment of the Transmembrane Domains of All Human GPCRs: Phylogenetic, Structural and Functional Implications, Cvicek et al. Supporting Text 1 Here we compare the GRoSS alignment

### Lecture 5: September Time Complexity Analysis of Local Alignment

CSCI1810: Computational Molecular Biology Fall 2017 Lecture 5: September 21 Lecturer: Sorin Istrail Scribe: Cyrus Cousins Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes

### Example questions. Z:\summer_10_teaching\bioinfo\Beispiel_frage_bioinformatik.doc [1 / 5]

Example questions for Bioinformatics, first semester half Sommersemester 00 ote The schriftliche Klausur wurde auf deutsch geschrieben The questions will be based on material from the Übungen and the Lectures.

### HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

I529: Machine Learning in Bioinformatics (Spring 2017) HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington

### The PRALINE online server: optimising progressive multiple alignment on the web

Computational Biology and Chemistry 27 (2003) 511 519 Software Note The PRALINE online server: optimising progressive multiple alignment on the web V.A. Simossis a,b, J. Heringa a, a Bioinformatics Unit,

### Subfamily HMMS in Functional Genomics. D. Brown, N. Krishnamurthy, J.M. Dale, W. Christopher, and K. Sjölander

Subfamily HMMS in Functional Genomics D. Brown, N. Krishnamurthy, J.M. Dale, W. Christopher, and K. Sjölander Pacific Symposium on Biocomputing 10:322-333(2005) SUBFAMILY HMMS IN FUNCTIONAL GENOMICS DUNCAN

### Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

Grundlagen der Bioinformatik, SS 10, D. Huson, April 12, 2010 1 1 Introduction Grundlagen der Bioinformatik Summer semester 2010 Lecturer: Prof. Daniel Huson Office hours: Thursdays 17-18h (Sand 14, C310a)

### Substitution Matrices

C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E [1] Substitution matrices Sequence analysis 2006 Substitution Matrices Introduction to bioinformatics 2007 Lecture 8 C E N T R F

### Sequence Alignment. Johannes Starlinger

Sequence Alignment Johannes Starlinger his Lecture Approximate String Matching Edit distance and alignment Computing global alignments Local alignment Johannes Starlinger: Bioinformatics, Summer Semester

### A metric approach for. comparing DNA sequences

A metric approach for comparing DNA sequences H. Mora-Mora Department of Computer and Information Technology University of Alicante, Alicante, Spain M. Lloret-Climent Department of Applied Mathematics.

### Introduction to Hidden Markov Models for Gene Prediction ECE-S690

Introduction to Hidden Markov Models for Gene Prediction ECE-S690 Outline Markov Models The Hidden Part How can we use this for gene prediction? Learning Models Want to recognize patterns (e.g. sequence

### Pacific Symposium on Biocomputing 4: (1999)

OPTIMIZING SMITH-WATERMAN ALIGNMENTS ROLF OLSEN, TERENCE HWA Department of Physics, University of California at San Diego La Jolla, CA 92093-0319 email: rolf@cezanne.ucsd.edu, hwa@ucsd.edu MICHAEL L ASSIG

### Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST Introduction Bioinformatics is a powerful tool which can be used to determine evolutionary relationships and

### Algorithms in Bioinformatics I, ZBIT, Uni Tübingen, Daniel Huson, WS 2003/4 1

Algorithms in Bioinformatics I, ZBIT, Uni Tübingen, Daniel Huson, WS 2003/4 1 Algorithms in Bioinformatics I Winter Semester 2003/4, Center for Bioinformatics Tübingen, WSI-Informatik, Universität Tübingen

### On the optimality of the standard genetic code: the role of stop codons

On the optimality of the standard genetic code: the role of stop codons Sergey Naumenko 1*, Andrew Podlazov 1, Mikhail Burtsev 1,2, George Malinetsky 1 1 Department of Non-linear Dynamics, Keldysh Institute

### SCIENTIFIC EVIDENCE TO SUPPORT THE THEORY OF EVOLUTION. Using Anatomy, Embryology, Biochemistry, and Paleontology

SCIENTIFIC EVIDENCE TO SUPPORT THE THEORY OF EVOLUTION Using Anatomy, Embryology, Biochemistry, and Paleontology Scientific Fields Different fields of science have contributed evidence for the theory of

### Phylogenetic Tree Reconstruction

I519 Introduction to Bioinformatics, 2011 Phylogenetic Tree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Evolution theory Speciation Evolution of new organisms is driven

### Bioinformatics. Can Keşmir Theoretical Biology/Bioinformatics, UU

Bioinformatics Can Keşmir Theoretical Biology/Bioinformatics, UU 2013 i c Utrecht University, 2013 Ebook publically available at: http://theory.bio.uu.nl/bpa/bioinf2013.pdf ii Contents 1 Introduction to

### Outline DP paradigm Discrete optimisation Viterbi algorithm DP: 0 1 Knapsack. Dynamic Programming. Georgy Gimel farb

Outline DP paradigm Discrete optimisation Viterbi algorithm DP: Knapsack Dynamic Programming Georgy Gimel farb (with basic contributions by Michael J. Dinneen) COMPSCI 69 Computational Science / Outline

### Quantitative Understanding in Biology Module III: Linear Difference Equations Lecture I: Introduction to Linear Difference Equations

Quantitative Understanding in Biology Module III: Linear Difference Equations Lecture I: Introduction to Linear Difference Equations Introductory Remarks This section of the course introduces dynamic systems;

### NUCLEOTIDE SUBSTITUTIONS AND THE EVOLUTION OF DUPLICATE GENES

Conery, J.S. and Lynch, M. Nucleotide substitutions and evolution of duplicate genes. Pacific Symposium on Biocomputing 6:167-178 (2001). NUCLEOTIDE SUBSTITUTIONS AND THE EVOLUTION OF DUPLICATE GENES JOHN

### Database search programs are one of the most important tools for analysis of DNA and protein sequences

MICROBIAL & COMPARATIVE GENOMICS Volume 1, Number 4, 1996 Mary Ann Liebert, Inc. Fast Comparison of a DNA Sequence with a Protein Sequence Database XIAOQIU HUANG ABSTRACT We describe a computer program,

### Evidence of Evolution

Evidence of Evolution Biogeography The Age of Earth and Fossils Ancient artiodactyl Modern whale Ancestors of Whales Ambulocetus could both swim in shallow water and walk on land. Rodhocetus probably spent

### BLAT The BLAST-Like Alignment Tool

Resource BLAT The BLAST-Like Alignment Tool W. James Kent Department of Biology and Center for Molecular Biology of RNA, University of California, Santa Cruz, Santa Cruz, California 95064, USA Analyzing

### EBI web resources II: Ensembl and InterPro

EBI web resources II: Ensembl and InterPro Yanbin Yin http://www.ebi.ac.uk/training/online/course/ 1 Homework 3 Go to http://www.ebi.ac.uk/interpro/training.htmland finish the second online training course

### Linear-Quadratic Optimal Control: Full-State Feedback

Chapter 4 Linear-Quadratic Optimal Control: Full-State Feedback 1 Linear quadratic optimization is a basic method for designing controllers for linear (and often nonlinear) dynamical systems and is actually

### Multiple Sequence Alignment

Multiple Sequence Alignment Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences.! What about more than two? And what for?! A faint similarity between two

### Sparse, stable gene regulatory network recovery via convex optimization

Sparse, stable gene regulatory network recovery via convex optimization Arwen Meister June, 11 Gene regulatory networks Gene expression regulation allows cells to control protein levels in order to live