# An Introduction to Sequence Similarity ( Homology ) Searching

Size: px
Start display at page:

Transcription

1 An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same, or very similar, functions, so new sequences can be reliably assigned functions if homologous sequences with known functions can be identified. Homology is inferred based on sequence similarity, and many methods have been developed to identify sequences that have statistically significant similarity. This unit provides an overview of some of the basic issues in identifying similarity among sequences and points out other units in this chapter that describe specific programs that are useful for this task. Curr. Protoc. Bioinform. 27: C 2009 by John Wiley & Sons, Inc. Keywords: sequence similarity homology dynamic programming similarity-scoring matrices sequence alignment multiple alignment sequence evolution AN INTRODUCTION TO IDENTIFYING HOMOLOGOUS SEQUENCES Upon determining a new sequence, one common objective is to assign a function to that sequence, or perhaps multiple functions to various subsequences. Because homologous sequences, those related by common ancestry, generally have the same or very similar functions, their identification can be used to infer the function of the new sequence. The methods for identifying homologous sequences rely on finding similarities between sequences that are unlikely to happen by chance, thereby allowing the inference that the sequences have evolved from a common ancestor. The units in this Chapter all address aspects of this problem, finding significant similarity between new sequences and those that are currently in the databases of known sequences. This chapter introduction provides a primer on five fundamental issues related to similarity searches: (1) how to find the best alignment between two sequences; (2) methods for scoring an alignment; (3) speeding up database searches; (4) determining statistical significance; and (5) using information in multiple alignments to improve sensitivity. OPTIMAL SEQUENCE ALIGNMENTS Given any two sequences, there are an enormous number of ways they can be aligned if one allows for insertions and deletions (collectively referred to as indels or gaps) in the alignment. In order to assess whether the similarity score between two sequences is significant, one must obtain the highest possible score; otherwise one will underestimate the significance. Fortunately, there is a simple method, known as dynamic programming (DP), to determine the score of the best possible alignment between two sequences. Figure outlines the DP algorithm. The two sequences, A and B, are written across the top and down the side of the DP matrix. Each element of the matrix, S(i,j), represents the score of the best alignment from the beginning of each sequence up to residues a i,in sequence A, and b j, in sequence B. That element is determined by comparing just three numbers, which represent the three possible end points of those alignments: a i is aligned to b j ;a i is aligned to a gap in sequence B; b j is aligned to a gap in sequence A. The three numbers to be compared include the addition of the gap score, δ (which is always Current Protocols in Bioinformatics , September 2009 Published online September 2009 in Wiley Interscience ( DOI: / bi0301s27 Copyright C 2009 John Wiley & Sons, Inc

2 Seq B: b 1 Seq A: a 1 a 2 b 2 ai an b j b m S(i 1, j 1) S(i, j 1) S(i 1, j) S(i, j) S(i,j) max S(i 1, j 1) score(a i, b j ) S(i, j 1) S(i 1, j) Figure Dynamic programming algorithm for optimum sequence alignment. The two sequences are written across the top and along the right side of the matrix. The score of each element is determined by the simple rules shown for the enlarged section and described by the equations below it. (The top row and left column have special rules as described in the text.) The score of the best global alignment is the element S(n,m) and the alignment with that score can be obtained by backtracking through the matrix, determining the path that generated the score at each element. An Introduction to Sequence Similarity Searching negative because gaps are always penalized), to the elements directly above, S(i,j-1), and to the left, S(i-1,j), corresponding to the gaps in sequence B and A, respectively. The other number is the sum of the score of aligning a i with b j, which depends on what those residues are (see next section) and the score of the best alignment for the preceding subsequences, S(i-1,j-1). The top row and first column require special treatment, but the rule is still simple. Aligning a 1 with b 1 just gets score(a 1,b 1 ), but aligning a 1 with b j Current Protocols in Bioinformatics

3 requires j-1 gaps in sequence B, so S(1,j)=(j-1)δ + score(a 1,b j ). Once the top row and first column are initialized, the entire matrix can be filled following the simple rule shown in Figure 3.1.1, and the score of the best possible alignment of the two sequences is the element in the lower right corner of the matrix, S(n,m). The actual alignment that gives that score can be found by tracing backwards through the matrix, starting from the final element, to identify the preceding element that gave rise to each score. Keep in mind that there may be more than one alignment with the highest possible score, but most programs only show one of them. The result of the DP algorithm described above is a global alignment of the two sequences, where each residue of each sequence is included in the alignment and contributes to the score. However, often one is interested in the best local alignment, an alignment that may contain only a portion of one, or both, sequences. For example, two proteins may have a common domain that one would like to identify but the remaining parts of the sequences are unrelated. Alternatively, perhaps one is comparing a newly sequenced gene to the entire genome of another species and you only expect it to match the segment corresponding to the homologous gene. A simple modification of the basic DP algorithm can be used to identify the best local alignment, as first deduced by Smith and Waterman (1981). The simple change is that only positive values are kept in the DP matrix, which just means that zero is one of the values to be compared and the maximum taken for S(i,j). Now the score of the best alignment is not necessarily the element S(n,m), but may occur anywhere within the matrix. Once the highest score is found, the alignment that generated that score can be found by tracing backwards through the matrix, as before, but stopping when the first zero is encountered. One additional modification to the basic DP algorithm is often used. It is known from extensive studies of protein sequences that gaps of multiple residues are much more frequent than the same number of separate gaps. For example, a single deletion event might remove three residues at once, and that event is much more likely than for three separate events that each delete one residue. To capture this effect, one can use two gap penalties, one that counts the occurrence of a gap, usually referred to as the gap-opening penalty, and a different, lower one, that is proportional to the length of the gap. This can be easily incorporated into the DP algorithm with only a minor increase in the time required to find the best alignment (Fitch and Smith, 1983). The DP algorithm is guaranteed to give the highest-scoring alignment of the two sequences if the only evolutionary processes allowed are substitutions and indels. While real evolutionary processes also include duplications, transpositions, and inversions, those are not considered, as they cannot be handled by the DP approach and would drastically increase the time required to find the optimal alignment. However, it is generally seen that substitutions and indels are much more common than the other processes, so finding the best aligned considering only those processes will most frequently indicate the true optimum alignment. SCORING SEQUENCE SIMILARITY UNIT 3.5 discusses the construction of protein similarity matrices and considerations to take into account when choosing which to use for a specific search. This section briefly describes some of fundamental issues regarding similarity scores. The score of an alignment is the sum of the scores for each position in the alignment, as we saw in the DP algorithm for finding the highest-scoring alignment. For every pair of residues, which might be RNA, DNA, or protein, there is a score defined such that the more similar the sequences are, the higher the score. For RNA and DNA, the scores often just distinguish between matches, where the two residues are the same, and mismatches. However, sometimes it is useful to distinguish between mismatches that are transitions, changes from one purine or pyrimidine to the other, from those that are transversions, changes between a purine and a pyrimidine. For proteins, the situation is more complex and one needs a Current Protocols in Bioinformatics

4 score for all possible amino acid pairs and, ideally, one that reflects the probability of one amino acid being replaced by another during the course of evolution. Replacements are governed by two processes. The first is the mutation that changes one amino acid to another, and this will be much more likely for amino acids that have similar codons. For example, a single transition mutation can change a His to a Tyr (CAT to TAT), but three transversions would be necessary to change a His to a Met (CAT to ATG). The second process is that the mutation must survive selection for it to be observed, and this requires that the new amino acid can substitute for the previous one without disrupting the protein s function. This is more likely if the two amino acids have similar properties, such as size, charge, and hydrophobicity. Which properties are most important may depend on where the amino acid is within the protein sequence and what role it plays in determining the structure and function of the protein. Some positions in a protein may be highly constrained and only a few, perhaps only one, amino acids will work, whereas other positions may tolerate many different amino acids and therefore be highly variable in related protein sequences. This implies that one would really like to have different scoring matrices depending on the constraints at particular positions in the protein, but this would require many more parameters, and often one is lacking the information needed to choose the appropriate matrix. This problem can be partially addressed using multiple alignments of protein families, as described below. However, for general database searches, where one hopes to identify homologous sequences in the collection of known sequences, a general similarity-scoring matrix, which is the average over various types of constrained positions, is often utilized. Most commonly used similarity matrices are empirically based on reliable alignments of protein families. For instance, the BLOSUM matrices (UNIT 3.5) are based on the alignments in the BLOCKS protein family alignments (UNIT 2.2). These are reliable alignments of protein domains across many different protein families. The score assigned to a specific pair of residues is the log-odds of their co-occurrence in aligned positions compared to what would be expected by chance given their individual frequencies. So for two amino acids, a and b, the similarity score is: score ( ab, )= ln f( a, b) f()() a f b where f(a,b) is the observed frequency at which those amino acids are aligned and f(a) and f(b) are their frequencies in the entire database. If the two amino acids occur together at a frequency expected by chance, which is the product of the independent frequencies, then the score is 0. Amino acids that occur together more often than expected by chance get positive scores and ones that occur less frequently get negative scores. The score for aligning an amino acid with itself is always the highest score for that amino acid, as it is more likely to occur than any substitution. Positive-scoring pairs represent amino acids with similar properties, and they are often observed to substitute for one another, whereas negative scoring pairs are dissimilar amino acids that are less likely to substitute for each other. It is important to note that the expected score of a random alignment is negative. In addition to the amino acid similarity scores, one also needs to have penalties for indels in the alignment, and those values are also based empirically on many reliable alignments over many different protein families. FAST SEARCHING METHODS An Introduction to Sequence Similarity Searching The DP method described above is guaranteed to find the highest-scoring alignment between two sequences. The time it takes to complete the alignment is proportional to the number of DP matrix elements to compute, which is the product of the sequence lengths. While this is very fast for comparing any two sequences of reasonable length, it is Current Protocols in Bioinformatics

5 not practical for searching the current sequence databases, which contain many millions of sequences and many billions of residues. Therefore methods have been developed that can search entire databases much faster. While these methods do not guarantee finding the absolute best alignments, they have been finely optimized so that they have very high sensitivities and generally do obtain the optimal, or near-optimal, alignments. The most commonly used method is BLAST (Altschul et al., 1990; UNITS 3.3, 3.4 & 3.11). BLAST is convenient to use through the online service (UNITS 3.3 & 3.4), has access to the major DNA and protein databases, and provides convenient tools for displaying and analyzing the output. One can also download BLAST to a local computer for in-house use (UNIT 3.11). But most importantly BLAST has been shown to be very sensitive, giving results nearly equivalent to running the Smith/Waterman DP algorithm on the whole database but in a small fraction of the time. As a brief and approximate description of how BLAST works, consider the DP matrix of Figure Most of the time is spent determining the scores of elements that do not contribute to the final, optimum alignment. BLAST achieves its speed by eliminating most of the computations and focusing its effort on the highest-scoring paths through the matrix. It does this by creating an index of the words, short subsequences, from the query sequence, and high-scoring matches to those words can be identified very quickly in the database. Those initial matches, or hits, are extended using DP but with cutoffs employed so that when any extension becomes unlikely to lead to a significant match it is terminated. The matches that do lead to significant alignments are extended as far as possible to give a result essentially equivalent to that obtained by the full DP algorithm. Since most sequences in the database, and most of the possible alignments even to the most similar sequences, are eliminated quickly from further consideration, BLAST achieves an enormous increase in speed with only a small loss of sensitivity. THE SIGNIFICANCE OF AN ALIGNMENT SCORE Obtaining a score for an alignment does not, by itself, tell you whether it is significant. One needs to determine what is the probability of observing such a score by chance, given the scoring system used and the lengths of the sequences being compared. It is important to realize that the distribution of optimum alignment scores is not normal. If one were to consider the scores of all possible alignments, one would expect a normal distribution. However, we are only considering the highest-scoring alignments to each sequence in the database, and the distribution of those maximum, or extreme, values will be quite different from the distribution of scores for all alignments. Consider the following simple example. If one throws two dice, the score distribution follows the binomial distribution, and the probability of getting a 12 is 1/36. But if one throws two pair of dice, and records only the highest score, the distribution is quite different, with the probability of getting a 12 nearly twice as high, 71/36 2 to be precise. When searching a large database with a query sequence, one is interested only in the highest-scoring alignments, those that are mostly likely to represent homologous sequences. However, to answer the question of how high a score is required to be unlikely to have occurred by chance, one must know the distribution of highest scores expected by chance. Karlin and Altschul (1990) showed that for alignments without gaps, the distribution of highest scores follows a Gumbel distribution, a type of extreme value distribution. Furthermore, they showed how to compute the dependency of that score distribution on the scoring system that is used and the length of the query and database sequences. In addition, while their theory only holds analytically for ungapped alignments, empirical evidence indicates that it holds quite well for gapped alignments too. This allows one to compute a statistical significance for the best scoring alignments in a database search and from that infer whether the best matches are likely to be homologous sequences or could have arisen by chance. The BLAST program computes these values as part of the output, with Current Protocols in Bioinformatics

6 most people focusing on the E-value, which is the number matches with equal or greater scores that are expected by chance. When E-values are much less than one, they are essentially the same as p-values, the probability of an equal or greater score by chance, but E-values can exceed one, whereas p-values cannot. MAKING AND USING MULTIPLE SEQUENCE ALIGNMENTS The previous sections of this unit have focused on aligning pairs of sequences, or more generally searching an entire database for high-scoring alignments to a query sequence, which requires many pairwise comparisons. However, if one has several members of a protein family, an alignment of those sequences can provide additional information about the sequence constraints at different positions of the proteins. For example, some positions may be critical for the function of those proteins and perhaps only one, or a few, amino acids are allowed at those positions. If that is the case, then a general similarityscoring matrix is not appropriate and one should utilize the information available from the multiple alignments to constrain the search for homologous proteins. The two main issues are how the multiple alignments are made and how they are used to identify new members of the protein family. Several units in this chapter and the previous one address various aspects of the construction and use of multiple sequence alignments. UNIT 3.7 provides an overview of multiple sequence alignment. One can generate optimal multiple alignments by extending the DP method to multiple sequences, following the same rules as described above but expanded to multiple dimensions, one for each sequence. However, the time required to fill in the DP matrix is proportional to the product of the length of all the sequences, which becomes impractical for more than a few sequences (Lipman et al., 1989). Practical methods for multiple sequence alignment use one of two approaches. One is a progressive alignment strategy where pairs of sequences are aligned and then new sequences are added to that alignment, or alignments are aligned to each other, but all steps only involve pairwise comparisons that can be accomplished efficiently using the DP strategy described. The Pileup program from GCG (UNIT 3.6) uses that strategy, as does the ClustalW program (UNIT 2.3), which is probably the most popular method for constructing multiple alignments. Psi-BLAST also uses a similar iterative strategy within the BLAST suite of programs (UNIT 3.4) to build multiple alignments from the matches identified by BLAST. The other strategy uses an interative refinement approach, where an initial approximate alignment is built and then sequences are realigned to the resulting model until the method converges to a final multiple alignment. The method is based on an expectation maximization (EM)algorithm,similar to that used in MEME (UNIT 2.4) but allowing gaps in the alignment. The EM algorithm is at the heart of the Hidden Markov Model (HMM) approach to protein family models first developed by the Haussler group (Krogh et al., 1994; Eddy, 1996; overview in UNIT 3.7 & APPENDIX 3A). In an HMM each position of the alignment is represented by a probability distribution over all of the amino acids and there are probabilities assigned for insertions and deletions. This model captures, in a probabilistic fashion, the natural variability and constraints that are observed in a protein family. The T-Coffee approach (UNIT 3.8) is somewhat different in that it combines, through a consistency-based scoring system, pairwise alignments that can be obtained from a variety of different and independent approaches. For example, if structural information is available and useful for determining the correct alignment that can be incorporated along with other types of information. An Introduction to Sequence Similarity Searching Once a multiple alignment is built, it can be used to identify new sequences that are members of the family. Gribskov et al. (1987) first described such an approach where the multiple alignment is converted to a profile, and new sequences are scored against it using a DP algorithm and a sum-of-pairs method. This is equivalent to finding the best Current Protocols in Bioinformatics

7 alignment of the new sequence to the multiple sequence alignment, where the scoring is the average similarity score to each sequence in the alignment. This is also how ClustalW works internally, as it is building up the multiple alignment (UNIT 2.3). A new sequence can also be aligned to an HMM for a protein family using a version of the DP algorithm, the same method that is used in the iterative realignment for building the HMM. Pfam (UNIT 2.5) is a database of protein families represented as HMMs, and a new sequence can be compared to the database to determine how similar it is to each family and if it can be inferred to belong to any of the families. The InterPro resource of databases and analysis tools (UNIT 2.7) also facilitates the identification of protein families that may be homologous to a new sequence of interest. SUMMARY When one obtains a new sequence and wants to assign a function to it, identifying homologous sequences with known function is a reliable strategy. Two complementary approaches can be easily employed and are generally very useful. The first is to search the databases of known sequences, using programs like BLAST, to determine if there are any that are sufficiently similar to be homologous, and if those sequences have known functions such that information can be transferred with reasonable reliability. One can also search the databases of known protein families that include the extra information from multiple sequence alignments, such as Pfam and the InterPro databases. While those protein family databases are less comprehensive than the single sequence databases, simply because many sequences do not have known family membership, they can be more sensitive at identifying distantly related sequences that may be missed without knowledge of the specific constraints associated with specific families. Therefore, a combined approach that utilizes both types of resources can be most effective in identifying homologous sequences for a new sequence of interest. LITERATURE CITED Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J Basic local alignment search tool. J. Mol. Biol. 215: Eddy, S.R Hidden Markov models. Curr. Opin. Struct. Biol. 6: Fitch, W. and Smith, T Optimal sequence alignments. Proc. Natl. Acad. Sci. U.S.A. 80: Gribskov, M., McLachlan, A.D., and Eisenberg, D Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84: Karlin, S. and Altschul, S.F Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. U.S.A. 87: Krogh, A., Brown, M., Mian, I.S., Sjölander, K., and Haussler, D Hidden Markov models in computational biology. Applications to protein modeling. J. Mol. Biol. 235: Lipman, D.J., Altschul, S.F., and Kececioglu, J.D A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. U.S.A. 86: Smith, T.F. and Waterman, M.S Identification of common molecular subsequences. J. Mol. Biol. 147: Current Protocols in Bioinformatics

### In-Depth Assessment of Local Sequence Alignment

2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

### Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

### Sequence Alignment Techniques and Their Uses

Sequence Alignment Techniques and Their Uses Sarah Fiorentino Since rapid sequencing technology and whole genomes sequencing, the amount of sequence information has grown exponentially. With all of this

### THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

### 3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

### Local Alignment Statistics

Local Alignment Statistics Stephen Altschul National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, MD Central Issues in Biological Sequence Comparison

### Single alignment: Substitution Matrix. 16 march 2017

Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block

### Sequence Database Search Techniques I: Blast and PatternHunter tools

Sequence Database Search Techniques I: Blast and PatternHunter tools Zhang Louxin National University of Singapore Outline. Database search 2. BLAST (and filtration technique) 3. PatternHunter (empowered

### Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

### A profile-based protein sequence alignment algorithm for a domain clustering database

A profile-based protein sequence alignment algorithm for a domain clustering database Lin Xu,2 Fa Zhang and Zhiyong Liu 3, Key Laboratory of Computer System and architecture, the Institute of Computing

### MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

MATHEMATICAL MODELING AND THE HUMAN GENOME Hilary S. Booth Australian National University, Australia Keywords: Human genome, DNA, bioinformatics, sequence analysis, evolution. Contents 1. Introduction:

### Sequence Alignment (chapter 6)

Sequence lignment (chapter 6) he biological problem lobal alignment Local alignment Multiple alignment Introduction to bioinformatics, utumn 6 Background: comparative genomics Basic question in biology:

### Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

### Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

### Bioinformatics and BLAST

Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists

### Algorithms in Bioinformatics

Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

### Introduction to Bioinformatics

Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression

### Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2, 2008 39 5 Blast This lecture is based on the following, which are all recommended reading: R. Merkl, S. Waack: Bioinformatik Interaktiv. Chapter 11.4-11.7

### CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs15.html Describing & Modeling Patterns

### CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of

### 1.5 Sequence alignment

1.5 Sequence alignment The dramatic increase in the number of sequenced genomes and proteomes has lead to development of various bioinformatic methods and algorithms for extracting information (data mining)

### Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Biochemistry 324 Bioinformatics Pairwise sequence alignment How do we compare genes/proteins? When we have sequenced a genome, we try and identify the function of unknown genes by finding a similar gene

### Week 10: Homology Modelling (II) - HHpred

Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative

### Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

Sequence lignment (chapter ) he biological problem lobal alignment Local alignment Multiple alignment Background: comparative genomics Basic question in biology: what properties are shared among organisms?

### Sequence analysis and comparison

The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

### Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Tiffany Samaroo MB&B 452a December 8, 2003 Take Home Final Topic 1 Prior to 1970, protein and DNA sequence alignment was limited to visual comparison. This was a very tedious process; even proteins with

### O 3 O 4 O 5. q 3. q 4. Transition

Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

### Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

### Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Lecture 2, 12/3/2003: Introduction to sequence alignment The Needleman-Wunsch algorithm for global sequence alignment: description and properties Local alignment the Smith-Waterman algorithm 1 Computational

### Biologically significant sequence alignments using Boltzmann probabilities

Biologically significant sequence alignments using Boltzmann probabilities P. Clote Department of Biology, Boston College Gasson Hall 416, Chestnut Hill MA 02467 clote@bc.edu May 7, 2003 Abstract In this

### Sequence analysis and Genomics

Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

### Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm Alignment scoring schemes and theory: substitution matrices and gap models 1 Local sequence alignments Local sequence alignments are necessary

### Multiple sequence alignment

Multiple sequence alignment Multiple sequence alignment: today s goals to define what a multiple sequence alignment is and how it is generated; to describe profile HMMs to introduce databases of multiple

### Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Sequence Alignments Dynamic programming approaches, scoring, and significance Lucy Skrabanek ICB, WMC January 31, 213 Sequence alignment Compare two (or more) sequences to: Find regions of conservation

### Computational Biology

Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

### Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

### Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

### Homology Modeling. Roberto Lins EPFL - summer semester 2005

Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,

### Pairwise & Multiple sequence alignments

Pairwise & Multiple sequence alignments Urmila Kulkarni-Kale Bioinformatics Centre 411 007 urmila@bioinfo.ernet.in Basis for Sequence comparison Theory of evolution: gene sequences have evolved/derived

### Similarity searching summary (2)

Similarity searching / sequence alignment summary Biol4230 Thurs, February 22, 2016 Bill Pearson wrp@virginia.edu 4-2818 Pinn 6-057 What have we covered? Homology excess similiarity but no excess similarity

### Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and

### Optimization of a New Score Function for the Detection of Remote Homologs

PROTEINS: Structure, Function, and Genetics 41:498 503 (2000) Optimization of a New Score Function for the Detection of Remote Homologs Maricel Kann, 1 Bin Qian, 2 and Richard A. Goldstein 1,2 * 1 Department

### An Introduction to Bioinformatics Algorithms Hidden Markov Models

Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

### Hidden Markov Models

Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

### Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Introduction to Bioinformatics online course : IBT Jonathan Kayondo Learning Objectives Understand

### InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

### Collected Works of Charles Dickens

Collected Works of Charles Dickens A Random Dickens Quote If there were no bad people, there would be no good lawyers. Original Sentence It was a dark and stormy night; the night was dark except at sunny

### BLAST. Varieties of BLAST

BLAST Basic Local Alignment Search Tool (1990) Altschul, Gish, Miller, Myers, & Lipman Uses short-cuts or heuristics to improve search speed Like speed-reading, does not examine every nucleotide of database

### Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Sequence Analysis 17: lecture 5 Substitution matrices Multiple sequence alignment Substitution matrices Used to score aligned positions, usually of amino acids. Expressed as the log-likelihood ratio of

### Large-Scale Genomic Surveys

Bioinformatics Subtopics Fold Recognition Secondary Structure Prediction Docking & Drug Design Protein Geometry Protein Flexibility Homology Modeling Sequence Alignment Structure Classification Gene Prediction

### Quantifying sequence similarity

Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity

### Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics GCBA815, Fall 2013 Week3: Blast Algorithm, theory and practice Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and Systems Biology

### Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Pairwise Alignment Guan-Shieng Huang shieng@ncnu.edu.tw Dept. of CSIE, NCNU Pairwise Alignment p.1/55 Approach 1. Problem definition 2. Computational method (algorithms) 3. Complexity and performance Pairwise

### Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Alignment principles and homology searching using (PSI-)BLAST Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) http://ibivu.cs.vu.nl Bioinformatics Nothing in Biology makes sense except in

### CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004

CSE 397-497: Computational Issues in Molecular Biology Lecture 6 Spring 2004-1 - Topics for today Based on premise that algorithms we've studied are too slow: Faster method for global comparison when sequences

### BLAST: Basic Local Alignment Search Tool

.. CSC 448 Bioinformatics Algorithms Alexander Dekhtyar.. (Rapid) Local Sequence Alignment BLAST BLAST: Basic Local Alignment Search Tool BLAST is a family of rapid approximate local alignment algorithms[2].

### BLAST: Target frequencies and information content Dannie Durand

Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences

### Chapter 7: Rapid alignment methods: FASTA and BLAST

Chapter 7: Rapid alignment methods: FASTA and BLAST The biological problem Search strategies FASTA BLAST Introduction to bioinformatics, Autumn 2007 117 BLAST: Basic Local Alignment Search Tool BLAST (Altschul

### Motivating the need for optimal sequence alignments...

1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use

Introduction to Bioinformatics Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr Multiple Sequence Alignment Outline Multiple sequence alignment introduction to msa methods of msa progressive global alignment

### First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences 140.638 where do sequences come from? DNA is not hard to extract (getting DNA from a

### Pairwise sequence alignments

Pairwise sequence alignments Volker Flegel VI, October 2003 Page 1 Outline Introduction Definitions Biological context of pairwise alignments Computing of pairwise alignments Some programs VI, October

### Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Bioinformatics Scoring Matrices David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Learning Objectives To explain the requirement

### Pairwise sequence alignment

Department of Evolutionary Biology Example Alignment between very similar human alpha- and beta globins: GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

### Basic Local Alignment Search Tool

Basic Local Alignment Search Tool Alignments used to uncover homologies between sequences combined with phylogenetic studies o can determine orthologous and paralogous relationships Local Alignment uses

### SEQUENCE alignment is an underlying application in the

194 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 1, JANUARY/FEBRUARY 2011 Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific

Feinberg Graduate School of the Weizmann Institute of Science Advanced topics in bioinformatics Shmuel Pietrokovski & Eitan Rubin Spring 2003 Course WWW site: http://bioinformatics.weizmann.ac.il/courses/atib

### 7.36/7.91 recitation CB Lecture #4

7.36/7.91 recitation 2-19-2014 CB Lecture #4 1 Announcements / Reminders Homework: - PS#1 due Feb. 20th at noon. - Late policy: ½ credit if received within 24 hrs of due date, otherwise no credit - Answer

### Alignment & BLAST. By: Hadi Mozafari KUMS

Alignment & BLAST By: Hadi Mozafari KUMS SIMILARITY - ALIGNMENT Comparison of primary DNA or protein sequences to other primary or secondary sequences Expecting that the function of the similar sequence

### Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Sequence Alignment: Scoring Schemes COMP 571 Luay Nakhleh, Rice University Scoring Schemes Recall that an alignment score is aimed at providing a scale to measure the degree of similarity (or difference)

### bioinformatics 1 -- lecture 7

bioinformatics 1 -- lecture 7 Probability and conditional probability Random sequences and significance (real sequences are not random) Erdos & Renyi: theoretical basis for the significance of an alignment

### Moreover, the circular logic

Moreover, the circular logic How do we know what is the right distance without a good alignment? And how do we construct a good alignment without knowing what substitutions were made previously? ATGCGT--GCAAGT

### Study and Implementation of Various Techniques Involved in DNA and Protein Sequence Analysis

Study and Implementation of Various Techniques Involved in DNA and Protein Sequence Analysis Kumud Joseph Kujur, Sumit Pal Singh, O.P. Vyas, Ruchir Bhatia, Varun Singh* Indian Institute of Information

### Lecture Notes: Markov chains

Computational Genomics and Molecular Biology, Fall 5 Lecture Notes: Markov chains Dannie Durand At the beginning of the semester, we introduced two simple scoring functions for pairwise alignments: a similarity

### Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

### Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI - informal inductive proof of best alignment path onsider the last step in

### Pairwise Sequence Alignment

Introduction to Bioinformatics Pairwise Sequence Alignment Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr Outline Introduction to sequence alignment pair wise sequence alignment The Dot Matrix Scoring

### CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST

### Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence

Page Hidden Markov models and multiple sequence alignment Russ B Altman BMI 4 CS 74 Some slides borrowed from Scott C Schmidler (BMI graduate student) References Bioinformatics Classic: Krogh et al (994)

### Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Institute of Bioinformatics Johannes Kepler University, Linz, Austria Sequence Alignment 2. Sequence Alignment Sequence Alignment 2.1

### C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment

C E N T R E F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U Introduction to bioinformatics 2007 Lecture 5 Pair-wise Sequence Alignment Bioinformatics Nothing in Biology makes sense except in

### Practical considerations of working with sequencing data

Practical considerations of working with sequencing data File Types Fastq ->aligner -> reference(genome) coordinates Coordinate files SAM/BAM most complete, contains all of the info in fastq and more!

### Scoring Matrices. Shifra Ben-Dor Irit Orr

Scoring Matrices Shifra Ben-Dor Irit Orr Scoring matrices Sequence alignment and database searching programs compare sequences to each other as a series of characters. All algorithms (programs) for comparison

### Overview Multiple Sequence Alignment

Overview Multiple Sequence Alignment Inge Jonassen Bioinformatics group Dept. of Informatics, UoB Inge.Jonassen@ii.uib.no Definition/examples Use of alignments The alignment problem scoring alignments

### Hidden Markov Models. Main source: Durbin et al., Biological Sequence Alignment (Cambridge, 98)

Hidden Markov Models Main source: Durbin et al., Biological Sequence Alignment (Cambridge, 98) 1 The occasionally dishonest casino A P A (1) = P A (2) = = 1/6 P A->B = P B->A = 1/10 B P B (1)=0.1... P

### CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

CMPS 6630: Introduction to Computational Biology and Bioinformatics Structure Comparison Protein Structure Comparison Motivation Understand sequence and structure variability Understand Domain architecture

### BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

BLAST Database Searching BME 110: CompBio Tools Todd Lowe April 8, 2010 Admin Reading: Read chapter 7, and the NCBI Blast Guide and tutorial http://www.ncbi.nlm.nih.gov/blast/why.shtml Read Chapter 8 for

### Fundamentals of database searching

Fundamentals of database searching Aligning novel sequences with previously characterized genes or proteins provides important insights into their common attributes and evolutionary origins. The principles

### Similarity or Identity? When are molecules similar?

Similarity or Identity? When are molecules similar? Mapping Identity A -> A T -> T G -> G C -> C or Leu -> Leu Pro -> Pro Arg -> Arg Phe -> Phe etc If we map similarity using identity, how similar are

### Lecture 5,6 Local sequence alignment

Lecture 5,6 Local sequence alignment Chapter 6 in Jones and Pevzner Fall 2018 September 4,6, 2018 Evolution as a tool for biological insight Nothing in biology makes sense except in the light of evolution

### Sequence comparison: Score matrices

Sequence comparison: Score matrices http://facultywashingtonedu/jht/gs559_2013/ Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI - informal inductive proof of best

### Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

1 Bioinformatics: In-depth PROBABILITY & STATISTICS Spring Semester 2011 University of Zürich and ETH Zürich Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). Dr. Stefanie Muff

### Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel )

Pairwise sequence alignments Vassilios Ioannidis (From Volker Flegel ) Outline Introduction Definitions Biological context of pairwise alignments Computing of pairwise alignments Some programs Importance

### MPIPairwiseStatSig: Parallel Pairwise Statistical Significance Estimation of Local Sequence Alignment

MPIPairwiseStatSig: Parallel Pairwise Statistical Significance Estimation of Local Sequence Alignment Ankit Agrawal, Sanchit Misra, Daniel Honbo, Alok Choudhary Dept. of Electrical Engg. and Computer Science

### 17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on:

17 Non-collinear alignment This exposition is based on: 1. Darling, A.E., Mau, B., Perna, N.T. (2010) progressivemauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 5(6):e11147.

### Introduction to protein alignments

Introduction to protein alignments Comparative Analysis of Proteins Experimental evidence from one or more proteins can be used to infer function of related protein(s). Gene A Gene X Protein A compare

### Sequence Comparison. mouse human

Sequence Comparison Sequence Comparison mouse human Why Compare Sequences? The first fact of biological sequence analysis In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity

### INFORMATION-THEORETIC BOUNDS OF EVOLUTIONARY PROCESSES MODELED AS A PROTEIN COMMUNICATION SYSTEM. Liuling Gong, Nidhal Bouaynaya and Dan Schonfeld

INFORMATION-THEORETIC BOUNDS OF EVOLUTIONARY PROCESSES MODELED AS A PROTEIN COMMUNICATION SYSTEM Liuling Gong, Nidhal Bouaynaya and Dan Schonfeld University of Illinois at Chicago, Dept. of Electrical