

MICROBIAL & COMPARATIVE GENOMICS
Volume 1, Number 4, 1996
Mary Ann Liebert, Inc.

Fast Comparison of a DNA Sequence with a Protein Sequence Database

XIAOQIU HUANG

Department of Computer Science, Michigan Technological University, Houghton, Michigan.

ABSTRACT

We describe a computer program, named DNA-Protein Search (DPS), for comparing a megabase DNA sequence with a protein sequence database. The DPS program addresses the problems of frameshifts and introns in the DNA sequence. The DPS program was used to compare each of the following sequences with the Swiss-Prot database: the 1.8-megabase sequence of the Haemophilus influenzae Rd genome, the 0.58-megabase sequence of the Mycoplasma genitalium genome, and the 0.56-megabase sequence of Saccharomyces cerevisiae chromosome VIII. The comparisons found new regions that are similar to protein sequences. The sensitivity of DPS was evaluated using as test data the known coding regions of the three DNA sequences. The results demonstrate that the DPS program is a useful tool for finding the coding regions of the DNA sequence. The DPS program uses an order of magnitude less computer memory and is several times faster than the BLASTX program.

INTRODUCTION

Database search programs are one of the most important tools for analysis of DNA and protein sequences (Pearson and Lipman, 1988; Altschul et al., 1990). One type of database search is to compare a newly determined DNA sequence with a protein database in order to find the coding regions of the DNA sequence (Johnston et al., 1994; Fleischmann et al., 1995; Fraser et al., 1995). Existing database search programs translate the DNA sequence in all six reading frames and rapidly compare each translated sequence with each protein sequence in the database. The TFASTA program of Pearson and Lipman (1988) computes a high-scoring local alignment between the translated sequence and the protein sequence. The BLASTX program computes high-scoring gap-free local alignments (maximal segment pairs) between the translated sequence and the protein sequence (Gish and States, 1993). Frameshifts and introns in the DNA sequence cause the existing programs to produce several small alignments, instead of a large alignment. It is possible that the large alignment is significant, whereas each of its pieces is not.

We describe an approach to comparing a DNA sequence with a protein database. The approach enhances the existing methods by addressing the problems of frameshifts and introns. Our approach computes high-scoring chains of segment pairs, where segment pairs in a chain can be from different reading frames, and there can be an intervening DNA sequence between adjacent segment pairs in a chain. Our method has been implemented as a portable computer program named DNA-Protein Search (DPS). The DPS program was used to compare each of the following sequences with the Swiss-Prot database: the 1.8-megabase sequence of the Haemophilus influenzae Rd genome (Fleischmann et al., 1995), the 0.58-megabase sequence of the Mycoplasma genitalium genome (Fraser et al., 1995), and the 0.56-megabase sequence of Saccharomyces cerevisiae chromosome VIII (Johnston et al., 1994). The comparisons found new regions that are similar to protein sequences. The sensitivity of DPS was evaluated using as test data the known coding regions of the three DNA sequences. The results demonstrate that the DPS program is a useful tool for finding the coding regions of the DNA sequence. The DPS program uses an order of magnitude less computer memory and is several times faster than the BLASTX program.

MATERIALS AND METHODS

Chains of segment pairs

A segment pair between a DNA sequence and a protein sequence is a gap-free alignment of two segments of the sequences, where the length of the DNA segment is exactly three times the length of the protein segment, and each residue of the protein segment is aligned with a codon of the DNA segment. An aligned pair of codon and protein residue is a match if the codon codes for the residue. We begin with a matrix that gives each pair of protein residues a similarity score. Let g(a) denote the amino acid that is coded for by a nonstop codon a. The score of an aligned pair of nonstop codon a and protein residue b is the score of the pair of protein residues g(a) and b. The score of an aligned pair of stop codon and protein residue is the minimum score of pairs of protein residues. The score of a segment pair is the sum of the similarity scores of each aligned pair in the segment. For a segment pair s, let nstart(s) and nend(s) denote the starting and ending positions of the DNA segment in the DNA sequence, let astart(s) and aend(s) denote the starting and ending positions of the protein segment in the protein sequence, and let score(s) denote the score of s.
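The scoring rule for a segment pair can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_similarity` is a hypothetical stand-in for a real substitution matrix such as BLOSUM62, positions are 0-based, and the function names are not taken from the DPS source code.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def toy_similarity(x, y):
    """Hypothetical toy matrix: +5 for identical residues, -2 otherwise."""
    return 5 if x == y else -2


MIN_SCORE = -2  # minimum entry of the toy matrix, charged for stop codons


def pair_score(codon, residue):
    """Score of one aligned pair of codon and protein residue."""
    amino = CODON_TABLE[codon]
    if amino == "*":
        return MIN_SCORE
    return toy_similarity(amino, residue)


def segment_pair_score(dna, nstart, protein, astart, length):
    """Sum the scores of `length` aligned pairs starting at the given positions."""
    total = 0
    for k in range(length):
        codon = dna[nstart + 3 * k: nstart + 3 * k + 3]
        total += pair_score(codon, protein[astart + k])
    return total
```

For example, `segment_pair_score("ATGGCTTGA", 0, "MAK", 0, 3)` scores two matches (ATG/M and GCT/A) and one stop codon aligned with K.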
The first antidiagonal of a segment pair s is defined to be

\[ \mathrm{antis}(s) = \mathrm{nstart}(s) + 3 \times \mathrm{astart}(s), \]

and the last antidiagonal of s is defined to be

\[ \mathrm{antid}(s) = \mathrm{nend}(s) + 3 \times \mathrm{aend}(s), \]

where the protein position is scaled up by a factor of 3, since an amino acid corresponds to three nucleotides. A chain of segment pairs is a list of segment pairs in increasing order of their last antidiagonal such that each segment pair is not far from its predecessor and adjacent segment pairs do not have a large overlap. Specifically, any two adjacent segment pairs s and s' in the list satisfy the requirement

\[ \mathrm{antis}(s') - \mathrm{antid}(s) < A, \quad \mathrm{astart}(s') - \mathrm{aend}(s) > -B, \quad \text{and} \quad \mathrm{nstart}(s') - \mathrm{nend}(s) > -3B \]

for some nonnegative integers A and B. Let close(s,s') denote the condition given above. A chain of segment pairs is used as an abstract representation of a local alignment between the DNA and protein sequences, with the segment pairs being ungapped portions of the alignment. Note that the use of the A threshold permits efficient computation of high-scoring chains. However, it prevents a chain from representing a local alignment involving introns of length greater than A + 3B. To solve this problem, a large value should be used for the A parameter if the DNA sequence is expected to contain long introns.

Because the region of the DNA sequence between two adjacent segment pairs can be an intron, we do not charge any gap penalty for the region. However, we charge a linear penalty for the region of the protein sequence between two adjacent segment pairs. For some nonnegative integers q and r, the penalty for connecting two segment pairs s and s' is

\[ \mathrm{gap}(s, s') = q + r \times l(\mathrm{astart}(s') - \mathrm{aend}(s)), \]

where l(x) = x if x > 0 and 0 otherwise. Under this gap scoring scheme, a DNA gap is not penalized even if it is not an intron. For two adjacent segment pairs s and s' in a chain, define tscore(s,s') to be the score of the longest portion of s' that has no overlap with s.
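The antidiagonal, closeness, and gap definitions can be sketched as follows. The sketch assumes the closeness condition reads antis(s') - antid(s) < A, astart(s') - aend(s) > -B, and nstart(s') - nend(s) > -3B; the class and function names are illustrative and are not taken from the DPS source code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentPair:
    nstart: int  # first DNA position of the segment
    nend: int    # last DNA position of the segment
    astart: int  # first protein position of the segment
    aend: int    # last protein position of the segment
    score: int


def antis(s):
    """First antidiagonal: protein position scaled by 3."""
    return s.nstart + 3 * s.astart


def antid(s):
    """Last antidiagonal."""
    return s.nend + 3 * s.aend


def close(s, t, A, B):
    """True if t may directly follow s in a chain (assumed form of the condition)."""
    return (antis(t) - antid(s) < A
            and t.astart - s.aend > -B
            and t.nstart - s.nend > -3 * B)


def positive_part(x):
    """The function l(x): x if x > 0 and 0 otherwise."""
    return x if x > 0 else 0


def gap(s, t, q, r):
    """Linear penalty on the protein region between s and t; DNA gaps are free."""
    return q + r * positive_part(t.astart - s.aend)
```

With `s = SegmentPair(100, 159, 10, 29, 60)` and `t = SegmentPair(500, 559, 32, 51, 70)`, the 341-base DNA gap costs nothing, while the protein gap gives `gap(s, t, q=15, r=1) == 18`.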
The score of a chain c of segment pairs $s_1, s_2, \ldots, s_m$ is defined to be

\[ \mathrm{score}(c) = \mathrm{score}(s_1) + \sum_{i=2}^{m} \left[ \mathrm{tscore}(s_{i-1}, s_i) - \mathrm{gap}(s_{i-1}, s_i) \right]. \]

To ensure that each segment pair contributes to the chain, we require that for any two adjacent segment pairs s and s' in the chain, tscore(s,s') be greater than a threshold. Two segment pairs s and s' are identical if astart(s) = astart(s') and nstart(s) = nstart(s'). Two chains of segment pairs are nonintersecting if they do not have any common segment pair (Chao and Miller, 1995). Note that two segment pairs s and s' that have a DNA region in common but different protein regions are not identical. So the chain that consists only of the segment pair s and the chain that consists only of the segment pair s' are nonintersecting.

Fast computation of chains of segment pairs

We describe a rapid method for comparing the DNA sequence with each protein sequence in the database. The segment pairs of score greater than a threshold I between the DNA sequence and a protein sequence are approximately computed using a hashing technique. A lookup table is constructed for the DNA sequence such that for each protein word of length W, the table provides the positions of all the regions of the DNA sequence that exactly code for the protein word. The value for W is usually between 3 and 5. For each position p of the protein sequence, the lookup table is used to locate the regions of the DNA sequence that code for the protein word of length W at position p. Each hit in the DNA sequence is extended in each direction until the score drops a distance D below the maximum score found so far (Altschul et al., 1990). If a hit is contained in a segment pair already considered, the hit is not extended. The segment pair of the maximum score found during the extension is saved if the score is greater than I.

After the computation of segment pairs between the DNA and the protein sequences, nonintersecting chains of segment pairs with score greater than a threshold F are computed. Let $s_1, s_2, \ldots, s_n$ be a list of all the segment pairs in increasing order of their last antidiagonal. Note that the DNA segments in the segment pairs can be from different reading frames of the DNA sequence.
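The lookup-table step described above can be sketched as follows. This is a minimal illustration, not the DPS implementation: it indexes only the forward strand (every start position, hence all three forward frames), uses a Python dictionary rather than a fixed-size table, and omits the extension of hits. All names are hypothetical.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def build_lookup(dna, W):
    """Map each protein word of length W to the 0-based DNA positions coding for it."""
    table = defaultdict(list)
    for p in range(len(dna) - 3 * W + 1):
        word = []
        for k in range(W):
            amino = CODON_TABLE.get(dna[p + 3 * k: p + 3 * k + 3])
            if amino is None or amino == "*":
                break  # ambiguous base or stop codon: no word starts here
            word.append(amino)
        else:
            table["".join(word)].append(p)
    return table


def find_hits(table, protein, W):
    """Yield (dna_position, protein_position) for every exact word match."""
    for p in range(len(protein) - W + 1):
        for n in table.get(protein[p:p + W], ()):
            yield n, p
```

For the DNA string `"CCATGGCTAAAGG"` and W = 3, the word MAK is coded at position 2, so searching the protein `"XMAKX"` yields the single hit `(2, 1)`.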
Let $H(s_i)$ be the maximum score of chains ending with segment pair $s_i$. The matrix H is computed using the technique of Wilbur and Lipman (1983):

\[ H(s_1) = \mathrm{score}(s_1), \]

\[ H(s_i) = \max\{\mathrm{score}(s_i),\ H(s_j) + \mathrm{tscore}(s_j, s_i) - \mathrm{gap}(s_j, s_i) : j < i,\ \mathrm{close}(s_j, s_i),\ \text{and}\ \mathrm{tscore}(s_j, s_i) > 1\} \quad \text{for } i > 1. \]

For segment pairs $s_j$ and $s_i$ with j < i, if the thresholds A and B are violated or the score of the nonoverlapping portion of $s_i$ is not large enough, then $s_j$ is excluded from consideration as an immediate predecessor to $s_i$ in any chain. To compute $H(s_i)$, we just need to use each $s_j$ in decreasing value of j such that $\mathrm{antid}(s_j) > \mathrm{antis}(s_i) - A$. To compute $\mathrm{tscore}(s_j, s_i)$ efficiently for each $s_j$, we precompute an array R of size B for $s_i$, where for $0 \le k < B$, R(k) is the sum of the scores of the first k + 1 aligned pairs in $s_i$ if there are at least k + 1 aligned pairs in $s_i$ and $\mathrm{score}(s_i)$ otherwise. Then for each $s_j$, let aover denote $\mathrm{astart}(s_i) - \mathrm{aend}(s_j)$, and let nover denote $\mathrm{nstart}(s_i) - \mathrm{nend}(s_j)$. If aover > 0 and nover > 0, then $\mathrm{tscore}(s_j, s_i)$ is equal to $\mathrm{score}(s_i)$. Otherwise, we have

\[ \mathrm{tscore}(s_j, s_i) = \mathrm{score}(s_i) - R(\max(-\mathit{aover}, \lfloor -\mathit{nover}/3 \rfloor)). \]

The largest-scoring chains of segment pairs ending at each segment pair are partitioned into equivalence classes by the starting segment pair of the chains (Huang and Miller, 1991; Chao and Miller, 1995). Two chains are in the same class if and only if they begin with the same segment pair. The score of an equivalence class is the maximum score of chains in the class. The equivalence classes of score greater than F can be easily computed along with the matrix H as follows (Huang and Miller, 1991; Chao and Miller, 1995). Let $G(s_i)$ be the first segment pair of a largest-scoring chain ending with segment pair $s_i$. For each segment pair $s_i$, $G(s_i)$ is initialized to $s_i$. When $H(s_i)$ is set to $H(s_j) + \mathrm{tscore}(s_j, s_i) - \mathrm{gap}(s_j, s_i)$, $G(s_i)$ is set to $G(s_j)$.
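The computation of H and G can be sketched as below. This is a hedged illustration rather than the DPS code: the overlap rule in `tscore` (index max(-aover, floor(-nover/3)) into the prefix sums R) is an assumed reading of the text, `min_tscore` stands for the unnamed threshold on tscore, and G is stored as an index into the sorted list.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class SegmentPair:
    nstart: int        # first DNA position of the segment
    nend: int          # last DNA position of the segment
    astart: int        # first protein position of the segment
    aend: int          # last protein position of the segment
    pair_scores: list  # score of each aligned codon/residue pair, in order

    @property
    def score(self):
        return sum(self.pair_scores)


def antis(s):
    return s.nstart + 3 * s.astart


def antid(s):
    return s.nend + 3 * s.aend


def tscore(sj, si, R):
    """Score of the portion of si that does not overlap sj (assumed rule)."""
    aover = si.astart - sj.aend
    nover = si.nstart - sj.nend
    if aover > 0 and nover > 0:
        return si.score
    k = max(-aover, (-nover) // 3)  # index of the last overlapped aligned pair
    return si.score - R[min(k, len(R) - 1)]


def chain_scores(pairs, A, B, q, r, min_tscore=1):
    """Return (pairs sorted by last antidiagonal, H, G)."""
    pairs = sorted(pairs, key=antid)
    H = [s.score for s in pairs]
    G = list(range(len(pairs)))
    for i, si in enumerate(pairs):
        R = list(accumulate(si.pair_scores))  # R[k] = score of first k + 1 pairs
        for j in range(i - 1, -1, -1):
            sj = pairs[j]
            if antid(sj) <= antis(si) - A:
                break  # every earlier pair is at least as far away
            aover = si.astart - sj.aend
            nover = si.nstart - sj.nend
            if aover <= -B or nover <= -3 * B:
                continue  # overlap too large
            t = tscore(sj, si, R)
            penalty = q + r * max(aover, 0)
            if t > min_tscore and H[j] + t - penalty > H[i]:
                H[i] = H[j] + t - penalty
                G[i] = G[j]  # inherit the first segment pair of the chain
    return pairs, H, G
```

Because the list is sorted by last antidiagonal, the inner loop can stop at the first predecessor that violates the A threshold, which is what keeps the computation fast when A is small relative to the sequence length.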
For an equivalence class c, let start(c) be the starting segment pair for the class, let end(c) be the ending segment pair of a largest-scoring chain in the class, and let score(c) be the score of the class. Thus, we have H(end(c)) = score(c). The equivalence classes of score greater than F are saved. After $H(s_i)$ and $G(s_i)$ are computed, we perform one of the two tasks below if $H(s_i)$ is greater than F. If there is an equivalence class c with start(c) = $G(s_i)$, set end(c) to $s_i$ and score(c) to $H(s_i)$ if score(c) < $H(s_i)$. If there is no equivalence class c with start(c) = $G(s_i)$, create a new class c with start(c) = $G(s_i)$, end(c) = $s_i$, and score(c) = $H(s_i)$. After the computation of the equivalence classes is completed, for each saved equivalence class, a largest-scoring chain in the class is obtained by a traceback technique. These largest-scoring chains are nonintersecting. To see this, if two chains were intersecting, that is, they had a common segment pair s, then the two chains would begin with the same segment pair G(s) and hence would belong to the same equivalence class. This contradicts the fact that the two chains are from different equivalence classes. Finally, each segment pair of score greater than F that is not in any chain already computed is reported. This step avoids missing a segment pair of score greater than F.

RESULTS

The algorithm for comparing a DNA sequence to a protein database was implemented as a computer program named DNA-Protein Search (DPS). The program was written in the C programming language. The DPS program compares both strands of the DNA sequence with the protein database. The DPS program was tested on three long DNA sequences: the complete sequence (GSDB Accession L42023, 1,830,137 bp) of the H. influenzae Rd genome (Fleischmann et al., 1995), the complete sequence (GSDB Accession L43967, 580,070 bp) of the M. genitalium genome (Fraser et al., 1995), and the complete sequence (562,638 bp) of S. cerevisiae chromosome VIII (Johnston et al., 1994). The H. influenzae and M. genitalium sequences were obtained via anonymous ftp at ftp.tigr.org, and the S. cerevisiae sequence was obtained via anonymous ftp at genome-ftp.stanford.edu. The annotation for each of the three DNA sequences provides a list of previously identified coding regions of the sequence.

We performed two experiments with DPS on the three DNA sequences. In the first experiment, each of the three sequences was compared by DPS with the Swiss-Prot protein database (Release 32.0), which contains 49,340 sequence entries, comprising 17,385,503 amino acids.
The comparisons found new regions that are similar to protein sequences. The second experiment tested how well DPS performed in finding the known coding regions with matches to the Swiss-Prot database. We also compared the performance of DPS with that of BLASTX on a Caenorhabditis elegans cosmid DNA sequence. All the comparisons were performed on a DEC AlphaServer /250 with 256 megabytes of memory.

The following values were selected for the parameters in the first experiment: the extension distance D = 20, the segment pair score cutoff I = 50, the chain score cutoff F = 150, the antidiagonal distance A = 1000, the protein segment overlap length B = 25, the gap open penalty q = 15, and the gap extension penalty r = 1. The BLOSUM62 matrix was used (Henikoff and Henikoff, 1992). To choose a proper value for the word length parameter W, we measured the speed, memory, and sensitivity of the DPS program on the three DNA sequences for various values of W. The results shown in Table 1 indicate that the word length parameter W has a major effect on the speed, memory, and sensitivity of the DPS program. In Table 1, the large increase in the memory requirement of DPS when W goes from 4 to 5 is due to the size of the lookup table for keeping all protein words of length 5, which is 23^5 = 6,436,343 for a protein alphabet of size 23. Note that the alphabet contains the three extra symbols: B, Z, and X. The use of the value 3 for the word length W achieves a high sensitivity at an acceptable speed. So the value 3 was used for W in the following comparisons.

Table 1. Time, Memory, and Sensitivity of DPS as a Function of the Word Length W

                 Time (min) when W is     Memory (mb) when W is    No. of chains when W is
Query            3       4       5        3       4       5        3        4        5
H. influenzae    [...]   [...]   [...]    [...]   [...]   [...]    13,425   11,955   10,152
M. genitalium    [...]   [...]   [...]    [...]   [...]   [...]    4,240    3,689    3,059
S. cerevisiae    [...]   [...]   [...]    [...]   [...]   [...]    11,974   7,173    3,[...]

Table 2. Regions of H. influenzae Rd Found by DPS

Position(a)   Score   SPs(b)   Swiss-Prot Accession   Description
?                              P24560                 Hypothetical 17.0 kDa protein
                               P46885                 Lytic murein transglycosylase A precursor
                               P44282                 Hypothetical protein HI
                               P14181                 LICA protein
548098?                        P25614                 Very hypothetical 22.8 kDa protein
?                              Q08318                 DNA adenine methylase
598921?                        P35649                 Hypothetical 66.3 kDa protein
599823?                        P35648                 Hemagglutinin
620460?                        P46491                 Hypothetical protein HI0597B
?                              P20343                 Very hypothetical CYSX protein
                               P44836                 Probable TONB-dependent receptor HI
                               Q01996                 Transferrin-binding protein 1 precursor
?                              P15041                 Very hypothetical 17.7 kDa protein
                               P45371                 Hypothetical protein
?                              P03821                 Very hypothetical 10.2 kDa protein
                               P14928                 ABA-inducible protein PHV A
                               P44164                 Hypothetical protein HI
                               P44151                 Hypothetical protein HI
                               P45297                 Hypothetical protein HI
                               P22634                 Glutamate racemase

(a) The orientation of the region is shown by an arrow. A region labeled with a question mark (?) is on the other strand of a previously found coding region.
(b) The number of segment pairs in the chain.

The DPS program computed a total of 13,425 chains of score greater than 150 on the H. influenzae sequence, 4,240 chains on the M. genitalium sequence, and 11,974 chains on the S. cerevisiae sequence. To obtain new regions identified by DPS, the chains for each sequence were filtered by removing any chain whose DNA region overlaps with a previously identified coding region by at least 30%. The filtration was performed by a computer program, which took as input a list of known coding regions and output from DPS. The chains that passed the filter had overlapping DNA regions. To obtain chains whose DNA regions do not have a large overlap, any chain whose DNA region overlaps with the DNA region of another chain of a higher score by at least 70% was further removed. After the second filtration, 51 chains were obtained for the H. influenzae sequence, 31 chains for the M. genitalium sequence, and 36 chains for the S. cerevisiae sequence.

Of the 51 regions of the H. influenzae sequence identified by DPS, 22 regions matched hypothetical protein sequences, and 25 regions matched gene products of H.
influenzae. Fifteen of the new regions have a score greater than [...]. The 51 new regions matched 50 protein entries in the database, 29 of which were created in Releases 31.0 or 32.0, and the rest of which were created in prior releases. Note that the predicted coding regions of the H. influenzae sequence were previously compared with a database of nonredundant bacterial proteins constructed from Release 30.0 of the Swiss-Prot database (Fleischmann et al., 1995). The 51 regions include 19 regions that were found by The Institute for Genomic Research in late 1995, 2 partial repeats, and 10 regions, each of which is adjacent to an annotated region matching the same protein. Those 31 regions were further removed. Table 2 shows the 20 remaining regions of the 51 regions. Note that 9 of the 20 regions are on the other strand of a previously annotated coding region. Two of the 9 regions ([...] and [...]) received large scores of 741 and 843, respectively.

Of the 31 regions of the M. genitalium sequence, 10 regions matched hypothetical protein sequences, and 23 regions matched gene products of M. genitalium. Fifteen regions matched one protein sequence (Swiss-Prot Accession P20796). Eight regions matched another protein sequence (Swiss-Prot Accession P22747), and three of the eight regions have a score greater than [...]. The two proteins are the only ones from M. genitalium. The 31 regions matched 10 protein entries in the database, 4 of which were created in Releases 31.0 or 32.0, and the rest of which were created in prior releases. The M. genitalium sequence was previously analyzed in the same way as the H. influenzae sequence (Fraser et al., 1995). The 23 regions that matched gene products of M. genitalium were previously known as MgPa repeats and were not included in the annotation. Table 3 shows the 8 remaining regions, one of which is on the other strand of a previously annotated coding region.

Table 3. Regions of M. genitalium Found by DPS

Position   Score   SPs   Swiss-Prot Accession   Description
                         P38046                 Nitrate transport ATP-binding protein NRTD
                         P41053                 Hypothetical 17.6 kDa protein
                         P44489                 Excinuclease ABC subunit C
                         P19210                 Formamidopyrimidine-DNA glycosylase
376447?                  P41755                 NAD-specific glutamate dehydrogenase
                         P32583                 Suppressor protein SRP
                         P28009                 Excinuclease ABC subunit A (fragment)
                         P42423                 Hypothetical ABC transporter

Of the 36 identified regions of the S. cerevisiae sequence, 29 regions matched hypothetical protein sequences, and 34 regions matched yeast protein sequences. Six regions matched one yeast protein sequence (Swiss-Prot Accession P40097). Two regions (811-1290 and [...]) from the opposite strands received two large scores of 822 and 828, both matching the same yeast protein sequence (Swiss-Prot Accession P43536). The 36 regions matched 26 protein entries in the database, 6 of which were created in Release 28.0 or prior releases. The cosmid-sized regions of the S. cerevisiae sequence were previously compared by BLASTX with Release 28.0 of the Swiss-Prot database (Johnston et al., 1994). The 36 regions are given in Table 4, where 10 of the 36 regions are on the other strand of a previously annotated coding region.

It should be pointed out that some of the regions reported by DPS might be previously known but not annotated as coding regions because they appear to be nonfunctional. The DPS program found a number of matches on the reverse strand of a previously annotated coding region.
It is not clear what these matches mean. However, the high-scoring matches might be coding regions.

The DPS program is tolerant of frameshifts due to evolutionary differences and sequencing errors. Figure 1 shows a chain of 8 segment pairs between the S. cerevisiae sequence and a protein sequence. The chain has 5 frameshifts. The DPS program is also tolerant of introns in the DNA sequence. Figure 2 shows a chain of 2 segment pairs between the S. cerevisiae sequence and a protein sequence. There is an intervening DNA sequence of 313 bp between the two segment pairs.

The sensitivity of DPS was evaluated by determining how many of the known coding regions with matches to the Swiss-Prot database were found by DPS. Since Release 32.0 of the Swiss-Prot database contains protein sequences derived from the H. influenzae and M. genitalium sequences, we removed from Release 32.0 all entries that were created after Release [...]. The resulting database was compared by DPS with each of the H. influenzae and M. genitalium sequences. Similarly, we removed all entries that were created after Release 28.0 and used the resulting database for comparison with the S. cerevisiae sequence. For each of the three comparisons, the segment pair score cutoff I was set to 30, the chain score cutoff F was set to 70, and the other parameters were set to the same values as in the first experiment. We used the small value for the chain score cutoff F because some known coding regions have only a weak similarity to a Swiss-Prot entry. A known coding region is said to be found by DPS if at least 85% of the region is contained in a region reported by DPS or the region contains at least 85% of a region reported by DPS. This definition was motivated by the observation that some known coding regions only have a local similarity to a Swiss-Prot entry. From the annotation of the H. influenzae sequence, we extracted a total of 467 known coding regions with matches to the Swiss-Prot database.
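The criterion for a known coding region being found can be sketched as a small interval test. The sketch assumes regions are given as inclusive (start, end) DNA coordinates; the function names are illustrative and do not come from the evaluation program used in the study.

```python
def covered_fraction(region, other):
    """Fraction of `region` (inclusive start, end) that lies inside `other`."""
    start = max(region[0], other[0])
    end = min(region[1], other[1])
    shared = max(0, end - start + 1)
    return shared / (region[1] - region[0] + 1)


def is_found(known, reported_regions, cutoff=0.85):
    """True if some reported region covers, or is covered by, the known region."""
    return any(
        covered_fraction(known, r) >= cutoff      # most of the known region is reported
        or covered_fraction(r, known) >= cutoff   # most of the report lies in the known region
        for r in reported_regions
    )
```

The second clause is what lets a short reported region count when a known coding region has only a local similarity to a database entry: `is_found((1000, 1999), [(1200, 1300)])` is true even though the report covers about a tenth of the known region.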
All but 2 of the 467 known coding regions were found by DPS. For the M. genitalium sequence, a total of 131 known coding regions with matches to the Swiss-Prot database were obtained, and all but 4 of the 131 regions were found by DPS. From the annotation and feature table of the S. cerevisiae sequence, we obtained a total of 99 known coding regions with matches to the Swiss-Prot database. The DPS program found all the 99 known coding regions. The 6 regions of the H. influenzae and M. genitalium sequences that were missed by DPS have a weak similarity with the Swiss-Prot database. The DPS program was not able to identify the regions because the alignment of each region and the corresponding protein sequence in the database does not contain an exact word match of 3 amino acids.

Table 4. Regions of S. cerevisiae Chromosome VIII Found by DPS

Position     Score   SPs   Swiss-Prot Accession   Description
811-1290?                  P43536                 Hypothetical 18.6 kDa protein
1968?                      P43537                 Hypothetical 16.5 kDa protein
5081                       P43541                 Hypothetical 17.5 kDa protein
9112-9471?                 P25598                 Hypothetical 23.2 kDa protein
                           P40519                 Hypothetical 15.7 kDa protein
                           P37299                 Ubiquinol-cytochrome C reductase
                           P40097                 Hypothetical 12.5 kDa protein
                           P40097                 Hypothetical 12.5 kDa protein
                           P40097                 Hypothetical 12.5 kDa protein
                           P36032                 Hypothetical 52.3 kDa protein
253119?                    P32583                 Suppressor protein SRP
366029?                    P14328                 Spore coat protein SP
                           P40422                 DNA-directed RNA polymerases
                           P40519                 Hypothetical 15.7 kDa protein
                           P40097                 Hypothetical 12.5 kDa protein
?                          P39558                 Very hypothetical 13.2 kDa protein
?                          P39558                 Very hypothetical 13.2 kDa protein
                           P39559                 Hypothetical 11.1 kDa protein
                           P39711                 Hypothetical 12.8 kDa protein
                           P39561                 Hypothetical 7.6 kDa protein
                           P43552                 Hypothetical 18.1 kDa protein
                           P32768                 Flocculation protein FLO1 precursor
                           P39563                 Hypothetical 11.1 kDa protein
                           P08640                 Glucoamylase S1/S2 precursor
                           P40442                 Hypothetical 99.7 kDa protein
                           P39564                 Hypothetical 18.0 kDa protein
                           P40097                 Hypothetical 12.5 kDa protein
                           P40097                 Hypothetical 12.5 kDa protein
                           P39565                 Hypothetical 10.9 kDa protein
                           P39566                 Hypothetical 11.6 kDa protein
?                          P39568                 Very hypothetical 12.3 kDa protein
                           P43541                 Hypothetical 17.5 kDa protein
                           P39973                 Hypothetical 13.1 kDa protein
?                          P43537                 Hypothetical 16.5 kDa protein
                           P43536                 Hypothetical 18.6 kDa protein
                           P19275                 Viral protein TPX

Start  End  Score  SPs  Accession  Description
Chain                   P36032     Hypothetical 52.3 kd protein

Frame: -1  Score: 69  Identity: 15/23 (65%)
GCTGGTGGAGGTAAATTAGATTATGCACCTATCGGCGGGTTAGCGTTGGGCCGTAGCCTT
AlaGlyGlySerLysLeuAspTyrAlaSerIleGlyGlyLeuAlaPheSerCysGlyLeu
TTGGTCGCT
PhePheAla  98

Frame: -3  Score: 114  Identity: 25/36 (69%)
TTACCACGTATTTTTTCCTTTCAATTTATTATAAGCTTAGAGATACGATTTCAAGGCGCA
LeuTyrHisIlePheSerIleGlnPheIleIleGlyLeuGlyIleLeuPheGlnGlyAla
GCTCTTCTGTTAGCAGTTTTCTCTACGACTTTGTATGAGGTTCATCTC
AlaLeuLeuLeuAlaAlaPheSerValThrLeuTrpGluIleTyrLeu  139

Frame: -3  Score: 63  Identity: 13/19 (68%)
TTAGCTTCTATTCACATACCTACTATAACACTAATCCTGCTATGGTTCAGACCCAAA
LeuAlaPheIlePheIleProSerValThrLeuIleProLeuTrpPheArgAsnLys  167

Frame: -1  Score: 61  Identity: 11/17 (65%)
GATGTGCCCTCAAACTTTATGATCTGGTTTCTTTTTTTATTTATATCGTTT
AspValLeuSerAsnPheAlaValTrpLeuLeuPheGlyPheValSerPhe  259

Frame: -3  Score: 71  Identity: 15/19 (79%)
ATGCTGGGTTACGTTGTTGTTTTATATTCCTTGTCTAGTATTACGGTCAGCAGAGGC
MetLeuGlyTyrValValLeuLeuTyrSerLeuSerAspPheThrValSerLeuGly  279

Frame: -3  Score: 86  Identity: 16/35 (46%)
TATGCATTATATGTGGTTAGCATTGACTCCTTAATAGAACGGCCAGTTATCAATCAAATT
TyrValSerCysMetValSerValGlySerLeuLeuGlyArgProIleValGlyHisIle
GCCTATAAGCATGGATCACTAGCGGCTAGTATTGTATTGCATTTG
AlaAspLysTyrGlySerLeuThrValGlyMetIleLeuHisLeu  321

Frame: -2  Score: 95  Identity: 21/49 (43%)
TTCATTTCCGCTTTTGCCTCAGGTGCAACAATAACTATCTTTGAGCATCGTACCACCAAT
PheMetAlaAlaPheAlaLeuValAlaProIleIleGlyLeuGluLeuArgSerThrAsp
ACAAAAGGATATGATCGTTATCATACAGAATTTTTCATGGATTTTGCATGTTTCGGTATA
ThrAsnGlyAsnAspTyrTyrArgThrAlaIlePheValGlyPheAlaTyrPheGlyVal
ACTTTACGTAGGTGAATTATTGAAGGG
* * *
SerLeuCysGlnTrpLeuLeuArgGly  429

Frame: -3  Score: 75  Identity: 19/36 (53%)
TTATTGAAGGGCTTTGTGATAGGTGTGGATGTACCTGCCGTGCGTTGATGACCTTCAACT
******
LeuLeuArgGlyPheIleIleAlaArgAspGluIleAlaValArgGluAlaTyrSerAla
GATCGGAACAAATTGCTATTACATGTAAAAATCTCATGTATGAGCGAA
AspGlnAsnGluLeuHisLeuAsnValLysLeuSerHisMetSerLys  461

FIG. 1. A chain of 8 segment pairs between the S. cerevisiae sequence and a protein sequence (Swiss-Prot Accession P36032). The chain has 5 frameshifts. Although none of the 8 segment pairs has a score greater than 114, the chain has a score of [...].

Start  End  Score  SPs  Accession  Description
Chain                   P40097     Hypothetical 12.5 kd protein

Frame: 3  Score: 72  Identity: 17/41 (41%)
GTGAATGATTTAATGATTGTTGCGATTTCCTTGTTGGTGAAGGCTATGATATCAGCTATG
ValAsnPheValIleIleIleGlyIleProLeuLeuIleGluAlaSerIleLeuCysIle
CAGAATATACTAGTAGTTATCCACCAGAACATAAGAATCCTCAAAATGTAATTAAAAATC
*** 3 2
GlnAsnIleLeuGluLeuLeuLeuLysGlyIleGlyIleLeuLysPheAsnArgTyrLeu
CAC
His  52

Frame: 1  Score: 124  Identity: 26/53 (49%)
AATCTGTATTTCGACATAAGGTTATTATGATTATTTCTCCTTCCGTTCTATATTTTTCAT
AsnArgTyrLeuHisThrIleIleLeuArgLeuPhePheLeuSerPheTyrMetLeuHis
TACCCTATTACATTGTCAATCCTTGCATTTCTGCTTTCATTAGAATTGATGACTGTTTCT
PheProIleThrLeuSerIleLeuAlaPheGlnLeuProLeuAsnLeuLeuTrfcrLeuSer
CAATGTTTATGCCATCTTCTTACAACTTATTTGACAATA
GlnAlaSerPheHisLeuProArgSerHisMetIleLeu  100

FIG. 2. A chain of 2 segment pairs between the S. cerevisiae sequence and a protein sequence (Swiss-Prot Accession P40097). There is an intervening DNA sequence of 313 bp between the two segment pairs. The chain has 1 frameshift. The second segment pair contains an aligned pair of TGA and Arg. The stop codon TGA is probably due to a sequencing error at its first base.

We also ran the BLASTX program (Gish and States, 1993) on the three DNA sequences. The executable code of the BLASTX program (Version 1.4) for Digital Unix was obtained via anonymous ftp at blast.wustl.edu. Because of its high memory requirement, BLASTX was not able to compare any of the three long DNA sequences with the Swiss-Prot database on the DEC AlphaServer with 256 megabytes of memory. Recall that DPS took at most 26 megabytes of memory on each of the three long DNA sequences. We compared the performance of DPS with that of BLASTX on a short C. elegans DNA sequence of 39,496 bp (GenBank Accession U23484). We selected this C. elegans DNA sequence because it was previously used to obtain BLAST benchmarks on workstations from several vendors. The BLASTX program was used to compare the C. elegans sequence with the Swiss-Prot database (Release 32.0) on the DEC AlphaServer, where the BLOSUM62 matrix was used, the B parameter was set to 5000, and the default values were used for the other BLASTX parameters. The BLASTX program produced a total of 2282 segment pairs of score greater than 70. The BLASTX program took 57.9 min and at least 11 megabytes of memory. On the other hand, DPS produced a total of 5968 chains of score greater than 70 on the C. elegans sequence, where the same values were used for the DPS parameters as in the second experiment. The DPS program took 10.8 min and 1 megabyte of memory.
The DNA segment of each segment pair produced by BLASTX was completely contained in the DNA region of a chain produced by DPS. Of the 5968 chains reported by DPS, 293 chains (5%) do not share more than 30% of their DNA region with any BLASTX segment pair.

DISCUSSION

We have described the DPS program for comparing a DNA sequence to a protein sequence database. The DPS program can handle very large DNA sequences because of its low computer memory requirement. The DPS program requires that every hit be an exact word match. So DPS is slightly less sensitive than BLASTX in computing segment pairs. However, DPS gains its sensitivity by combining segment pairs into chains, where segment pairs can be from different reading frames and there can be an intervening DNA sequence between adjacent segment pairs. The ability to compute a chain of close segment pairs makes it possible for DPS to determine the entire coding region.

The DPS program quickly produces high-scoring chains of segment pairs between a DNA sequence and a database of protein sequences. A chain just shows a partial correspondence between the DNA and protein sequences. To compute an alignment of the corresponding regions of the DNA and protein sequences, the NAP program of Huang and Zhang (1996) can be used. The NAP program computes a global alignment of a DNA sequence and a protein sequence. The NAP program is more sensitive but much slower than the DPS program. We are currently working on the integration of the two programs.

Chao and Miller (1995) developed an efficient algorithm for computing k best nonintersecting chains of segment pairs. The algorithm was used in the SIM2 program to compute nonintersecting local alignments between two very long DNA sequences (Chao et al., 1995). The technique of Chao and Miller is more rigorous than the method used in this article for chaining segment pairs. It would be interesting to see if the technique of Chao and Miller is comparable to our method in speed. Our guess is that our method is faster than the method of Chao and Miller for handling a small number of segment pairs.

An alternative definition of the score of a chain is possible. The score of a chain can be defined to be the maximum sum of the scores of nonoverlapping portions of the segment pairs minus the gap penalties. Additional computational efforts are needed to compute the score of a chain under this alternative definition. We need to determine, for any two adjacent segment pairs s and s' with an overlap in the chain, a break point of the overlap such that the sum of the score of the portion of s immediately before the point and the score of the portion of s' immediately after the point is maximum.

Availability

The source code of DPS is freely available for academic use on the WWW at [...] and via anonymous ftp at cs.mtu.edu in directory /pub/huang. For commercial use, contact the author at huang@cs.mtu.edu.
ACKNOWLEDGMENTS

I would like to thank my colleague Steve Carr for kindly making his DEC AlphaServer available for this research. I also thank the reviewers for suggestions that significantly improved the article. Robert Fleischmann and Clyde Hutchison reviewed the new regions of the H. influenzae Rd and the M. genitalium genomes found by DPS. The work was supported in part by Michigan Research Excellence Fund.

REFERENCES

ALTSCHUL, S.F., GISH, W., MILLER, W., MYERS, E.W., and LIPMAN, D.J. (1990). Basic local alignment search tool. J Mol Biol 215, 403-410.
CHAO, K.-M., and MILLER, W. (1995). Linear-space algorithms that build local alignments from fragments. Algorithmica 13,
CHAO, K.-M., ZHANG, J., OSTELL, J., and MILLER, W. (1995). A local alignment tool for very long DNA sequences. Comput Appl Biosci 11,
FLEISCHMANN, R.D., ADAMS, M.D., WHITE, O., CLAYTON, R.A., KIRKNESS, E.F., KERLAVAGE, A.R., et al. (1995). Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269,
FRASER, C.M., GOCAYNE, J.D., WHITE, O., ADAMS, M.D., CLAYTON, R.A., FLEISCHMANN, R.D., et al. (1995). The minimal gene complement of Mycoplasma genitalium. Science 270, 397-403.
GISH, W., and STATES, D.J. (1993). Identification of protein coding regions by database similarity search. Nature Genet 3,
HENIKOFF, S., and HENIKOFF, J.G. (1992). Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89,
HUANG, X., and MILLER, W. (1991). A time-efficient, linear-space local similarity algorithm. Adv Appl Math 12,
HUANG, X., and ZHANG, J. (1996). Methods for comparing a DNA sequence with a protein sequence. Comput Appl Biosci In press.
JOHNSTON, M., ANDREWS, S., BRINKMAN, R., COOPER, J., DING, H., DOVER, J., et al. (1994). Complete nucleotide sequence of Saccharomyces cerevisiae chromosome VIII. Science 265,
PEARSON, W.R., and LIPMAN, D. (1988). Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85,
WILBUR, W.J., and LIPMAN, D.J. (1983). Rapid similarity searches of nucleic acid and protein data banks. Proc Natl Acad Sci USA 80,

Address reprint requests to:
Xiaoqiu Huang
Department of Computer Science
Michigan Technological University
1400 Townsend Drive
Houghton, MI

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

Single alignment: Substitution Matrix. 16 march 2017

Single alignment: Substitution Matrix. 16 march 2017 Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block

More information

Sequence Database Search Techniques I: Blast and PatternHunter tools

Sequence Database Search Techniques I: Blast and PatternHunter tools Sequence Database Search Techniques I: Blast and PatternHunter tools Zhang Louxin National University of Singapore Outline. Database search 2. BLAST (and filtration technique) 3. PatternHunter (empowered

More information

Basic Local Alignment Search Tool

Basic Local Alignment Search Tool Basic Local Alignment Search Tool Alignments used to uncover homologies between sequences combined with phylogenetic studies o can determine orthologous and paralogous relationships Local Alignment uses

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2, Grundlagen der Bioinformatik, SS 08, D. Huson, May 2, 2008 39 5 Blast This lecture is based on the following, which are all recommended reading: R. Merkl, S. Waack: Bioinformatik Interaktiv. Chapter 11.4-11.7

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1 Tiffany Samaroo MB&B 452a December 8, 2003 Take Home Final Topic 1 Prior to 1970, protein and DNA sequence alignment was limited to visual comparison. This was a very tedious process; even proteins with

More information

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Homology Modeling. Roberto Lins EPFL - summer semester 2005 Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,

More information

BLAST. Varieties of BLAST

BLAST. Varieties of BLAST BLAST Basic Local Alignment Search Tool (1990) Altschul, Gish, Miller, Myers, & Lipman Uses short-cuts or heuristics to improve search speed Like speed-reading, does not examine every nucleotide of database

More information

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT 3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

More information

Sequence Alignment Techniques and Their Uses

Sequence Alignment Techniques and Their Uses Sequence Alignment Techniques and Their Uses Sarah Fiorentino Since rapid sequencing technology and whole genomes sequencing, the amount of sequence information has grown exponentially. With all of this

More information

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm Alignment scoring schemes and theory: substitution matrices and gap models 1 Local sequence alignments Local sequence alignments are necessary

More information

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm Lecture 2, 12/3/2003: Introduction to sequence alignment The Needleman-Wunsch algorithm for global sequence alignment: description and properties Local alignment the Smith-Waterman algorithm 1 Computational

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

Small RNA in rice genome

Small RNA in rice genome Vol. 45 No. 5 SCIENCE IN CHINA (Series C) October 2002 Small RNA in rice genome WANG Kai ( 1, ZHU Xiaopeng ( 2, ZHONG Lan ( 1,3 & CHEN Runsheng ( 1,2 1. Beijing Genomics Institute/Center of Genomics and

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

BLAST: Basic Local Alignment Search Tool

BLAST: Basic Local Alignment Search Tool .. CSC 448 Bioinformatics Algorithms Alexander Dekhtyar.. (Rapid) Local Sequence Alignment BLAST BLAST: Basic Local Alignment Search Tool BLAST is a family of rapid approximate local alignment algorithms[2].

More information

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB Homology Modeling (Comparative Structure Modeling) Aims of Structural Genomics High-throughput 3D structure determination and analysis To determine or predict the 3D structures of all the proteins encoded

More information

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55 Pairwise Alignment Guan-Shieng Huang shieng@ncnu.edu.tw Dept. of CSIE, NCNU Pairwise Alignment p.1/55 Approach 1. Problem definition 2. Computational method (algorithms) 3. Complexity and performance Pairwise

More information

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST

More information

Lecture 1, 31/10/2001: Introduction to sequence alignment. The Needleman-Wunsch algorithm for global sequence alignment: description and properties

Lecture 1, 31/10/2001: Introduction to sequence alignment. The Needleman-Wunsch algorithm for global sequence alignment: description and properties Lecture 1, 31/10/2001: Introduction to sequence alignment The Needleman-Wunsch algorithm for global sequence alignment: description and properties 1 Computational sequence-analysis The major goal of computational

More information

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

More information

Pairwise sequence alignment

Pairwise sequence alignment Department of Evolutionary Biology Example Alignment between very similar human alpha- and beta globins: GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

More information

Local Alignment Statistics

Local Alignment Statistics Local Alignment Statistics Stephen Altschul National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, MD Central Issues in Biological Sequence Comparison

More information

Optimization of a New Score Function for the Detection of Remote Homologs

Optimization of a New Score Function for the Detection of Remote Homologs PROTEINS: Structure, Function, and Genetics 41:498 503 (2000) Optimization of a New Score Function for the Detection of Remote Homologs Maricel Kann, 1 Bin Qian, 2 and Richard A. Goldstein 1,2 * 1 Department

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 05: Index-based alignment algorithms Slides adapted from Dr. Shaojie Zhang (University of Central Florida) Real applications of alignment Database search

More information

Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics Tools and Algorithms in Bioinformatics GCBA815, Fall 2013 Week3: Blast Algorithm, theory and practice Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and Systems Biology

More information

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME MATHEMATICAL MODELING AND THE HUMAN GENOME Hilary S. Booth Australian National University, Australia Keywords: Human genome, DNA, bioinformatics, sequence analysis, evolution. Contents 1. Introduction:

More information

chapter 5 the mammalian cell entry 1 (mce1) operon of Mycobacterium Ieprae and Mycobacterium tuberculosis

chapter 5 the mammalian cell entry 1 (mce1) operon of Mycobacterium Ieprae and Mycobacterium tuberculosis chapter 5 the mammalian cell entry 1 (mce1) operon of Mycobacterium Ieprae and Mycobacterium tuberculosis chapter 5 Harald G. Wiker, Eric Spierings, Marc A. B. Kolkman, Tom H. M. Ottenhoff, and Morten

More information

Bioinformatics and BLAST

Bioinformatics and BLAST Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists

More information

Whole Genome Alignments and Synteny Maps

Whole Genome Alignments and Synteny Maps Whole Genome Alignments and Synteny Maps IINTRODUCTION It was not until closely related organism genomes have been sequenced that people start to think about aligning genomes and chromosomes instead of

More information

SEQUENCE alignment is an underlying application in the

SEQUENCE alignment is an underlying application in the 194 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 8, NO. 1, JANUARY/FEBRUARY 2011 Pairwise Statistical Significance of Local Sequence Alignment Using Sequence-Specific and Position-Specific

More information

Genomics and bioinformatics summary. Finding genes -- computer searches

Genomics and bioinformatics summary. Finding genes -- computer searches Genomics and bioinformatics summary 1. Gene finding: computer searches, cdnas, ESTs, 2. Microarrays 3. Use BLAST to find homologous sequences 4. Multiple sequence alignments (MSAs) 5. Trees quantify sequence

More information

Reducing storage requirements for biological sequence comparison

Reducing storage requirements for biological sequence comparison Bioinformatics Advance Access published July 15, 2004 Bioinfor matics Oxford University Press 2004; all rights reserved. Reducing storage requirements for biological sequence comparison Michael Roberts,

More information

Chapter 17. From Gene to Protein. Biology Kevin Dees

Chapter 17. From Gene to Protein. Biology Kevin Dees Chapter 17 From Gene to Protein DNA The information molecule Sequences of bases is a code DNA organized in to chromosomes Chromosomes are organized into genes What do the genes actually say??? Reflecting

More information

Chapter 7: Rapid alignment methods: FASTA and BLAST

Chapter 7: Rapid alignment methods: FASTA and BLAST Chapter 7: Rapid alignment methods: FASTA and BLAST The biological problem Search strategies FASTA BLAST Introduction to bioinformatics, Autumn 2007 117 BLAST: Basic Local Alignment Search Tool BLAST (Altschul

More information

Fundamentals of database searching

Fundamentals of database searching Fundamentals of database searching Aligning novel sequences with previously characterized genes or proteins provides important insights into their common attributes and evolutionary origins. The principles

More information

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010 BLAST Database Searching BME 110: CompBio Tools Todd Lowe April 8, 2010 Admin Reading: Read chapter 7, and the NCBI Blast Guide and tutorial http://www.ncbi.nlm.nih.gov/blast/why.shtml Read Chapter 8 for

More information

BLAST: Target frequencies and information content Dannie Durand

BLAST: Target frequencies and information content Dannie Durand Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences

More information

BIOINFORMATICS: An Introduction

BIOINFORMATICS: An Introduction BIOINFORMATICS: An Introduction What is Bioinformatics? The term was first coined in 1988 by Dr. Hwa Lim The original definition was : a collective term for data compilation, organisation, analysis and

More information

O 3 O 4 O 5. q 3. q 4. Transition

O 3 O 4 O 5. q 3. q 4. Transition Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

More information

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid. 1. A change that makes a polypeptide defective has been discovered in its amino acid sequence. The normal and defective amino acid sequences are shown below. Researchers are attempting to reproduce the

More information

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) Alignment principles and homology searching using (PSI-)BLAST Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) http://ibivu.cs.vu.nl Bioinformatics Nothing in Biology makes sense except in

More information

Variable-length Intervals in Homology Search

Variable-length Intervals in Homology Search Variable-length Intervals in Homology Search Abhijit Chattaraj Hugh E. Williams School of Computer Science and Information Technology RMIT University, GPO Box 2476V Melbourne, Australia {abhijit,hugh}@cs.rmit.edu.au

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S1 (box). Supplementary Methods description. Prokaryotic Genome Database Archaeal and bacterial genome sequences were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)

More information

Effective Cluster-Based Seed Design for Cross-Species Sequence Comparisons

Effective Cluster-Based Seed Design for Cross-Species Sequence Comparisons Effective Cluster-Based Seed Design for Cross-Species Sequence Comparisons Leming Zhou and Liliana Florea 1 Methods Supplementary Materials 1.1 Cluster-based seed design 1. Determine Homologous Genes.

More information

CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004

CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004 CSE 397-497: Computational Issues in Molecular Biology Lecture 6 Spring 2004-1 - Topics for today Based on premise that algorithms we've studied are too slow: Faster method for global comparison when sequences

More information

Pairwise alignment. 2.1 Introduction GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL

Pairwise alignment. 2.1 Introduction GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL 2 Pairwise alignment 2.1 Introduction The most basic sequence analysis task is to ask if two sequences are related. This is usually done by first aligning the sequences (or parts of them) and then deciding

More information

Large-Scale Genomic Surveys

Large-Scale Genomic Surveys Bioinformatics Subtopics Fold Recognition Secondary Structure Prediction Docking & Drug Design Protein Geometry Protein Flexibility Homology Modeling Sequence Alignment Structure Classification Gene Prediction

More information

BME 5742 Biosystems Modeling and Control

BME 5742 Biosystems Modeling and Control BME 5742 Biosystems Modeling and Control Lecture 24 Unregulated Gene Expression Model Dr. Zvi Roth (FAU) 1 The genetic material inside a cell, encoded in its DNA, governs the response of a cell to various

More information

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and

More information

A profile-based protein sequence alignment algorithm for a domain clustering database

A profile-based protein sequence alignment algorithm for a domain clustering database A profile-based protein sequence alignment algorithm for a domain clustering database Lin Xu,2 Fa Zhang and Zhiyong Liu 3, Key Laboratory of Computer System and architecture, the Institute of Computing

More information

2 Genome evolution: gene fusion versus gene fission

2 Genome evolution: gene fusion versus gene fission 2 Genome evolution: gene fusion versus gene fission Berend Snel, Peer Bork and Martijn A. Huynen Trends in Genetics 16 (2000) 9-11 13 Chapter 2 Introduction With the advent of complete genome sequencing,

More information

Alignment Strategies for Large Scale Genome Alignments

Alignment Strategies for Large Scale Genome Alignments Alignment Strategies for Large Scale Genome Alignments CSHL Computational Genomics 9 November 2003 Algorithms for Biological Sequence Comparison algorithm value scoring gap time calculated matrix penalty

More information

Bioinformatics Chapter 1. Introduction

Bioinformatics Chapter 1. Introduction Bioinformatics Chapter 1. Introduction Outline! Biological Data in Digital Symbol Sequences! Genomes Diversity, Size, and Structure! Proteins and Proteomes! On the Information Content of Biological Sequences!

More information

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

More information

Comparative genomics: Overview & Tools + MUMmer algorithm

Comparative genomics: Overview & Tools + MUMmer algorithm Comparative genomics: Overview & Tools + MUMmer algorithm Urmila Kulkarni-Kale Bioinformatics Centre University of Pune, Pune 411 007. urmila@bioinfo.ernet.in Genome sequence: Fact file 1995: The first

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

Exercise 5. Sequence Profiles & BLAST

Exercise 5. Sequence Profiles & BLAST Exercise 5 Sequence Profiles & BLAST 1 Substitution Matrix (BLOSUM62) Likelihood to substitute one amino acid with another Figure taken from https://en.wikipedia.org/wiki/blosum 2 Substitution Matrix (BLOSUM62)

More information

Alignment & BLAST. By: Hadi Mozafari KUMS

Alignment & BLAST. By: Hadi Mozafari KUMS Alignment & BLAST By: Hadi Mozafari KUMS SIMILARITY - ALIGNMENT Comparison of primary DNA or protein sequences to other primary or secondary sequences Expecting that the function of the similar sequence

More information

Bioinformatics for Biologists

Bioinformatics for Biologists Bioinformatics for Biologists Sequence Analysis: Part I. Pairwise alignment and database searching Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute Bioinformatics Definitions The use of computational

More information

Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics Tools and Algorithms in Bioinformatics GCBA815, Fall 2015 Week-4 BLAST Algorithm Continued Multiple Sequence Alignment Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and

More information

Biologically significant sequence alignments using Boltzmann probabilities

Biologically significant sequence alignments using Boltzmann probabilities Biologically significant sequence alignments using Boltzmann probabilities P. Clote Department of Biology, Boston College Gasson Hall 416, Chestnut Hill MA 02467 clote@bc.edu May 7, 2003 Abstract In this

More information

Supplemental Materials

Supplemental Materials JOURNAL OF MICROBIOLOGY & BIOLOGY EDUCATION, May 2013, p. 107-109 DOI: http://dx.doi.org/10.1128/jmbe.v14i1.496 Supplemental Materials for Engaging Students in a Bioinformatics Activity to Introduce Gene

More information

Designing and Testing a New DNA Fragment Assembler VEDA-2

Designing and Testing a New DNA Fragment Assembler VEDA-2 Designing and Testing a New DNA Fragment Assembler VEDA-2 Mark K. Goldberg Darren T. Lim Rensselaer Polytechnic Institute Computer Science Department {goldberg, limd}@cs.rpi.edu Abstract We present VEDA-2,

More information

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013 Sequence Alignments Dynamic programming approaches, scoring, and significance Lucy Skrabanek ICB, WMC January 31, 213 Sequence alignment Compare two (or more) sequences to: Find regions of conservation

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S3 (box) Methods Methods Genome weighting The currently available collection of archaeal and bacterial genomes has a highly biased distribution of isolates across taxa. For example,

More information

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting. Genome Annotation Bioinformatics and Computational Biology Genome Annotation Frank Oliver Glöckner 1 Genome Analysis Roadmap Genome sequencing Assembly Gene prediction Protein targeting trna prediction

More information

Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p

Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p.110-114 Arrangement of information in DNA----- requirements for RNA Common arrangement of protein-coding genes in prokaryotes=

More information

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega BLAST Multiple Sequence Alignments: Clustal Omega What does basic BLAST do (e.g. what is input sequence and how does BLAST look for matches?) Susan Parrish McDaniel College Multiple Sequence Alignments

More information

Characterization of New Proteins Found by Analysis of Short Open Reading Frames from the Full Yeast Genome

Characterization of New Proteins Found by Analysis of Short Open Reading Frames from the Full Yeast Genome YEAST VOL. 13: 1363 1374 (1997) Characterization of New Proteins Found by Analysis of Short Open Reading Frames from the Full Yeast Genome MIGUEL A. ANDRADE 1 *, ANTOINE DARUVAR 1, GEORG CASARI 2, REINHARD

More information

On the optimality of the standard genetic code: the role of stop codons

On the optimality of the standard genetic code: the role of stop codons On the optimality of the standard genetic code: the role of stop codons Sergey Naumenko 1*, Andrew Podlazov 1, Mikhail Burtsev 1,2, George Malinetsky 1 1 Department of Non-linear Dynamics, Keldysh Institute

More information

Sequence alignment methods. Pairwise alignment. The universe of biological sequence analysis

Sequence alignment methods. Pairwise alignment. The universe of biological sequence analysis he universe of biological sequence analysis Word/pattern recognition- Identification of restriction enzyme cleavage sites Sequence alignment methods PstI he universe of biological sequence analysis - prediction

More information

Related Courses He who asks is a fool for five minutes, but he who does not ask remains a fool forever.

Related Courses He who asks is a fool for five minutes, but he who does not ask remains a fool forever. CSE 527 Computational Biology http://www.cs.washington.edu/527 Lecture 1: Overview & Bio Review Autumn 2004 Larry Ruzzo Related Courses He who asks is a fool for five minutes, but he who does not ask remains

More information

Computational Biology

Computational Biology Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

More information

Annotation of Drosophila grimashawi Contig12

Annotation of Drosophila grimashawi Contig12 Annotation of Drosophila grimashawi Contig12 Marshall Strother April 27, 2009 Contents 1 Overview 3 2 Genes 3 2.1 Genscan Feature 12.4............................................. 3 2.1.1 Genome Browser:

More information

Introduction to Bioinformatics Online Course: IBT

Introduction to Bioinformatics Online Course: IBT Introduction to Bioinformatics Online Course: IBT Multiple Sequence Alignment Building Multiple Sequence Alignment Lec1 Building a Multiple Sequence Alignment Learning Outcomes 1- Understanding Why multiple

More information

The Minimal-Gene-Set -Kapil PHY498BIO, HW 3

The Minimal-Gene-Set -Kapil PHY498BIO, HW 3 The Minimal-Gene-Set -Kapil Rajaraman(rajaramn@uiuc.edu) PHY498BIO, HW 3 The number of genes in organisms varies from around 480 (for parasitic bacterium Mycoplasma genitalium) to the order of 100,000

More information

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming 20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, 2008 4 Pairwise alignment We will discuss: 1. Strings 2. Dot matrix method for comparing sequences 3. Edit distance 4. Global and local alignment

More information

Cellular Neuroanatomy I The Prototypical Neuron: Soma. Reading: BCP Chapter 2

Cellular Neuroanatomy I The Prototypical Neuron: Soma. Reading: BCP Chapter 2 Cellular Neuroanatomy I The Prototypical Neuron: Soma Reading: BCP Chapter 2 Functional Unit of the Nervous System The functional unit of the nervous system is the neuron. Neurons are cells specialized

More information
