MICROBIAL & COMPARATIVE GENOMICS
Volume 1, Number 4, 1996
Mary Ann Liebert, Inc.

Fast Comparison of a DNA Sequence with a Protein Sequence Database

XIAOQIU HUANG

ABSTRACT

We describe a computer program, named DNA-Protein Search (DPS), for comparing a megabase DNA sequence with a protein sequence database. The DPS program addresses the problems of frameshifts and introns in the DNA sequence. The DPS program was used to compare each of the following sequences with the Swiss-Prot database: the 1.8-megabase sequence of the Haemophilus influenzae Rd genome, the 0.58-megabase sequence of the Mycoplasma genitalium genome, and the 0.56-megabase sequence of Saccharomyces cerevisiae chromosome VIII. The comparisons found new regions that are similar to protein sequences. The sensitivity of DPS was evaluated using as test data the known coding regions of the three DNA sequences. The results demonstrate that the DPS program is a useful tool for finding the coding regions of the DNA sequence. The DPS program uses an order of magnitude less computer memory and is several times faster than the BLASTX program.

INTRODUCTION

Database search programs are one of the most important tools for analysis of DNA and protein sequences (Pearson and Lipman, 1988; Altschul et al., 1990). One type of database search is to compare a newly determined DNA sequence with a protein database in order to find the coding regions of the DNA sequence (Johnston et al., 1994; Fleischmann et al., 1995; Fraser et al., 1995). Existing database search programs translate the DNA sequence in all six reading frames and rapidly compare each translated sequence with each protein sequence in the database. The TFASTA program of Pearson and Lipman (1988) computes a high-scoring local alignment between the translated sequence and the protein sequence. The BLASTX program computes high-scoring gap-free local alignments (maximal segment pairs) between the translated sequence and the protein sequence (Gish and States, 1993).
Frameshifts and introns in the DNA sequence cause the existing programs to produce several small alignments instead of one large alignment. It is possible that the large alignment is significant, whereas each of its pieces is not. We describe an approach to comparing a DNA sequence with a protein database. The approach enhances the existing methods by addressing the problems of frameshifts and introns. Our approach computes high-scoring chains of segment pairs, where segment pairs in a chain can be from different reading frames, and there can be an intervening DNA sequence between adjacent segment pairs in a chain.

Our method has been implemented as a portable computer program named DNA-Protein Search (DPS). The DPS program was used to compare each of the following sequences with the Swiss-Prot database: the 1.8-megabase sequence of the Haemophilus influenzae Rd genome (Fleischmann et al., 1995), the 0.58-megabase sequence of the Mycoplasma genitalium genome (Fraser et al., 1995), and the 0.56-megabase sequence of Saccharomyces cerevisiae chromosome VIII (Johnston et al., 1994). The comparisons found new regions that are similar to protein sequences. The sensitivity of DPS was evaluated using as test data the known coding regions of the three DNA sequences. The results demonstrate that the DPS program is a useful tool for finding the coding regions of the DNA sequence. The DPS program uses an order of magnitude less computer memory and is several times faster than the BLASTX program.

Author affiliation: Department of Computer Science, Michigan Technological University, Houghton, Michigan.

MATERIALS AND METHODS

Chains of segment pairs

A segment pair between a DNA sequence and a protein sequence is a gap-free alignment of two segments of the sequences, where the length of the DNA segment is exactly three times the length of the protein segment, and each residue of the protein segment is aligned with a codon of the DNA segment. An aligned pair of codon and protein residue is a match if the codon codes for the residue. We begin with a matrix that gives each pair of protein residues a similarity score. Let g(a) denote the amino acid that is coded for by a nonstop codon a. The score of an aligned pair of nonstop codon a and protein residue b is the score of the pair of protein residues g(a) and b. The score of an aligned pair of stop codon and protein residue is the minimum score of pairs of protein residues. The score of a segment pair is the sum of the similarity scores of the aligned pairs in the segment pair. For a segment pair s, let nstart(s) and nend(s) denote the starting and ending positions of the DNA segment in the DNA sequence, let astart(s) and aend(s) denote the starting and ending positions of the protein segment in the protein sequence, and let score(s) denote the score of s.
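The scoring rule for a segment pair can be sketched in Python as follows. This is a minimal illustration and not the DPS source: the match and mismatch values are a toy stand-in for a real similarity matrix such as BLOSUM62, and the names `GENETIC_CODE`, `pair_score`, and `segment_pair_score` are hypothetical.

```python
from itertools import product

# Standard genetic code in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Toy similarity matrix (assumption): +4 for identical residues, -1 otherwise.
MATCH, MISMATCH = 4, -1
MIN_SCORE = MISMATCH  # the minimum entry of the toy matrix


def residue_score(a, b):
    """Similarity score of two protein residues under the toy matrix."""
    return MATCH if a == b else MISMATCH


def pair_score(codon, residue):
    """Score of one aligned (codon, residue) pair.

    A stop codon receives the minimum score of the matrix; any other
    codon a is scored as the residue pair (g(a), residue).
    """
    amino = GENETIC_CODE[codon]
    if amino == "*":
        return MIN_SCORE
    return residue_score(amino, residue)


def segment_pair_score(dna_segment, protein_segment):
    """Sum of the pair scores of a gap-free DNA/protein segment pair."""
    if len(dna_segment) != 3 * len(protein_segment):
        raise ValueError("DNA segment must be three times the protein length")
    return sum(
        pair_score(dna_segment[3 * i:3 * i + 3], protein_segment[i])
        for i in range(len(protein_segment))
    )
```

For example, `segment_pair_score("ATGGCT", "MA")` scores two matching pairs, while `segment_pair_score("ATGTGA", "MW")` charges the minimum score for the stop codon TGA.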
The first antidiagonal of a segment pair s is defined to be

$$\mathrm{antis}(s) = \mathrm{nstart}(s) + 3 \times \mathrm{astart}(s),$$

and the last antidiagonal of s is defined to be

$$\mathrm{antid}(s) = \mathrm{nend}(s) + 3 \times \mathrm{aend}(s),$$

where the protein position is scaled up by a factor of 3, since an amino acid corresponds to three nucleotides. A chain of segment pairs is a list of segment pairs in increasing order of their last antidiagonal such that each segment pair is not far from its predecessor and adjacent segment pairs do not have a large overlap. Specifically, any two adjacent segment pairs s and s' in the list satisfy the requirement

$$\mathrm{antis}(s') - \mathrm{antid}(s) < A, \quad \mathrm{astart}(s') > \mathrm{aend}(s) - B, \quad \text{and} \quad \mathrm{nstart}(s') > \mathrm{nend}(s) - 3B$$

for some nonnegative integers A and B. Let close(s,s') denote the condition given above. A chain of segment pairs is used as an abstract representation of a local alignment between the DNA and protein sequences, with the segment pairs being ungapped portions of the alignment. Note that the use of the A threshold permits efficient computation of high-scoring chains. However, it prevents a chain from representing a local alignment involving introns of length greater than A + 3B. To solve this problem, a large value should be used for the A parameter if the DNA sequence is expected to contain long introns.

Because the region of the DNA sequence between two adjacent segment pairs can be an intron, we do not charge any gap penalty for the region. However, we charge a linear penalty for the region of the protein sequence between two adjacent segment pairs. For some nonnegative integers q and r, the penalty for connecting two segment pairs s and s' is

$$\mathrm{gap}(s, s') = q + r \times l(\mathrm{astart}(s') - \mathrm{aend}(s)),$$

where l(x) = x if x > 0 and 0 otherwise. Under this gap scoring scheme, a DNA gap is not penalized even if it is not an intron. For two adjacent segment pairs s and s' in a chain, define tscore(s,s') to be the score of the longest portion of s' that has no overlap with s.
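The antidiagonal, closeness, and gap definitions can be checked on concrete numbers. The sketch below assumes the closeness test takes the form antis(s') - antid(s) < A, astart(s') > aend(s) - B, and nstart(s') > nend(s) - 3B, and that the gap penalty is q + r × l(astart(s') - aend(s)). The `SegmentPair` class and function names are hypothetical; the coordinates and scores are those of the two segment pairs reported in Figure 2, and A, B, q, r are the values of the first experiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentPair:
    nstart: int  # first DNA position
    nend: int    # last DNA position
    astart: int  # first protein position
    aend: int    # last protein position
    score: int


def antis(s):
    """First antidiagonal: protein position scaled by 3."""
    return s.nstart + 3 * s.astart


def antid(s):
    """Last antidiagonal."""
    return s.nend + 3 * s.aend


def close(s, t, A, B):
    """Assumed closeness test for t following s in a chain."""
    return (antis(t) - antid(s) < A
            and t.astart > s.aend - B
            and t.nstart > s.nend - 3 * B)


def gap(s, t, q, r):
    """Linear penalty on the protein region between s and t only."""
    return q + r * max(t.astart - s.aend, 0)


# The two segment pairs of Figure 2 (Swiss-Prot Accession P40097).
first = SegmentPair(133566, 133688, 12, 52, 72)
second = SegmentPair(134002, 134160, 48, 100, 124)

print(antis(second) - antid(first))      # antidiagonal distance
print(close(first, second, 1000, 25))    # within A = 1000, B = 25
print(gap(first, second, 15, 1))         # protein segments overlap, so only q
```

The 313-bp intervening DNA region contributes nothing to the penalty; only the open penalty q is charged because the protein segments overlap.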
The score of a chain c of segment pairs $s_1, \ldots, s_m$ is defined to be

$$\mathrm{score}(c) = \mathrm{score}(s_1) + \sum_{i=2}^{m} \left[ \mathrm{tscore}(s_{i-1}, s_i) - \mathrm{gap}(s_{i-1}, s_i) \right].$$

To ensure that each segment pair contributes to the chain, we require that for any two adjacent segment pairs s and s' in the chain, tscore(s,s') be greater than a threshold. Two segment pairs s and s' are identical if astart(s) = astart(s') and nstart(s) = nstart(s'). Two chains of segment pairs are nonintersecting if they do not have any common segment pair (Chao and Miller, 1995). Note that two segment pairs s and s' that have a DNA region in common but different protein regions are not identical. So the chain that consists only of the segment pair s and the chain that consists only of the segment pair s' are nonintersecting.

Fast computation of chains of segment pairs

We describe a rapid method for comparing the DNA sequence with each protein sequence in the database. The segment pairs of score greater than a threshold I between the DNA sequence and a protein sequence are approximately computed using a hashing technique. A lookup table is constructed for the DNA sequence such that for each protein word of length W, the table provides the positions of all the regions of the DNA sequence that exactly code for the protein word. The value for W is usually between 3 and 5. For each position p of the protein sequence, the lookup table is used to locate the regions of the DNA sequence that code for the protein word of length W at position p. Each hit in the DNA sequence is extended in each direction until the score drops a distance D below the maximum score found so far (Altschul et al., 1990). If a hit is contained in a segment pair already considered, the hit is not extended. The segment pair of the maximum score found during the extension is saved if the score is greater than I.

After the computation of segment pairs between the DNA and the protein sequences, nonintersecting chains of segment pairs with score greater than a threshold F are computed. Let $s_1, \ldots, s_n$ be a list of all the segment pairs in increasing order of their last antidiagonal. Note that the DNA segments in the segment pairs can be from different reading frames of the DNA sequence.
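The hashing and extension steps described above can be sketched as follows. This is an illustrative reading, not the DPS implementation: it indexes only the forward strand, uses 0-based positions, scores with a toy +4/-1 matrix instead of BLOSUM62, assumes the DNA contains only A, C, G, and T, and omits the rule that skips hits inside an already considered segment pair. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def pair_score(codon, residue):
    """Toy score: +4 if the codon codes for the residue, -1 otherwise."""
    return 4 if CODE[codon] == residue else -1


def build_lookup(dna, w):
    """Map each protein word of length w to the DNA positions coding for it."""
    table = defaultdict(list)
    for n in range(len(dna) - 3 * w + 1):
        word = "".join(CODE[dna[n + 3 * k:n + 3 * k + 3]] for k in range(w))
        if "*" not in word:
            table[word].append(n)
    return table


def find_hits(table, protein, w):
    """Yield (DNA position, protein position) for every exact word match."""
    for p in range(len(protein) - w + 1):
        for n in table.get(protein[p:p + w], ()):
            yield n, p


def extend(dna, protein, n, p, w, drop):
    """Extend a hit both ways until the score falls more than `drop` below the best.

    Returns (nstart, nend, astart, aend, score) of the best segment pair found.
    """
    score = sum(pair_score(dna[n + 3 * k:n + 3 * k + 3], protein[p + k])
                for k in range(w))
    best_r = run = right = 0
    k = 0
    while n + 3 * (w + k) + 3 <= len(dna) and p + w + k < len(protein):
        run += pair_score(dna[n + 3 * (w + k):n + 3 * (w + k) + 3],
                          protein[p + w + k])
        k += 1
        if run > best_r:
            best_r, right = run, k
        elif best_r - run > drop:
            break
    best_l = run = left = 0
    k = 1
    while n - 3 * k >= 0 and p - k >= 0:
        run += pair_score(dna[n - 3 * k:n - 3 * k + 3], protein[p - k])
        if run > best_l:
            best_l, left = run, k
        elif best_l - run > drop:
            break
        k += 1
    return (n - 3 * left, n + 3 * (w + right) - 1,
            p - left, p + w + right - 1, score + best_l + best_r)


dna = "ATGGCTAAATGGCCC"   # codes for MAKWP in frame 0
table = build_lookup(dna, 3)
hits = list(find_hits(table, "MAKWP", 3))
segment = extend(dna, "MAKWP", 3, 1, 3, drop=20)
```

Because one pass over the DNA indexes every offset, all three forward reading frames are covered by the same table; a full implementation would also index the reverse complement.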
Let $H(s_i)$ be the maximum score of chains ending with segment pair $s_i$. The matrix H is computed using the technique of Wilbur and Lipman (1983):

$$H(s_1) = \mathrm{score}(s_1),$$

$$H(s_i) = \max\{\mathrm{score}(s_i),\ H(s_j) + \mathrm{tscore}(s_j, s_i) - \mathrm{gap}(s_j, s_i) : 1 \le j < i,\ \mathrm{close}(s_j, s_i),\ \text{and}\ \mathrm{tscore}(s_j, s_i) > 1\} \quad \text{for } i > 1.$$

For segment pairs $s_j$ and $s_i$ with j < i, if the overlap thresholds A and B are violated or the score of the nonoverlapping portion of $s_i$ is not large enough, then $s_j$ is excluded from consideration as an immediate predecessor to $s_i$ in any chain. To compute $H(s_i)$, we just need to use each $s_j$ in decreasing value of j such that $\mathrm{antid}(s_j) > \mathrm{antis}(s_i) - A$. To compute $\mathrm{tscore}(s_j, s_i)$ efficiently for each $s_j$, we precompute an array R of size B for $s_i$, where for $0 \le k < B$, R(k) is the sum of the scores of the first k + 1 aligned pairs in $s_i$ if there are at least k + 1 aligned pairs in $s_i$ and $\mathrm{score}(s_i)$ otherwise. Then for each $s_j$, let aover denote $\mathrm{astart}(s_i) - \mathrm{aend}(s_j)$, and let nover denote $\mathrm{nstart}(s_i) - \mathrm{nend}(s_j)$. If aover > 0 and nover > 0, then $\mathrm{tscore}(s_j, s_i)$ is equal to $\mathrm{score}(s_i)$. Otherwise, we have

$$\mathrm{tscore}(s_j, s_i) = \mathrm{score}(s_i) - R(\max(-\mathrm{aover}, \lfloor -\mathrm{nover}/3 \rfloor)).$$

The largest-scoring chains of segment pairs ending at each segment pair are partitioned into equivalence classes by the starting segment pair of the chains (Huang and Miller, 1991; Chao and Miller, 1995). Two chains are in the same class if and only if they begin with the same segment pair. The score of an equivalence class is the maximum score of chains in the class. The equivalence classes of score greater than F can be easily computed along with the matrix H as follows (Huang and Miller, 1991; Chao and Miller, 1995). Let $G(s_i)$ be the first segment pair of a largest-scoring chain ending with segment pair $s_i$. For each segment pair $s_i$, $G(s_i)$ is initialized to $s_i$. When $H(s_i)$ is set to $H(s_j) + \mathrm{tscore}(s_j, s_i) - \mathrm{gap}(s_j, s_i)$, $G(s_i)$ is set to $G(s_j)$.
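The recurrence for H and the bookkeeping for G can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the DPS code: the closeness test and the index passed to R are written in the forms antis(s') - antid(s) < A and max(-aover, floor(-nover/3)); the per-pair scores are stored on each segment pair instead of in a precomputed array R; `min_tscore` is a hypothetical name for the tscore threshold; and the toy segment pairs are invented.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SegmentPair:
    nstart: int
    nend: int
    astart: int
    aend: int
    pair_scores: Tuple[int, ...]  # score of each aligned codon/residue pair

    @property
    def score(self):
        return sum(self.pair_scores)


def antis(s):
    return s.nstart + 3 * s.astart


def antid(s):
    return s.nend + 3 * s.aend


def close(s, t, A, B):
    return (antis(t) - antid(s) < A and t.astart > s.aend - B
            and t.nstart > s.nend - 3 * B)


def gap(s, t, q, r):
    return q + r * max(t.astart - s.aend, 0)


def tscore(s, t):
    """Score of the part of t that does not overlap s."""
    aover = t.astart - s.aend
    nover = t.nstart - s.nend
    if aover > 0 and nover > 0:
        return t.score
    k = max(-aover, (-nover) // 3)          # index of last overlapped pair
    return t.score - sum(t.pair_scores[:k + 1])


def best_chains(pairs, A, B, q, r, F, min_tscore=1):
    """Return [(score, [segment pairs])], one chain per class scoring above F."""
    sps = sorted(pairs, key=antid)
    H, G, pred = [], [], []
    classes = {}                             # start index -> (score, end index)
    for i, si in enumerate(sps):
        h, g, p = si.score, i, None
        for j in range(i - 1, -1, -1):
            sj = sps[j]
            if antid(sj) <= antis(si) - A:   # all earlier pairs are too far
                break
            if not close(sj, si, A, B):
                continue
            t = tscore(sj, si)
            if t <= min_tscore:
                continue
            cand = H[j] + t - gap(sj, si, q, r)
            if cand > h:
                h, g, p = cand, G[j], j
        H.append(h); G.append(g); pred.append(p)
        if h > F and (g not in classes or classes[g][0] < h):
            classes[g] = (h, i)
    chains = []
    for score, end in classes.values():      # traceback from each class end
        path, i = [], end
        while i is not None:
            path.append(sps[i])
            i = pred[i]
        chains.append((score, path[::-1]))
    return chains


s1 = SegmentPair(1, 30, 1, 10, (5,) * 10)
s2 = SegmentPair(41, 70, 9, 18, (4,) * 10)       # different frame, 2 residues overlap
s3 = SegmentPair(5000, 5029, 50, 59, (6,) * 10)  # too far away to join
result = best_chains([s3, s1, s2], A=1000, B=25, q=15, r=1, F=55)
```

In this toy input, s2 loses the score of its two overlapped pairs and pays the open penalty, so the chain s1, s2 scores 50 + 32 - 15; s3 forms a chain of its own.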
For an equivalence class c, let start(c) be the starting segment pair for the class, let end(c) be the ending segment pair of a largest-scoring chain in the class, and let score(c) be the score of the class. Thus, we have H(end(c)) = score(c). The equivalence classes of score greater than F are saved. After $H(s_i)$ and $G(s_i)$ are computed, we perform one of the two tasks below if $H(s_i)$ is greater than F. If there is an equivalence class c with start(c) = $G(s_i)$, set end(c) to $s_i$ and score(c) to $H(s_i)$ if score(c) < $H(s_i)$. If there is no equivalence class c with start(c) = $G(s_i)$, create a new class c with start(c) = $G(s_i)$, end(c) = $s_i$, and score(c) = $H(s_i)$.

After the computation of the equivalence classes is completed, for each saved equivalence class, a largest-scoring chain in the class is obtained by a traceback technique. These largest-scoring chains are nonintersecting. To see this, if two chains were intersecting, that is, they had a common segment pair s, then the two chains would begin with the same segment pair G(s) and hence would belong to the same equivalence class. This contradicts the fact that the two chains are from different equivalence classes. Finally, each segment pair of score greater than F that is not in any chain already computed is reported. This step avoids missing a segment pair of score greater than F.

RESULTS

The algorithm for comparing a DNA sequence to a protein database was implemented as a computer program named DNA-Protein Search (DPS). The program was written in the C programming language. The DPS program compares both strands of the DNA sequence with the protein database. The DPS program was tested on three long DNA sequences: the complete sequence (GSDB Accession L42023, 1,830,137 bp) of the H. influenzae Rd genome (Fleischmann et al., 1995), the complete sequence (GSDB Accession L43967, 580,070 bp) of the M. genitalium genome (Fraser et al., 1995), and the complete sequence (562,638 bp) of S. cerevisiae chromosome VIII (Johnston et al., 1994). The H. influenzae and M. genitalium sequences were obtained via anonymous ftp at ftp.tigr.org, and the S. cerevisiae sequence was obtained via anonymous ftp at genome-ftp.stanford.edu. The annotation for each of the three DNA sequences provides a list of previously identified coding regions of the sequence.

We performed two experiments with DPS on the three DNA sequences. In the first experiment, each of the three sequences was compared by DPS with the Swiss-Prot protein database (Release 32.0), which contains 49,340 sequence entries, comprising 17,385,503 amino acids.
The comparisons found new regions that are similar to protein sequences. The second experiment tested how well DPS performed in finding the known coding regions with matches to the Swiss-Prot database. We also compared the performance of DPS with that of BLASTX on a Caenorhabditis elegans cosmid DNA sequence. All the comparisons were performed on a DEC AlphaServer 2100 5/250 with 256 megabytes of memory.

The following values were selected for the parameters in the first experiment: the extension distance D = 20, the segment pair score cutoff I = 50, the chain score cutoff F = 150, the antidiagonal distance A = 1000, the protein segment overlap length B = 25, the gap open penalty q = 15, and the gap extension penalty r = 1. The BLOSUM62 matrix was used (Henikoff and Henikoff, 1992). To choose a proper value for the word length parameter W, we measured the speed, memory, and sensitivity of the DPS program on the three DNA sequences for various values of W. The results shown in Table 1 indicate that the word length parameter W has a major effect on the speed, memory, and sensitivity of the DPS program. In Table 1, the large increase in the memory requirement of DPS when W goes from 4 to 5 is due to the size of the lookup table for keeping all protein words of length 5, which is $23^5$ = 6,436,343 for a protein alphabet of size 23. Note that the alphabet contains the three extra symbols: B, Z, and X. The use of the value 3 for the word length W achieves a high sensitivity at an acceptable speed. So the value 3 was used for W in the following comparisons.

Table 1. Time, Memory, and Sensitivity of DPS as a Function of the Word Length W

| Query | Time (min), W = 3 | Time (min), W = 4 | Time (min), W = 5 | Memory (MB), W = 3 | Memory (MB), W = 4 | Memory (MB), W = 5 | No. of chains, W = 3 | No. of chains, W = 4 | No. of chains, W = 5 |
|---|---|---|---|---|---|---|---|---|---|
| H. influenzae | 661.1 | 41.6 | 4.6 | 26 | 28 | 99 | 13,425 | 11,955 | 10,152 |
| M. genitalium | 160.8 | 11.7 | 2.8 | 8 | 11 | 82 | 4,240 | 3,689 | 3,059 |
| S. cerevisiae | 161.0 | 11.9 | 2.8 | 8 | 11 | 82 | 11,974 | 7,173 | 3,969 |

The DPS program computed a total of 13,425 chains of score greater than 150 on the H. influenzae sequence, 4,240 chains on the M. genitalium sequence, and 11,974 chains on the S. cerevisiae sequence. To obtain new regions identified by DPS, the chains for each sequence were filtered by removing any chain whose DNA region overlaps with a previously identified coding region by at least 30%. The filtration was performed by a computer program, which took as input a list of known coding regions and the output from DPS. The chains that passed the filter had overlapping DNA regions. To obtain chains whose DNA regions do not have a large overlap, any chain whose DNA region overlaps with the DNA region of another chain of a higher score by at least 70% was further removed. After the second filtration, 51 chains were obtained for the H. influenzae sequence, 31 chains for the M. genitalium sequence, and 36 chains for the S. cerevisiae sequence.

Of the 51 regions of the H. influenzae sequence identified by DPS, 22 regions matched hypothetical protein sequences, and 25 regions matched gene products of H. influenzae. Fifteen of the new regions have a score greater than 1000. The 51 new regions matched 50 protein entries in the database, 29 of which were created in Releases 31.0 or 32.0, and the rest of which were created in prior releases. Note that the predicted coding regions of the H. influenzae sequence were previously compared with a database of nonredundant bacterial proteins constructed from Release 30.0 of the Swiss-Prot database (Fleischmann et al., 1995). The 51 regions include 19 regions that were found by The Institute for Genomic Research in late 1995, 2 partial repeats, and 10 regions, each of which is adjacent to an annotated region matching the same protein. Those 31 regions were further removed. Table 2 shows the 20 remaining regions of the 51 regions. Note that 9 of the 20 regions are on the other strand of a previously annotated coding region. Two of the 9 regions (597200 → 598921 and 619981 → 620460) received large scores of 741 and 843, respectively.

Table 2. Regions of H. influenzae Rd Found by DPS

| Position (a) | Orientation (a) | Other strand (a) | Score | SPs (b) | Swiss-Prot Accession | Description |
|---|---|---|---|---|---|---|
| 96837, 97163 | | ? | 228 | 1 | P24560 | Hypothetical 17.0 kDa protein |
| 131468, 132367 | → | | 794 | 6 | P46885 | Lytic murein transglycosylase A precursor |
| 239079, 239357 | | | 368 | 2 | P44282 | Hypothetical protein HI1651 |
| 379504, 379650 | | | 198 | 1 | P14181 | LICA protein |
| 547814, 548098 | → | ? | 165 | 2 | P25614 | Very hypothetical 22.8 kDa protein |
| 584609, 584782 | → | ? | 164 | 1 | Q08318 | DNA adenine methylase |
| 597200, 598921 | → | ? | 741 | 5 | P35649 | Hypothetical 66.3 kDa protein |
| 599126, 599823 | → | ? | 302 | 3 | P35648 | Hemagglutinin 2 |
| 619981, 620460 | → | ? | 843 | 1 | P46491 | Hypothetical protein HI0597B |
| 636500, 636874 | ← | ? | 227 | 1 | P20343 | Very hypothetical CYSX protein |
| 677134, 677289 | | | 267 | 1 | P44836 | Probable TONB-dependent receptor HI0712 |
| 705296, 705814 | | | 285 | 3 | Q01996 | Transferrin-binding protein 1 precursor |
| 801494, 801925 | → | ? | 202 | 2 | P15041 | Very hypothetical 17.7 kDa protein |
| 1255486, 1255720 | ← | | 151 | 2 | P45371 | Hypothetical protein |
| 1272474, 1272722 | | ? | 190 | 1 | P03821 | Very hypothetical 10.2 kDa protein |
| 1544947, 1545246 | | | 151 | 1 | P14928 | ABA-inducible protein PHV A1 |
| 1545529, 1546020 | | | 848 | 1 | P44164 | Hypothetical protein HI1338 |
| 1716767, 1716979 | → | | 280 | 1 | P44151 | Hypothetical protein HI1280 |
| 1717878, 1719322 | | | 2417 | 2 | P45297 | Hypothetical protein HI1653 |
| 1823141, 1823933 | | | 638 | 3 | P22634 | Glutamate racemase |

(a) The orientation of the region is shown by an arrow (→ or ←). A region labeled with a question mark (?) is on the other strand of a previously found coding region. The orientation entry is left empty where the arrow is not legible in the source.
(b) The number of segment pairs in the chain.

Of the 31 regions of the M. genitalium sequence, 10 regions matched hypothetical protein sequences, and 23 regions matched gene products of M. genitalium. Fifteen regions matched one protein sequence (Swiss-Prot Accession P20796). Eight regions matched another protein sequence (Swiss-Prot Accession P22747), and three of the eight regions have a score greater than 1700. The two proteins are the only ones from M. genitalium.
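The two filtration steps can be sketched as below. The text does not say which length the overlap percentage is measured against; this sketch assumes it is the length of the chain being tested, and it reads the second filter literally (a chain is dropped if any higher-scoring chain covers at least 70% of it). Chains are represented as hypothetical `(start, end, score)` tuples with made-up values.

```python
def overlap_fraction(region, other):
    """Fraction of `region` (start, end) that is covered by `other`."""
    start, end = region
    o_start, o_end = other
    shared = min(end, o_end) - max(start, o_start) + 1
    return max(shared, 0) / (end - start + 1)


def drop_known(chains, known_regions, limit=0.30):
    """Remove chains overlapping any known coding region by at least `limit`."""
    return [c for c in chains
            if all(overlap_fraction(c[:2], k) < limit for k in known_regions)]


def drop_redundant(chains, limit=0.70):
    """Remove chains overlapping a higher-scoring chain by at least `limit`."""
    kept = []
    for c in chains:
        dominated = any(o[2] > c[2] and overlap_fraction(c[:2], o[:2]) >= limit
                        for o in chains)
        if not dominated:
            kept.append(c)
    return kept


known = [(100, 200)]
chains = [(150, 250, 300), (300, 400, 200), (310, 400, 180), (390, 500, 160)]
novel = drop_known(chains, known)      # the first chain overlaps a known region
final = drop_redundant(novel)          # (310, 400) lies inside a better chain
```

Measuring the overlap against the chain's own length means a short chain nested inside a long annotated gene is removed, while a long chain that merely touches a short gene is kept; the other convention would behave differently.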
The 31 regions matched 10 protein entries in the database, 4 of which were created in Releases 31.0 or 32.0, and the rest of which were created in prior releases. The M. genitalium sequence was previously analyzed in the same way as the H. influenzae sequence (Fraser et al., 1995). The 23 regions that matched gene products of M. genitalium were previously known as MgPa repeats and were not included in the annotation. Table 3 shows the 8 remaining regions, one of which is on the other strand of a previously annotated coding region.

Table 3. Regions of M. genitalium Found by DPS

| Position | Orientation | Other strand | Score | SPs | Swiss-Prot Accession | Description |
|---|---|---|---|---|---|---|
| 49898, 50059 | → | | 153 | 1 | P38046 | Nitrate transport ATP-binding protein NRTD |
| 142095, 142379 | | | 189 | 1 | P41053 | Hypothetical 17.6 kDa protein |
| 245647, 246087 | → | | 208 | 3 | P44489 | Excinuclease ABC subunit C |
| 319205, 320020 | → | | 397 | 4 | P19210 | Formamidopyrimidine-DNA glycosylase |
| 375251, 376447 | → | ? | 224 | 3 | P41755 | NAD-specific glutamate dehydrogenase |
| 429135, 430055 | → | | 172 | 3 | P32583 | Suppressor protein SRP40 |
| 526634, 526984 | | | 454 | 1 | P28009 | Excinuclease ABC subunit A (fragment) |
| 576415, 576981 | | | 351 | 2 | P42423 | Hypothetical ABC transporter |

The columns and the question mark have the same meaning as in Table 2. The orientation entry is left empty where the arrow is not legible in the source.

Of the 36 identified regions of the S. cerevisiae sequence, 29 regions matched hypothetical protein sequences, and 34 regions matched yeast protein sequences. Six regions matched one yeast protein sequence (Swiss-Prot Accession P40097). Two regions (811 → 1290 and 561197 ← 561676) from the opposite strands received two large scores of 822 and 828, both matching the same yeast protein sequence (Swiss-Prot Accession P43536). The 36 regions matched 26 protein entries in the database, 6 of which were created in Release 28.0 or prior releases. The cosmid-sized regions of the S. cerevisiae sequence were previously compared by BLASTX with Release 28.0 of the Swiss-Prot database (Johnston et al., 1994). The 36 regions are given in Table 4, where 10 of the 36 regions are on the other strand of a previously annotated coding region.
It should be pointed out that some of the regions reported by DPS might be previously known but not annotated as coding regions because they appear to be nonfunctional. The DPS program found a number of matches on the reverse strand of a previously annotated coding region. It is not clear what these matches mean. However, the high-scoring matches might be coding regions.

The DPS program is tolerant of frameshifts due to evolutionary differences and sequencing errors. Figure 1 shows a chain of 8 segment pairs between the S. cerevisiae sequence and a protein sequence. The chain has 5 frameshifts. The DPS program is also tolerant of introns in the DNA sequence. Figure 2 shows a chain of 2 segment pairs between the S. cerevisiae sequence and a protein sequence. There is an intervening DNA sequence of 313 bp between the two segment pairs.

The sensitivity of DPS was evaluated by determining how many of the known coding regions with matches to the Swiss-Prot database were found by DPS. Since Release 32.0 of the Swiss-Prot database contains protein sequences derived from the H. influenzae and M. genitalium sequences, we removed from Release 32.0 all entries that were created after Release 30.0. The resulting database was compared by DPS with each of the H. influenzae and M. genitalium sequences. Similarly, we removed all entries that were created after Release 28.0 and used the resulting database for comparison with the S. cerevisiae sequence. For each of the three comparisons, the segment pair score cutoff I was set to 30, the chain score cutoff F was set to 70, and the other parameters were set to the same values as in the first experiment. We used the small value for the chain score cutoff F because some known coding regions have a weak similarity to a Swiss-Prot entry. A known coding region is said to be found by DPS if at least 85% of the region is contained in a region reported by DPS or the region contains at least 85% of a region reported by DPS.
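The 85% criterion is easy to misread, so a small sketch may help. It treats regions as inclusive `(start, end)` intervals on the DNA sequence; the function name `is_found` and the sample coordinates are hypothetical.

```python
def is_found(known, reported_regions, threshold=0.85):
    """True if `known` satisfies the 85% criterion against any reported region.

    Either at least `threshold` of the known region lies inside a reported
    region, or the known region contains at least `threshold` of a reported
    region.
    """
    k_start, k_end = known
    k_len = k_end - k_start + 1
    for r_start, r_end in reported_regions:
        shared = max(min(k_end, r_end) - max(k_start, r_start) + 1, 0)
        r_len = r_end - r_start + 1
        if shared >= threshold * k_len or shared >= threshold * r_len:
            return True
    return False


# 91 of the 100 known positions are covered by the reported region.
print(is_found((100, 199), [(50, 190)]))
# Only 81 known positions are covered, but they are 89% of the reported region.
print(is_found((100, 199), [(90, 180)]))
# A long reported region that covers half of the known region fails both tests.
print(is_found((100, 199), [(150, 400)]))
```

The second clause is what lets a known coding region count as found when only a local part of it is similar to a database entry.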
This definition was motivated by the observation that some known coding regions only have a local similarity to a Swiss-Prot entry. From the annotation of the H. influenzae sequence, we extracted a total of 467 known coding regions with matches to the Swiss-Prot database. All but 2 of the 467 known coding regions were found by DPS. For the M. genitalium sequence, a total of 131 known coding regions with matches to the Swiss-Prot data- 286

FAST SEQUENCE COMPARISON base were obtained, and all but 4 of the 131 regions were found by DPS. From the annotation and feature table of the S. cerevisiae sequence, we obtained a total of 99 known coding regions with matches to the Swiss-Prot database. The DPS program found all the 99 known coding regions. The 6 regions of the H. influenzae and M. genitalium sequences that were missed by DPS have a weak similarity with the Swiss-Prot database. The DPS program was not able to identify the regions because the alignment of each region and the corresponding protein sequence in the database does not contain an exact word match of 3 amino acids. We also ran the BLASTX program (Gish and States, 1993) on the three DNA sequences. The executable code of the BLASTX program (Version 1.4) for Digital Unix was obtained via anonymous ftp at blast.wustl.edu. Because of its high memory requirement, BLASTX was not able to compare any of the three long DNA sequences with the Swiss-Prot database on the DEC AlphaServer with 256 megabytes of memory. Recall that DPS took at most 26 megabytes of memory on each of the three long DNA sequences. We compared the performance of DPS with that of BLASTX on a short C. elegans DNA sequence of 39,496 Table 4. Position Score SPs Regions of S. Cerevisiae Chromosome VIII Found by DPS Swiss-Prot Accession Description 811^1290? 822 1 P43536 Hypothetical 18.6 kda protein 1594 -» 1968? 374 1 P43537 Hypothetical 16.5 kda protein 5081^.5492 529 2 P43541 Hypothetical 17.5 kda protein 9112-* 9471? 190 1 P25598 Hypothetical 23.2 kda protein 9176491919 251 1 P40519 Hypothetical 15.7 kda protein 107888 -* 108112 408 1 P37299 Ubiquinol-cytochrome C reductase 116409^116667 227 2 P40097 Hypothetical 12.5 kda protein 133566^134160 171 2 P40097 Hypothetical 12.5 kda protein 133720 134002 243 2 P40097 Hypothetical 12.5 kda protein 202922 203964 357 8 P36032 Hypothetical 52.3 kda protein 252799^253119? 212 2 P32583 Suppressor protein SRP40 365361^366029? 
176 2 P14328 Spore coat protein SP96 387233 -> 387442 354 1 P40422 DNA-directed RNA polymerases 389179 389337 262 1 P40519 Hypothetical 15.7 kda protein 466614466860 215 2 P40097 Hypothetical 12.5 kda protein 527765528238? 340 2 P39558 Very hypothetical 13.2 kda protein 527882528163? 437 1 P39558 Very hypothetical 13.2 kda protein 528967 -* 529249 400 3 P39559 Hypothetical 11.1 kda protein 530757531073 486 2 P39711 Hypothetical 12.8 kda protein 538737^538937 365 1 P39561 Hypothetical 7.6 kda protein 538746-» 539163 227 4 P43552 Hypothetical 18.1 kda protein 539791 -* 539922 168 1 P32768 Flocculation protein flol precursor 540795^541091 533 1 P39563 Hypothetical 11.1 kda protein 541751^543481 182 2 P08640 Glucoamylase SI/S2 precursor 542807^543388 668 1 P40442 Hypothetical 99.7 kda protein 543003^543485 834 1 P39564 Hypothetical 18.0 kda protein 543588^>543858 235 2 P40097 Hypothetical 12.5 kda protein 549293^>549554 233 2 P40097 Hypothetical 12.5 kda protein 550646 550936 499 1 P39565 Hypothetical 10.9 kda protein 551198 551494 511 1 P39566 Hypothetical 11.6 kda protein 555189555506? 496 1 P39568 Very hypothetical 12.3 kda protein 556121556572 442 3 P43541 Hypothetical 17.5 kda protein 557145557376 313 2 P39973 Hypothetical 13.1 kda protein 560381560893? 349 1 P43537 Hypothetical 16.5 kda protein 561197561676 828 1 P43536 Hypothetical 18.6 kda protein 562465562636 153 2 P19275 Viral protein TPX 287

Start End Score SPs Accession Description Chain 203964 202922 357 8 P36032 Hypothetical 52.3 kd protein Frame: -1 Score: 69 Identity: 15/23 (65%) 2039 64 GCTGGTGGAGGTAAATTAGATTATGCACCTATCGGCGGGTTAGCGTTGGGCCGTAGCCTT 2 03 905 7 6 AlaGlyGlySerLysLeuAspTyrAlaSerlleGlyGlyLeuAlaPheSerCysGlyLeu 95 203904 TTGGTCGCT 203896 96 PhePheAla 98 Frame: -3 Score: 114 Identity: 25/36 (69%) 2038 90 TTACCACGTATTTTTTCCTTTCAATTTATTATAAGCTTAGAGATACGATTTCAAGGCGCA 203831 104 LeuTyrHisIlePheSerlleGlnPhellelleGlyLeuGlylleLeuPheGlnGlyAla 12 3 203830 GCTCTTCTGTTAGCAGTTTTCTCTACGACTTTGTATGAGGTTCATCTC 2037 83 124 AlaLeuLeuLeuAlaAlaPheSerValThrLeuTrpGluIleTyrLeu 139 Frame: -3 Score: 63 Identity: 13/19 (68%) 2037 7 6 TTAGCTTCTATTCACATACCTACTATAACACTAATCCTGCTATGGTTCAGACCCAAA 203720 149 LeuAlaPhellePhelleProSerValThrLeuIleProLeuTrpPheArgAsnLys 167 Frame: -1 Score: 61 Identity: 11/17 (65%) 203 532 GATGTGCCCTCAAACTTTATGATCTGGTTTCTTTTTTTATTTATATCGTTT 2034 82 243 AspValLeuSerAsnPheAlaValTrpLeuLeuPheGlyPheValSerPhe 259 Frame: -3 Score: 71 Identity: 15/19 (79%) 203479 ATGCTGGGTTACGTTGTTGTTTTATATTCCTTGTCTAGTATTACGGTCAGCAGAGGC 2 03 423 261 MetLeuGlyTyrValValLeuLeuTyrSerLeuSerAspPheThrValSerLeuGly 279 Frame: -3 Score: 86 Identity: 16/35 (46%) 203413 TATGCATTATATGTGGTTAGCATTGACTCCTTAATAGAACGGCCAGTTATCAATCAAATT 2033 54 2 87 TyrValSerCysMetValSerValGlySerLeuLeuGlyArgProIleValGlyHisIle 30 6 2033 53 GCCTATAAGCATGGATCACTAGCGGCTAGTATTGTATTGCATTTG 2033 09 307 AlaAspLysTyrGlySerLeuThrValGlyMetlleLeuHisLeu 321 Frame: -2 Score: 95 Identity: 21/49 (43%) 203165 TTCATTTCCGCTTTTGCCTCAGGTGCAACAATAACTATCTTTGAGCATCGTACCACCAAT 203106 3 81 PheMetAlaAlaPheAlaLeuValAlaProIlelleGlyLeuGluLeuArgSerThrAsp 400 203105 ACAAAAGGATATGATCGTTATCATACAGAATTTTTCATGGATTTTGCATGTTTCGGTATA 20304 6 4 01 ThrAsnGlyAsnAspTyrTyrArgThrAlallePheValGlyPheAlaTyrPheGlyVal 42 0 20304 5 ACTTTACGTAGGTGAATTATTGAAGGG 203019... * * * 421... 
SerLeuCysGlnTrpLeuLeuArgGly 429 Frame: -3 Score: 75 Identity: 19/36 (53%) 203029 TTATTGAAGGGCTTTGTGATAGGTGTGGATGTACCTGCCGTGCGTTGATGACCTTCAACT 202970.........******... 426 LeuLeuArgGlyPhellelleAlaArgAspGluIleAlaValArgGluAlaTyrSerAla 44 5 202969 GATCGGAACAAATTGCTATTACATGTAAAAATCTCATGTATGAGCGAA 202 922 446 AspGlnAsnGluLeuHisLeuAsnValLysLeuSerHisMetSerLys 461 FIG. 1. A chain of 8 segment pairs between the 5. cerevisiae sequence and a protein sequence (Swiss-Prot Accession P36032). The chain has 5 frameshifts. Although none of the 8 segment pairs has a score greater than 114, the chain has a score of 357. 288

FAST SEQUENCE COMPARISON Start End Score SPs Accession Description Chain 133566 134160 171 2 P40097 Hypothetical 12.5 kd protein Frame: 3 Score: 72 Identity: 17/41 (41%) 13 3 5 66 GTGAATGATTTAATGATTGTTGCGATTTCCTTGTTGGTGAAGGCTATGATATCAGCTATG 13 3 62 5 12 ValAsnPheValllellelleGlylleProLeuLeuIleGluAlaSerlleLeuCysIle 31 133626 CAGAATATACTAGTAGTTATCCACCAGAACATAAGAATCCTCAAAATGTAATTAAAAATC 133685..... *** 3 2 GlnAsnlleLeuGluLeuLeuLeuLysGlylleGlylleLeuLysPheAsnArgTyrLeu 51 133686 CAC 133688 52 His 52 Frame: 1 Score: 124 Identity: 26/53 (49%) 13 40 02 AATCTGTATTTCGACATAAGGTTATTATGATTATTTCTCCTTCCGTTCTATATTTTTCAT 13 40 61 4 8 AsnArgTyrLeuHisThrllelleLeuArgLeuPhePheLeuSerPheTyrMetLeuHis 67 13 4062 TACCCTATTACATTGTCAATCCTTGCATTTCTGCTTTCATTAGAATTGATGACTGTTTCT 13 4121 68 PheProIleThrLeuSerlleLeuAlaPheGlnLeuProLeuAsnLeuLeuTrfcrLeuSer 87 13 4122 CAATGTTTATGCCATCTTCTTACAACTTATTTGACAATA 13 4160 88 GlnAlaSerPheHisLeuProArgSerHisMetlleLeu 100 FIG. 2. A chain of 2 segment pairs between the S. cerevisiae sequence and a protein sequence (Swiss-Prot Accession P40097). There is an intervening DNA sequence of 313 bp between the two segment pairs. The chain has 1 frameshift. The second segment pair contains an aligned pair of TGA and Arg. The stop codon TGA is probably due to a sequencing error at its first base. bp (GenBank Accession U23484). We selected this C. elegans DNA sequence because it was previously used to obtain BLAST benchmarks on workstations from several vendors. The BLASTX program was used to compare the C. elegans sequence with the Swiss-Prot database (Release 32.0) on the DEC AlphaServer, where the BLOSUM62 matrix was used, the B parameter was set to 5000, and the default values were used for the other BLASTX parameters. The BLASTX program produced a total of 2282 segment pairs of score greater than 70. The BLASTX program took 57.9 min and at least 11 megabytes of memory. On the other hand, DPS produced a total of 5968 chains of score greater than 70 on the C. 
elegans sequence, where the same values were used for the DPS parameters as in the second experiment. The DPS program took 10.8 min and used 1 megabyte of memory. The DNA segment of each segment pair produced by BLASTX was completely contained in the DNA region of a chain produced by DPS. Of the 5968 chains reported by DPS, 293 chains (5%) share no more than 30% of their DNA region with any BLASTX segment pair.

DISCUSSION

We have described the DPS program for comparing a DNA sequence to a protein sequence database. The DPS program can handle very large DNA sequences because of its low computer memory requirement. The DPS program requires that every hit be an exact word match, so DPS is slightly less sensitive than BLASTX in computing segment pairs. However, DPS regains sensitivity by combining segment pairs into chains, where segment pairs can be from different reading frames and there can be an intervening DNA sequence between adjacent segment pairs. The ability to compute a chain of close segment pairs makes it possible for DPS to determine the entire coding region.

The DPS program quickly produces high-scoring chains of segment pairs between a DNA sequence and a database of protein sequences. A chain shows only a partial correspondence between the DNA and protein sequences. To compute an alignment of the corresponding regions of the DNA and protein sequences, the NAP program of Huang and Zhang (1996) can be used. The NAP program computes a global alignment of a DNA sequence and a protein sequence. The NAP program is more sensitive but much slower than the DPS program. We are currently working on the integration of the two programs.

Chao and Miller (1995) developed an efficient algorithm for computing k best nonintersecting chains of segment pairs. The algorithm was used in the SIM2 program to compute nonintersecting local alignments between two very long DNA sequences (Chao et al., 1995). The technique of Chao and Miller is more rigorous than the method used in this article for chaining segment pairs. It would be interesting to see whether the technique of Chao and Miller is comparable to our method in speed. Our guess is that our method is faster than the method of Chao and Miller when the number of segment pairs is small.

An alternative definition of the score of a chain is possible. The score of a chain can be defined to be the maximum sum of the scores of nonoverlapping portions of the segment pairs minus the gap penalties. Additional computation is needed to obtain the score of a chain under this alternative definition. We need to determine, for any two adjacent segment pairs s and s' with an overlap in the chain, a break point of the overlap such that the sum of the score of the portion of s immediately before the point and the score of the portion of s' immediately after the point is maximum.

Availability

The source code of DPS is freely available for academic use on the WWW at http://www.cs.mtu.edu/faculty/huang.html and via anonymous ftp at cs.mtu.edu in directory /pub/huang. For commercial use, contact the author at huang@cs.mtu.edu.
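The break-point search described in the Discussion for the alternative chain score can be illustrated with a minimal Python sketch. It is not taken from the DPS source code. It assumes, for illustration only, that the overlap between two adjacent segment pairs s and s' has already been reduced to two equal-length lists of per-column substitution scores, one list for each segment pair over the shared positions. The function name best_break_point and this input representation are hypothetical.

```python
from itertools import accumulate


def best_break_point(s_overlap, t_overlap):
    """Pick a break point k inside the overlap of two adjacent segment pairs.

    s_overlap[i] is the score of the first segment pair (s) at overlap
    column i; t_overlap[i] is the score of the second segment pair (s')
    at the same column.  Columns [0, k) are taken from s and columns
    [k, m) from s'.  Returns (k, combined score of the overlap region).
    This input model is an assumption made for illustration.
    """
    if len(s_overlap) != len(t_overlap):
        raise ValueError("overlap score lists must have equal length")
    m = len(s_overlap)
    # prefix_s[k] = score of the part of s before the break point
    prefix_s = [0] + list(accumulate(s_overlap))
    # prefix_t[k] = score of the part of s' before the break point
    prefix_t = [0] + list(accumulate(t_overlap))
    total_t = prefix_t[m]
    best_k, best_score = 0, total_t  # k = 0: the whole overlap goes to s'
    for k in range(1, m + 1):
        score = prefix_s[k] + (total_t - prefix_t[k])
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


if __name__ == "__main__":
    # Hypothetical column scores for a 5-column overlap.
    print(best_break_point([4, 5, -2, -3, 1], [-1, -2, 6, 4, 2]))
```

With prefix sums the search is linear in the length of the overlap. Under the alternative definition, the contribution of the two segment pairs to the chain score would then be the score of s before the overlap, plus the returned value, plus the score of s' after the overlap, minus the gap penalty between them.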
ACKNOWLEDGMENTS

I would like to thank my colleague Steve Carr for kindly making his DEC AlphaServer available for this research. I also thank the reviewers for suggestions that significantly improved the article. Robert Fleischmann and Clyde Hutchison reviewed the new regions of the H. influenzae Rd and the M. genitalium genomes found by DPS. The work was supported in part by the Michigan Research Excellence Fund.

REFERENCES

ALTSCHUL, S.F., GISH, W., MILLER, W., MYERS, E.W., and LIPMAN, D.J. (1990). Basic local alignment search tool. J Mol Biol 215, 403-410.
CHAO, K.-M., and MILLER, W. (1995). Linear-space algorithms that build local alignments from fragments. Algorithmica 13, 106-134.
CHAO, K.-M., ZHANG, J., OSTELL, J., and MILLER, W. (1995). A local alignment tool for very long DNA sequences. Comput Appl Biosci 11, 147-153.
FLEISCHMANN, R.D., ADAMS, M.D., WHITE, O., CLAYTON, R.A., KIRKNESS, E.F., KERLAVAGE, A.R., et al. (1995). Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269, 496-512.
FRASER, C.M., GOCAYNE, J.D., WHITE, O., ADAMS, M.D., CLAYTON, R.A., FLEISCHMANN, R.D., et al. (1995). The minimal gene complement of Mycoplasma genitalium. Science 270, 397-403.
GISH, W., and STATES, D.J. (1993). Identification of protein coding regions by database similarity search. Nature Genet 3, 266-272.
HENIKOFF, S., and HENIKOFF, J.G. (1992). Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89, 10915-10919.
HUANG, X., and MILLER, W. (1991). A time-efficient, linear-space local similarity algorithm. Adv Appl Math 12, 337-357.
HUANG, X., and ZHANG, J. (1996). Methods for comparing a DNA sequence with a protein sequence. Comput Appl Biosci In press.
JOHNSTON, M., ANDREWS, S., BRINKMAN, R., COOPER, J., DING, H., DOVER, J., et al. (1994). Complete nucleotide sequence of Saccharomyces cerevisiae chromosome VIII. Science 265, 2077-2082.
PEARSON, W.R., and LIPMAN, D. (1988). Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85, 2444-2448.
WILBUR, W.J., and LIPMAN, D.J. (1983). Rapid similarity searches of nucleic acid and protein data banks. Proc Natl Acad Sci USA 80, 726-730.

Address reprint requests to:
Xiaoqiu Huang
Department of Computer Science
Michigan Technological University
1400 Townsend Drive
Houghton, MI 49931