A profile-based protein sequence alignment algorithm for a domain clustering database

Lin Xu 1,2, Fa Zhang 1 and Zhiyong Liu 1,3
1 Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences
2 Graduate School of Chinese Academy of Sciences
3 National Natural Science Foundation of China

Abstract- To address the two main shortcomings of homology modeling, we have designed and built a domain clustering database. Searching this database is a fundamental operation. However, current alignment algorithms are mainly sequence-based and ignore the structural conservation within domains. This paper proposes a profile-based alignment that incorporates structure information into the profile, exploiting the characteristics of our domain database. We designed an experiment within the database. The results show that both the quality and the sensitivity of our scheme are better than those of the pure Smith-Waterman and sequence-based profile algorithms. We believe that this work can help to improve protein structure prediction.

I. INTRODUCTION

Sequence alignment is a fundamental tool in computational biology and bioinformatics. It yields a great deal of useful information, such as which genes have the same function, which RNAs belong to the same class and which proteins share the same structural topology. Moreover, in protein structure prediction, obtaining the alignment between a protein sequence of unknown structure (the query) and its homologs of known structure (the templates) is the most fundamental step of the modeling process, and the quality of the alignment greatly affects the prediction result. Generally speaking, there are three categories of methods for creating an alignment: single-sequence based, multiple sequence alignment and profile based.
Single-sequence based methods use the standard dynamic programming algorithm to generate the alignment, for example the Needleman-Wunsch algorithm [1] and the Smith-Waterman algorithm [2]. Since these methods use only the sequence information, the quality of the alignment drops sharply when the sequence identity is below 30%. Multiple sequence alignment methods align three or more sequences. Since the simultaneous alignment of several sequences is an NP-hard computational problem, most of these methods use a heuristic algorithm, such as ClustalW [3], DIALIGN [4] and T-COFFEE [5]. However, alignment quality and computational cost are two critical problems for this kind of method. Profile-based methods have advanced rapidly since the development of the PSI-BLAST program by Altschul et al. [6]. These methods improve the alignment quality by using a profile to describe the features shared by similar sequences and by aligning a sequence or a profile with another profile. Because the profile accurately records the most relevant information from the multiple sequence alignment, the quality of this kind of method is better than that of the others. Several groups have published profile-based alignment methods, such as PSI-BLAST [6] and HMMER [7]. Most profile-based methods use the standard Smith-Waterman local alignment method, but they vary significantly in a number of important respects, such as scoring functions, gap penalties, weighting schemes and whether a secondary-structure substitution matrix is added.

Although these methods use different information and different methodologies, accurate alignment remains a major challenge, especially when the sequence similarity falls into the twilight zone (<=25% sequence identity). This is largely because it is often very difficult to obtain a correct scoring matrix for not only the mutations but, more importantly, the insertions and deletions

1-4244-0623-4/06/$20.00 ©2006 IEEE

occurring during evolution. It is generally accepted that a structural alignment based only on the three-dimensional coordinates accurately represents the corresponding residues as well as the boundaries and sites of any gaps.

As stated above, sequence alignment is a critical factor in protein structure prediction. It is also important to note that existing modeling techniques still use sequence alignment to select structural templates. However, these techniques suffer from several shortcomings, which limit the practical applicability of comparative modeling:

1. Since the number of known structures is much smaller than the number of known sequences, the lack of templates is a big problem.
2. The quality of the query-template alignment drops sharply when the sequence identity falls below 30%.

To address the first problem, we designed and constructed a domain-based template database in our previous study. For each template in this database, we superimposed the corresponding structures and provided a multiple structure alignment based only on the three-dimensional information of the structures involved. The details are described in [8]. To address the second problem, we present a method to extract a profile from the multiple structure alignment of each template in our database. We also develop a query-template alignment method that uses this profile. Preliminary experimental results show that our profile-based alignment method significantly improves the accuracy of structural template selection.

The paper is organized as follows. The next section briefly reviews the construction of the domain-based template database. The subsequent section describes the profile-based alignment using our domain-based templates. We then present and discuss experimental results on several datasets. Finally, we give the main conclusions of the paper and discuss future work.

II. CONSTRUCTION OF THE DOMAIN-BASED TEMPLATE DATABASE

Since proteins evolve with their structural and functional domains as independent units, proteins and their structures can largely be described as combinations of conserved protein domains. This motivates us to construct a domain-based template database to increase the likelihood of finding widely applicable structure templates.

We first searched for all the InterPro [9] domains in the PDB [10] using the iprscan software [11], and mapped the corresponding protein structures in the PDB. All the PDB protein sequences in this project were parsed directly from the structural records reorganized by the MSD database. From the protein family and superfamily databases, such as Pfam [12], SCOP [13], SMART [14] and TIGRFAMs [15], we then used the program HMMER [7] to obtain the consensus sequence for each InterPro domain. Next, we partitioned the structural correspondences from the PDB for each InterPro domain, and constructed a primary domain cluster. For each domain cluster, we compared the sequences of all the domains involved with the relevant consensus sequence based on sequence similarity, and then refined the domain cluster by removing the structures whose sequence identity or structure similarity is less than a pre-defined threshold. Since all the domains in each cluster are conserved in both sequence and structure, we adopt the conserved structures as the template for the relevant domain cluster. The domain-based template database can be accessed at http://www.nhpc.ac.cn/nhpc/english/research/protein.jsp.

Fig. 1. Two typical structural ensembles of conserved domain clusters (panels A and B).

Two typical structure ensembles are illustrated in Fig. 1. Ensemble A shows the structure ensemble of domain cluster IPR00008, and includes 8 individual structures. All the RMSD of the structures involved are less than Å. Ensemble B shows the structure ensemble of the PDZ domain cluster (IPR001478), which includes 4 structures. All the RMSD of the structures involved are less than 3 Å.

To highlight the conserved structural regions in each domain cluster, we superimposed these conserved structures

using the Dali [16] or CE [17] algorithm. Based on the conserved structural ensemble for each domain cluster, we also generated a multiple structural alignment for each domain cluster, purely from the backbone coordinates of the residues. Since this structural alignment is independent of sequence similarity, it provides more sensitive and position-specific signatures than a sequence alignment. The details of the database construction are described in [8].

III. PROFILE-BASED ALIGNMENT USING THE DOMAIN-BASED TEMPLATE DATABASE

An accurate and complete query-template alignment is critical in comparative modeling. Most current modeling techniques generate the query-template alignment from sequence information alone and ignore structure information. Pair-wise alignment algorithms such as Smith-Waterman, FASTA [18] and BLAST [19] cannot capture the full joint information content of the group, even when the multiple-alignment consensus sequence is used as the query. Since Gribskov first introduced the idea of using profiles to search databases [20], sequence alignment profiles have been shown to be very powerful in creating accurate sequence alignments. In recent years, many strategies, including the use of structural information and surface accessibility, have been proposed to determine the profile or the position-specific gap penalties.

Since our template database provides a multiple structural alignment and a superimposed structure ensemble for each domain cluster (template), we believe that we can build more reliable and sensitive profiles by using both the multiple structure alignment and the relevant structure information. In order to build a profile for each domain, we analyzed the sequence and structure characteristics of the database statistically, since we expect most domains in the database to be conserved in both sequence and structure. A profile for each domain cluster can then be built from the multiple structure alignment and the relevant structure information.
The query sequence is then searched against the database using the position-specific scoring matrix of each domain cluster to validate our profile-based alignment.

A. Conservation Statistics from Sequence and Structure Information in Domain Database

In order to build the profile for each template from its sequence and structure information, we first analyzed statistically the relationship between the residues and the structural information in the template database. For each amino acid position in each template, we classified the coordinates and residue types based on the superimposed structure ensemble. Here, the spatial location of the alpha carbon atom is taken to represent each amino acid. We extracted the alpha carbon atoms from each structure, aligned these positions and clustered them by calculating the distance between each pair of positions. We used a hierarchical clustering algorithm for this, with Å as the distance cutoff.

Table I lists some statistics for several domain clusters in the database. In the table, column 1 is the domain cluster ID, column 2 is the percentage of coordinate clusters that contain only one amino acid type, and columns 3 to 7 are the percentages of coordinate clusters that contain two to six residue types, respectively. The table gives the distribution of the number of amino acid types in each coordinate cluster. It shows that the amino acids in one coordinate cluster are most likely to be of a single residue type when the cutoff Å is used to cluster the positions.
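The counting procedure described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes single-linkage clustering (the paper says only "hierarchical clustering"), the function names `cluster_positions` and `residue_type_distribution` are hypothetical, and the cutoff value and coordinates are toy data.

```python
from collections import Counter
from math import dist

def cluster_positions(atoms, cutoff):
    """Single-linkage clustering of C-alpha atoms (assumed linkage).

    atoms  : list of (residue_type, (x, y, z)) pooled from the superimposed structures
    cutoff : distance threshold in angstroms (hypothetical value chosen by the caller)
    Returns a list of clusters, each a list of indices into atoms.
    """
    parent = list(range(len(atoms)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of atoms closer than the cutoff.
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            if dist(atoms[i][1], atoms[j][1]) <= cutoff:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(atoms)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def residue_type_distribution(atoms, clusters):
    """Percentage of coordinate clusters that contain exactly k residue types."""
    counts = Counter(len({atoms[i][0] for i in c}) for c in clusters)
    total = len(clusters)
    return {k: 100.0 * v / total for k, v in sorted(counts.items())}

# Toy data: two spatial positions, three superimposed structures each.
atoms = [
    ("A", (0.0, 0.0, 0.0)), ("A", (0.3, 0.1, 0.0)), ("A", (0.2, 0.0, 0.2)),
    ("G", (3.8, 0.0, 0.0)), ("S", (3.9, 0.2, 0.1)), ("G", (4.0, 0.0, 0.0)),
]
clusters = cluster_positions(atoms, cutoff=1.0)
distribution = residue_type_distribution(atoms, clusters)
print(distribution)
```

In this toy case one coordinate cluster holds a single residue type and the other holds two, which is the kind of per-cluster count that Table I summarizes as percentages.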
TABLE I
SOME STATISTICS RESULTS FROM THE DATABASE
(percentage of coordinate clusters containing 1 to 6 residue types)

ID          1       2      3     4     5     6
IPR00000    84.01   8.82   3.85  2.86  0.25  0.21
IPR000003   85.68  10.58   3.48  0.27
IPR000005   96.02   3.98
IPR000006   96.52   2.61   0.87
IPR000007  100.00
IPR000008   90.02   9.52   0.45
IPR00002    96.27   3.73
IPR00004    87.80  11.14   1.06
IPR000023   87.63  11.73   0.55  0.09

Fig. 2 shows the same analysis for the whole database: one residue type in a coordinate cluster is 95.66% over all domain clusters, two residue types is 3.63%, three residue types is 0.49%, and more than three residue

types in one cluster is only 0.23%. Therefore, the overwhelming majority of the amino acids at one position belong to the same residue type. This conservation across sequence and structure information helps us to build a more accurate profile.

Fig 2. The distribution of different types in one coordinate cluster.

B. Building Profile from Sequence and Structure Information

Based on the sequence and structure information, we build a profile for each domain cluster. The profile is defined as a position-specific scoring matrix M(p,a) composed of 21 columns and m rows (m = length of the alignment). The first 20 columns of each row specify the scores of the 20 amino acid residues, respectively. An additional column contains a penalty for insertions or deletions at that position.

At position p of alignment A (N structures), AA(a) is defined as the class of amino acid type a, SS(i) is the class of alpha carbon coordinate cluster i (described in the previous section) and W(p,a) is the weight for the appearance of amino acid a at position p.

For the sequence information, the weight of each amino acid type is determined as follows. Suppose that there are n(a) items in residue class AA(a); then the average weight for class AA(a) is

\[ W_1(p,a) = n(a)/n. \]

For the structure information, the weight of each class is determined as follows. Suppose that there are n(s_i) items in class SS(i); then the weight is

\[ W_2(p,s_i) = n(s_i)/N. \]

W(p,a) can then be calculated from W_1 and W_2:

\[ W(p,a) = \sigma \left[ W_1(p,a) \sum_{i \in AllSS} n(a,i) \, W_2(p,s_i) \right] \]

Here, \sigma is a normalization factor which ensures that

\[ \sum_{a \in \{\text{amino acid types}\}} W(p,a) = 1, \]

and n(a,i) is the number of items of class SS(i) for amino acid a. The position-specific scoring matrix M(p,a) is then given by

\[ M(p,a) = \sum_{b=1}^{20} W(p,b) \, Y(a,b), \]

where Y(a,b) is a scoring matrix, such as BLOSUM62.
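To make the weighting concrete, the following Python sketch computes W(p,a) and one profile row M(p,·) for a single alignment column. It is only an illustration under stated assumptions: it reads the combined weight as W(p,a) = σ · W_1(p,a) · Σ_i n(a,i) · W_2(p,s_i), takes n = N (the number of structures observed at the position), interprets n(a,i) as the number of residues of type a in coordinate cluster i, and uses a toy match/mismatch matrix in place of BLOSUM62. The names `column_weights` and `profile_row` are hypothetical.

```python
from collections import Counter

def column_weights(column):
    """Combined weight W(p, a) for one alignment position.

    column: list of (residue_type, coordinate_cluster_id), one pair per structure.
    """
    n_total = len(column)                       # N, also used as n (assumption)
    n_res = Counter(res for res, _ in column)   # n(a)
    n_clu = Counter(clu for _, clu in column)   # n(s_i)
    n_pair = Counter(column)                    # n(a, i) (assumed interpretation)

    raw = {}
    for a in n_res:
        w1 = n_res[a] / n_total                 # sequence weight W_1(p, a)
        structural = sum(
            n_pair[(a, i)] * (n_clu[i] / n_total)   # n(a, i) * W_2(p, s_i)
            for i in n_clu
        )
        raw[a] = w1 * structural

    sigma = 1.0 / sum(raw.values())             # normalizes the weights to sum to 1
    return {a: w * sigma for a, w in raw.items()}

def profile_row(weights, subst, alphabet):
    """M(p, a) = sum over b of W(p, b) * Y(a, b) for every a in alphabet."""
    return {
        a: sum(w * subst[(a, b)] for b, w in weights.items())
        for a in alphabet
    }

# Toy substitution matrix standing in for BLOSUM62: +4 match, -1 mismatch.
alphabet = "AGS"
subst = {(a, b): (4.0 if a == b else -1.0) for a in alphabet for b in alphabet}

# Four structures at this position, all in coordinate cluster 0: three Ala, one Gly.
column = [("A", 0), ("A", 0), ("A", 0), ("G", 0)]
weights = column_weights(column)
row = profile_row(weights, subst, alphabet)
print(weights)
print(row)
```

With this toy column the dominant residue receives most of the weight, so the profile row rewards alanine, mildly penalizes glycine and penalizes the unobserved serine most.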
The position-dependent penalties for insertions and deletions in the profile can be set to a high value to prevent insertions at positions where no gaps occur, and to a low value to allow insertions in regions where insertions are observed in the alignment. The penalty gap(l) applied for creating a gap while matching the profile to the query is given by

\[ gap(l) = gap \cdot [\, gap\_open + gap\_ext \cdot l \,], \]

in which gap is the penalty given in the last column of the profile, l is the number of residue positions in the gap, and gap_open and gap_ext are the penalties for gap opening and gap extension, respectively.

C. Profile-Based Alignment

Since our profile accurately records both sequence and structure properties, we can use the profile of each domain cluster with the Smith-Waterman local alignment algorithm to find the domain to which the query sequence most likely belongs. The major difference between our profile-based alignment and both the plain dynamic programming algorithm and other profile-based alignment algorithms lies in the scoring scheme. Our profile-based alignment uses not only the sequence information derived from the domain cluster but also the structure information extracted from the superimposed structure ensembles. In the plain dynamic programming algorithm, by contrast, the score is based on the comparison of the amino acids at corresponding positions in two sequences, and other profile-based alignment algorithms mostly use the sequence information derived from family sequences.

IV. RESULTS AND DISCUSSIONS

To evaluate the performance of the alignment scheme described in this paper, we tested it on the whole database. In each domain cluster, the reference sequence is the one whose structural distance to the other members is the

smallest. We also selected the sequence whose structure is most distant from the reference as the benchmark. The benchmark sequence was then searched against the whole database with our profile-based alignment algorithm.

Using the statistics obtained from the database, we classified the domain clusters into four types: sequence and structure conserved; structure conserved; sequence conserved; and mixed. Conservation is defined as the number of amino acid types, or of 3D coordinate clusters, being less than half of the total number at each position in the alignment. The sequence and structure conserved type contains the domain clusters whose amino acid types and 3D coordinates are both conserved; the structure conserved type contains the domains in which only the 3D coordinates are conserved; in the third type only the amino acid types satisfy the conservation condition; the last type consists of domains with both amino acid conserved parts and 3D coordinate conserved parts. Table II lists the number of domain clusters of each type.

TABLE II
THE NUMBER OF DOMAIN CLUSTERS IN EACH TYPE

Class type                              Number of domain clusters
sequential and structural conserved     1051
structural conserved                    28
sequential conserved                    784
mixed type                              974

We picked some domain clusters from each type to evaluate the scoring scheme of our alignment algorithm. In each domain cluster, we selected a query and then aligned it to the whole database.

A. Sequence and Structure Conserved Domain

The domain clusters of this type are the most conserved ones, and our scoring scheme can reflect their conserved features. To evaluate the alignment significance, we compared our profile-based alignment with the pure Smith-Waterman sequence alignment and with a sequence-based profile alignment algorithm. The score matrix in the Smith-Waterman algorithm is BLOSUM62. The gap open and gap extension penalties are 2 and 2 respectively. The sequence-based profile alignment is an ordinary profile-based alignment: it builds the profile by counting the number of each amino acid type and using that count as the weight of the type.

Using these 3 alignment methods, we compared the queries (1,051 datasets in total) with the entire database. Table III shows the number of hits and false results for each method. The table shows that the profile-based methods improve the alignment significance. Using the consensus sequences aligned by the Smith-Waterman algorithm, we obtain only 788 (~75%) hits in this cluster type, which corresponds to a high false rate of about 25%. The sequence-based profile brings sequence information into the profile and the scoring scheme and raises the hit rate to 88%, but 122 false hits remain. Our profile contains not only the sequence information but also the structure information, and raises the hit rate to 91%.

TABLE III
THE NUMBER OF HITS AND FALSE FOR EACH METHOD

Method                              hits         false
Smith-Waterman                      788 (75%)    263 (25%)
Sequence-based profile alignment    929 (88%)    122 (12%)
Combined profile alignment          952 (91%)     99 ( 9%)

Fig 2. (A) A segment of the multiple structure alignment in cluster IPR00029. (B) The relevant structure superposition. (C) The alignment between the query and cluster IPR005905 by the Smith-Waterman algorithm.

Fig 3. Distribution of alignment scores for comparing a query from IPR00029 with the whole database (x-axis: Score; y-axis: Number of Entries).
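As a rough illustration of the profile search evaluated in this section, the Python sketch below aligns a query to a profile with a Smith-Waterman style recurrence and position-dependent affine gap costs. It is a sketch under assumptions, not the authors' code: a gap of length l at a profile position with gap weight g is assumed to cost g · (gap_open + gap_ext · l), insertions in the query are charged with the weight of the profile row they follow, and the function name, default penalties, mismatch score and toy profile are all hypothetical.

```python
NEG_INF = float("-inf")

def profile_smith_waterman(profile, gap_weights, query,
                           gap_open=2.0, gap_ext=1.0, unknown=-4.0):
    """Best local alignment score of query against a profile.

    profile     : list of dicts, profile[p][residue] -> score M(p, residue)
    gap_weights : list of per-position gap weights (last profile column)
    unknown     : score used when a residue is missing from a profile row
    """
    m, n = len(profile), len(query)
    H = [[0.0] * (n + 1) for _ in range(m + 1)]      # best score ending at (i, j)
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # profile position i skipped
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # query residue j inserted
    best = 0.0

    for i in range(1, m + 1):
        g = gap_weights[i - 1]
        open_cost = g * (gap_open + gap_ext)   # first gap position: l = 1
        ext_cost = g * gap_ext                 # each further gap position
        for j in range(1, n + 1):
            D[i][j] = max(H[i - 1][j] - open_cost, D[i - 1][j] - ext_cost)
            I[i][j] = max(H[i][j - 1] - open_cost, I[i][j - 1] - ext_cost)
            match = H[i - 1][j - 1] + profile[i - 1].get(query[j - 1], unknown)
            H[i][j] = max(0.0, match, D[i][j], I[i][j])   # 0 restarts locally
            best = max(best, H[i][j])
    return best

# Toy profile for the consensus "ACD": +5 for the expected residue, -1 otherwise.
toy_alphabet = "ACDX"
toy_profile = [{c: (5.0 if c == t else -1.0) for c in toy_alphabet} for t in "ACD"]

exact = profile_smith_waterman(toy_profile, [1.0, 1.0, 1.0], "ACD")
with_insert = profile_smith_waterman(toy_profile, [1.0, 1.0, 1.0], "AXCD")
gap_forbidden = profile_smith_waterman(toy_profile, [10.0, 10.0, 10.0], "AXCD")
print(exact, with_insert, gap_forbidden)
```

With low gap weights the inserted residue is bridged by a gap, while with high gap weights the alignment falls back to the ungapped "CD" segment, which is the behaviour the position-dependent penalties are meant to control. Searching a database would amount to scoring the query against each cluster profile and ranking the clusters by score.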

Fig.2 and Fig.3 demonstrate another example in which our method is more sensitive than the other two methods. Here, we chose a query, labeled 2dln_248_276, from the domain cluster IPR00029. Fig. 2A and 2B show a segment of the multiple structure alignment and the relevant structure superposition in the cluster. These domains are very similar at the structure level but differ somewhat at the sequence level. We also note that a domain in cluster IPR005905 has a segment whose sequence is identical to that of the query, as shown in Fig. 2C. As a result, both Smith-Waterman and sequence-based profile alignment assigned the query to cluster IPR005905. Our profile-based method, however, can distinguish the query from the other clusters. Fig.3 shows the alignment scores for comparing the query with the whole database. In this figure, the highest score is 60, which is the alignment score between the query and the profile of cluster IPR00029, the correct domain cluster. Using our profile therefore significantly improves the alignment sensitivity.

B. Structure Conserved Domain

The domains of this type are conserved only in structure. The structure topology within one domain cluster takes the same shape, but the amino acid type at some positions is variable. In biology, the amino acid type can mutate while the structure and function stay the same. This phenomenon is difficult to handle with sequence alignment schemes, such as local, global or sequence-based profile alignment.

TABLE IV
THE NUMBER OF HITS AND FALSE FOR EACH METHOD

Method                              hits        false
Smith-Waterman                      20 (71%)    8 (29%)
Sequence-based profile alignment    23 (82%)    5 (18%)
Combined profile alignment          27 (96%)    1 ( 4%)

For this cluster type, we tested the 3 kinds of alignment methods; the hits and false results are shown in Table IV. Although there are only 28 domain clusters of this type, the results still show that our profile-based method improves the hit rate. Because the profile reflects the characteristics of a family, the sequence-based profile method improves the hit rate a little. Furthermore, the combined profile method improves the results with the structure information.

We chose a query, labeled blba_75, from the domain cluster IPR00064. Fig.4A shows the structure superposition in the domain cluster. Since the amino acid type at some positions is variable, both the Smith-Waterman and the sequence-based profile alignment methods gave the highest score to the consensus of cluster IPR0024. Although some segments of the alignment matched well, as shown in Fig.4B, the result was wrong. Our profile-based alignment method, however, gives the highest score to the consensus of the correct domain cluster, as shown in Fig.5. The highest score is 9, which is the alignment score between the query and the profile of cluster IPR00064.

Fig. 4. (A) The structure superposition in cluster IPR00064. (B) The alignment between the query and the consensus of cluster IPR0024.

Fig. 5. The distribution of alignment scores for comparing the query from cluster IPR00064 with the whole database (x-axis: Score; y-axis: Number of Entries).

C. Sequential Conserved Domain

TABLE V
THE NUMBER OF HITS AND FALSE FOR EACH METHOD

Method                              hits         false
Smith-Waterman                      665 (85%)    119 (15%)
Sequence-based profile alignment    735 (94%)     49 ( 6%)
Combined profile alignment          745 (95%)     39 ( 5%)

There are 784 domain clusters of this type in our database. Table V shows the results of comparing the three kinds

of alignment methods. Here we selected a query from domain cluster IPR00356. Fig.6A shows that there are some variable regions in these domains, and Fig.6B shows that the sequences are more conserved. Fig.6C shows that the highest alignment score is 57, between the query and the profile from cluster IPR00356, whereas the other two methods implied that the query belongs to cluster IPR00707. Our scheme is therefore shown again to improve the alignment results and sensitivity.

Fig 6. (A) The structure superposition and (B) the multiple structure alignment in domain cluster IPR00356. (C) The distribution of alignment scores for comparing the query from cluster IPR00356 with the whole database (x-axis: Score; y-axis: Number of Entries).

D. Mixed Type

Table VI compares the results of the three methods on the mixed type datasets. Because domain clusters of this kind contain some regions that are variable at the sequence level, the Smith-Waterman algorithm performs much worse than the others. Our profile-based method improves the accuracy by combining the sequence and structure information, although the hit rate improves by only about one percent over sequence-based profile alignment. We randomly selected a query from cluster IPR004227 of this type. Fig. 7 shows the distribution of alignment scores for comparing the query with the whole database. The highest score implies that the query belongs to cluster IPR004227. This figure shows again that our method is more sensitive than the other 2 methods.

TABLE VI
THE NUMBER OF HITS AND FALSE FOR EACH METHOD

Method                              hits    false
Smith-Waterman                      428     546
Sequence-based profile alignment    936      38
Combined profile alignment          941      33

Fig 7. The distribution of alignment scores for comparing the query from cluster IPR004227 with the whole database (x-axis: Score; y-axis: Number of Entries).

V. CONCLUSION AND FUTURE WORKS

In this paper we proposed a profile-based alignment algorithm for use with our domain-based template database. The statistical analysis shows that most of the domain clusters in our database are conserved at both the structural and the sequence level, so each element of our profile combines the structural clustering information with the sequence information. With this profile, we developed a profile-based query-template alignment method. To check whether our method is more accurate and sensitive than other query-template alignment methods, we divided our database into four types, based on sequence and structure conservation, and carried out experiments on each type. The results from each type show that our profile can accurately describe the features of the domain cluster, as well

as our profile-based method can align the query to the right template with a low false rate. They show that our method is more sensitive than the other query-template alignment methods.

As described above, our final goal is protein structure prediction. How to use our domain-based template database and our profile-based query-template alignment method to improve the prediction of protein structure will be investigated in our future work.

ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China project under 603060 and key project under 906209.

REFERENCES

[1] Needleman S, Wunsch C, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol, 1970, vol.48, p.443-453.
[2] Smith T, Waterman M, Identification of common molecular subsequences, J Mol Biol, 1981, vol.147, p.195-197.
[3] J. Thompson, D. Higgins, and T. Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Res, 1994, vol.22, p.4673-4680.
[4] Michael Brudno, Michael Chapman, Berthold Gottgens, Serafim Batzoglou and Burkhard Morgenstern, Fast and sensitive multiple alignment of large genomic sequences, BMC Bioinformatics, 2003, vol.4, p.66.
[5] C. Notredame, D. Higgins, J. Heringa, T-Coffee: A novel method for multiple sequence alignments, J Mol Biol, 2000, vol.302, p.205-217.
[6] Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res, 1997, vol.25, p.3389-3402.
[7] S.R. Eddy, Profile hidden Markov models, Bioinformatics, 1998, vol.14, p.755-763.
[8] Fa Zhang, Jingchun Chen, Zhiyong Liu and Bo Yuan, The construction of structural templates for the modeling of conserved protein domains, International Conference on Bioinformatics and its Applications (ICBA 04), Fort Lauderdale, Florida, USA.
[9] Mulder, N.J., et al., InterPro, progress and status in 2005, Nucleic Acids Res, 2005, vol.33 (Database issue), p.D201-205.
[10] Berman, H.M., et al., The Protein Data Bank, Acta Crystallogr D Biol Crystallogr, 2002, vol.58 (Pt 6 No 1), p.899-907.
[11] Zdobnov, E.M. and R. Apweiler, InterProScan: an integration platform for the signature-recognition methods in InterPro, Bioinformatics, 2001, vol.17(9), p.847-848.
[12] Bateman, A., et al., The Pfam protein families database, Nucleic Acids Res, 2004, vol.32 (Database issue), p.D138-141.
[13] Murzin, A.G., et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures, J Mol Biol, 1995, vol.247(4), p.536-540.
[14] Letunic, I., et al., SMART 4.0: towards genomic data integration, Nucleic Acids Res, 2004, vol.32 (Database issue), p.D142-144.
[15] Haft, D.H., J.D. Selengut, and O. White, The TIGRFAMs database of protein families, Nucleic Acids Res, 2003, vol.31(1), p.371-373.
[16] Holm, L. and C. Sander, Protein structure comparison by alignment of distance matrices, J Mol Biol, 1993, vol.233(1), p.123-138.
[17] Shindyalov I.N., Bourne P.E., Protein structure alignment by incremental combinatorial extension (CE) of the optimal path, Protein Engineering, 1998, vol.11(9), p.739-747.
[18] Pearson, W.R., Rapid and sensitive sequence comparison with FASTP and FASTA, Methods Enzymol, 1990, vol.183, p.63-98.
[19] S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, Basic local alignment search tool, J Mol Biol, 1990, vol.215, p.403-410.
[20] Gribskov, M., McLachlan, A.D., and Eisenberg, D., Profile analysis: detection of distantly related proteins, Proc Natl Acad Sci, 1987, vol.84, p.4355-4358.