Identification of Representative Protein Sequence and Secondary Structure Prediction Using SVM Approach

Prof. Dr. M. A. Mottalib, Md. Rahat Hossain
Department of Computer Science and Information Technology (CIT), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Dhaka, Bangladesh

Abstract

In recent years, representative sets of protein sequences drawn from protein data banks (e.g. RS126, CB513) have been used to predict the secondary structure of proteins. The secondary structure is a key step toward describing the 3D structure of a protein, and the 3D structure determines the protein's function, which is why it matters for discovering new drugs more accurately. This paper proposes a new method for identifying representative protein sequences, and uses an SVM (Support Vector Machine) trained on them to predict secondary protein structure more accurately. The CATH database is used as the protein data bank. Its protein sequences are divided into 1110 super-families, and the proposed method identifies 1110 representatives from these super-families by using a partitioning algorithm.

Keywords: Protein, Secondary Structure, Representative, CATH database, SVM (Support Vector Machine)

1 Introduction

Predicting the secondary structure of a protein is very important for discovering new drugs. The secondary structure is a key step toward the 3D structure of the protein, which describes the characteristics of the protein, and proteins are used in drug discovery on the basis of those characteristics. It is therefore important to know the secondary structure of an unknown protein. At present there is a severe information asymmetry between researchers who sequence organism genomes and those who elucidate the 3D structure of biomolecules. On the one hand, more and more genomes are being sequenced; on the other hand, protein structure information at the Protein Data Bank (PDB) (Berman et al., 2002) is accumulating very slowly.
Structure determination for the PDB relies heavily on experimental methods such as x-ray crystallography and NMR. These methods are accurate but expensive and time consuming: it takes a long time to determine the structure of a single unknown protein sequence, so it is impractical to apply them to every sequence. Protein structure prediction by homology modeling or computer simulation is therefore emerging as an alternative or complementary approach that works with a small subset of the protein sequences in a protein database.

The proteins in the CATH v3.0.0 database are divided into Classes and Architectures, and each Architecture contains topologies. Proteins that share a similar secondary 3D structure are grouped in a single topology. The whole protein world is divided into 1110 such topologies. These 1110 topologies essentially represent the whole protein world, and it is easier to sample and work with them than with the individual proteins [1].

In this research a method is proposed for finding a list of proteins that represent each of these topologies, namely a representative of each topology. To find a representative of a topology, the first step is to identify the ideal characteristics present in that topology, those that represent all the proteins under it. Using those ideal characteristics, a model is then built and matched against all the proteins to find the ones that hold the most ideal characteristics; these are considered candidates. From the candidate list the final representative is selected. Our algorithm carries out the whole process.

2 Existing Approaches

The Rost & Sander dataset [2], which has been used to train several early secondary structure prediction methods such as PHD [2] and PREDATOR [3], contains 126 proteins that share a pairwise sequence identity below 25%. As shown by Cuff and Barton [4], however, it contains some clear homologues if homology is not measured on a simple sequence identity basis alone.
Based on their studies, Cuff and Barton proposed a set of 513 protein chains in 1999 [4] (referred to in the following as the CB513 set). This set contains the 117 non-homologous chains of the Rost and Sander set (RS117) as well as 396 additional protein chains (CB396), carefully selected to exclude homologues as far as possible. The structural resolution of all proteins in the set is better than or equal to 2.5 Å. Over all amino acid residues in the set, the helix content is 34.5%, the sheet content 22.7% and the coil content 42.8%.

For our studies we use the CB513 protein set for initial training and testing of our prediction method and for optimizing several parameters. The CB513 set is therefore split into the CB396 set, which is used for training, and the RS117 set, which is used for testing.

Since the proteins in the CB513 set do not provide a comprehensive sampling of the fold space known today, our final prediction method is trained on a larger set of proteins. This final set is based on the SCOP [5] (Structural Classification of Proteins) database. Following a method for compiling a protein dataset described by Cuff and Barton [4], we retrieved one representative protein chain for each of the 1290 SCOP superfamilies in release 1.65 (December 2003) from the ASTRAL [6] database. From this list, all superfamilies
containing only members whose structure was resolved by NMR or with a resolution worse than 2.5 Å, as well as superfamilies belonging to the classes for transmembrane and multi-domain proteins (classes E and F), were removed. The result is a final dataset of 940 protein chains that form a representative subset of the fold space known today. The final database, named SCOP-SFR, contains 219 all-alpha, 202 all-beta, 190 α/β, 281 α+β and 48 small proteins, with a helix content of 36.79%, a sheet content of 22.78% and a coil content of 40.42%.

A fair comparison of our approach with other methods is difficult, since for most methods it is not published which proteins were used to train the method in its current state. Only a blind test therefore provides a fair setup for comparing our method against others. This need is met by the EVA server [7], which provides a set of recently published proteins together with blind-test secondary structure predictions for those proteins made by important prediction servers such as PSI-PRED and PHD. Since SCOP version 1.65 was released in December 2003, all proteins added to the EVA server after that time (1 January to November 2004) are blind-test examples for our prediction method, which is trained only on proteins available in the SCOP 1.65 release. This results in a test set of 105 proteins (named EVA105), with 9837 residues in total, for which predictions of PSI-PRED, PHD and PROF-SEC are available.

3 Method and Data Preparation

Any learning method needs an authentic reference dataset. In this paper all the data are collected from the CATH v3.0.0 database, where 86,151 domains are divided into 1110 topologies. The CATH database is a hierarchical domain classification of protein structures in the Protein Data Bank [1]. The database has four major levels of hierarchy or organization: Class, Architecture, Topology (fold family) and Homologous superfamily [8].
In order to generate a training dataset for the SVM classifier, protein sequences need to be collected in such a manner that the sequences represent each topology. The standard method for predicting secondary structure has three major parts:

1. Generating datasets or representatives of primary sequences from the databank.
2. The Sequence to Structure layer, including the detection of frequent amino acid patterns, the feature representation of those patterns, the selection of interesting features, and the use of those features to predict the secondary structure of a protein with unknown structure. The sequence to structure prediction method developed in this thesis employs Support Vector Machines in a similar way to that discussed by Liu et al. [15] and Ward et al. [16].
3. The features and methods employed in the Structure to Structure layer.

3.1 Representative Protein Extraction

Identifying the protein sequences that hold or represent the characteristics of a given topology is the first step of the analysis. For this, a model needs to be developed that is matched against all the proteins to find the ones that hold the most ideal characteristics and can be considered candidates. From the candidate list the final representatives are selected, one for each of the 1110 topologies into which the domains are divided.

Each topology contains a certain number of proteins; let that be N. Each protein sequence of length L is divided into three parts of equal or almost equal length, producing three segments Ψ1, Ψ2, Ψ3, where L1, L2 and L3 are the lengths of Ψ1, Ψ2 and Ψ3 respectively and L = L1 + L2 + L3. Let R = L mod 3; then L1 = L2 = L3 - R, so the third segment absorbs the remainder.

Total topologies in the CATH database: 1110
So the target is to find 1110 representatives from the primary protein sequences of the CATH database.

Total Frequency Matrix Generation (for each topology)

1. In each topology:
Topology: Helicase, Ruva Protein; domain 3
Total primary protein sequences in this topology: 370

i. Divide each primary sequence into 3 equal-length parts.

Primary protein sequence: PAPTPSSSPVPTLSPEQQEMLQAFSTQSGMNLEWSQKCLQDNNWDYTRSAQAFTHLKAKGEIPEVAFMK
Length: 69
3 equal-length parts: 69/3 = 23

If the length (for example 14) is not divisible by 3, then 14/3 = 4 with remainder 2:
Part 1: 4
Part 2: 4
Part 3: 4 + remainder (2) = 6

So, from the sequence above (length 69) we get:
Part 1 (23): PAPTPSSSPVPTLSPEQQEMLQA (Ψ1)
Part 2 (23): FSTQSGMNLEWSQKCLQDNNWDY (Ψ2)
Part 3 (23): TRSAQAFTHLKAKGEIPEVAFMK (Ψ3)

ii. For each part of the sequence, count the number of appearances of each of the 20 amino acids.

Part 1: PAPTPSSSPVPTLSPEQQEMLQA

Table 1 Number of appearances of the 20 amino acids
i.     G: 0      xi.    S: 4
ii.    A: 2      xii.   T: 2
iii.   P: 6      xiii.  C: 0
iv.    V: 1      xiv.   N: 0
v.     L: 2      xv.    Q: 3
vi.    I: 0      xvi.   K: 0
vii.   M: 1      xvii.  H: 0
viii.  F: 0      xviii. R: 0
ix.    Y: 0      xix.   D: 0
x.     W: 0      xx.    E: 2
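
The segmentation and counting steps above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function names `split_into_segments` and `count_amino_acids` are hypothetical, and the column order of `AMINO_ACIDS` is simply the order in which Table 1 lists the amino acids.

```python
from collections import Counter

# The 20 amino acids in the order used by Table 1 (assumed column order).
AMINO_ACIDS = "GAPVLIMFYWSTCNQKHRDE"


def split_into_segments(sequence):
    """Split a sequence into three parts; the last part absorbs the remainder."""
    base = len(sequence) // 3  # L1 = L2 = L // 3, L3 = L // 3 + (L mod 3)
    return sequence[:base], sequence[base:2 * base], sequence[2 * base:]


def count_amino_acids(segment):
    """Return the number of appearances of each of the 20 amino acids."""
    counts = Counter(segment)
    return {amino_acid: counts.get(amino_acid, 0) for amino_acid in AMINO_ACIDS}


sequence = ("PAPTPSSSPVPTLSPEQQEMLQAFSTQSGMNLEWSQKC"
            "LQDNNWDYTRSAQAFTHLKAKGEIPEVAFMK")
psi1, psi2, psi3 = split_into_segments(sequence)
part1_counts = count_amino_acids(psi1)
```

With the example sequence, `psi1` is the 23-residue Part 1 and `part1_counts` holds the per-residue counts that Table 1 tabulates.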

Calculate this for the other two parts also.

iii. The binary characteristics matrix B for a segment (e.g. Ψ1, Ψ2, Ψ3) of a particular sequence has dimension i × j, where the columns j are the 20 amino acids and the rows i are the numbers of occurrences of an amino acid in that segment. In other words, the number of appearances of each of the 20 amino acids is stored in a binary matrix with rows for the number of appearances and columns for the 20 amino acids. B can be defined as

$$B_{ij} = \begin{cases} 1 & \text{if the } j\text{th amino acid occurs } i \text{ times} \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

Thus there are three such matrices for a particular sequence. Combining the B matrices of each segment (Ψ1, Ψ2, Ψ3) individually over all sequences within a topology gives the Total Frequency matrices. There are three such Total Frequency matrices T for the entire topology, with the same dimension as the B matrix, and they are defined as

$$T_{ij} = \sum_{k=1}^{N} B_{ij}^{(k)} \tag{2}$$

where B^{(k)} denotes the B matrix of the kth of the N sequences in the topology. A weight matrix W is then generated for each of the three T matrices, with the same dimension as the T matrix. It can be defined as

$$W_{ij} = \frac{T_{ij}}{N} \times 100 \tag{3}$$

Using equations (2) and (3), the Total Characteristics matrix TC is generated for each segment (Ψ1, Ψ2, Ψ3). This TC matrix holds all the information about the characteristics of that topology. There are therefore three such matrices for the entire topology, with the same dimension as the T matrix, and they are defined as

$$TC_{ij} = \begin{cases} T_{ij} \, W_{ij} & \text{for } i = 0 \\ T_{ij} \, W_{ij} \, i & \text{otherwise} \end{cases} \tag{4}$$

where i is the number of occurrences. From these three TC matrices, three binary Ideal Characteristics matrices IC are generated for the topology; they represent its ideal (representative) characteristics. The IC matrix can be defined as

$$IC_{ij} = \begin{cases} 1 & \text{if } TC_{ij} \text{ holds the highest value of column } j \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

Finally, these IC matrices are used to select the candidate proteins for the representative of the topology. Using a strict or loose matching technique, each of the three segments of each protein sequence is matched with the corresponding one of the three IC matrices.
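
To make equations (1) to (5) concrete, the following Python sketch computes the IC matrix for one segment position from the corresponding segments of all sequences in a topology. It is a minimal sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the row range runs from 0 up to the longest segment length, and ties for the highest TC value in a column are broken in favour of the lowest occurrence count, which the text does not specify.

```python
AMINO_ACIDS = "GAPVLIMFYWSTCNQKHRDE"  # column order taken from Table 1


def binary_matrix(segment, rows):
    """Equation (1): B[i][j] = 1 if amino acid j occurs exactly i times."""
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(rows)]
    for j, amino_acid in enumerate(AMINO_ACIDS):
        matrix[segment.count(amino_acid)][j] = 1
    return matrix


def ideal_characteristics(segments):
    """Build the IC matrix from the same segment (e.g. Psi1) of N sequences."""
    n = len(segments)
    rows = max(len(segment) for segment in segments) + 1  # counts 0..max length
    cols = len(AMINO_ACIDS)

    # Equation (2): T is the element-wise sum of the B matrices.
    total = [[0] * cols for _ in range(rows)]
    for segment in segments:
        b = binary_matrix(segment, rows)
        for i in range(rows):
            for j in range(cols):
                total[i][j] += b[i][j]

    # Equation (3): W = (T / N) * 100.
    weight = [[total[i][j] / n * 100 for j in range(cols)] for i in range(rows)]

    # Equation (4): TC = T * W for i = 0, and T * W * i otherwise.
    tc = [[total[i][j] * weight[i][j] * (i if i > 0 else 1)
           for j in range(cols)] for i in range(rows)]

    # Equation (5): mark the row with the highest TC value in each column.
    # Assumption: on ties, max() keeps the lowest occurrence count.
    ic = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        best_row = max(range(rows), key=lambda i: tc[i][j])
        ic[best_row][j] = 1
    return ic


def ideal_counts(ic):
    """Read the ideal number of occurrences of each amino acid off an IC matrix."""
    return {amino_acid: next(i for i, row in enumerate(ic) if row[j] == 1)
            for j, amino_acid in enumerate(AMINO_ACIDS)}


toy_ic = ideal_characteristics(["AAG", "AAG", "AGG"])
toy_ideal = ideal_counts(toy_ic)
```

In this toy topology of three very short segments, the ideal characteristics are two occurrences of A, one occurrence of G, and none of the other amino acids.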
The sequence that best matches the IC matrices is selected as a candidate representative of that topology.

In the strict matching technique, the binary characteristics matrix B of each sequence is compared with the IC matrix to check whether they are identical, i.e. whether the mismatch error is 0. A mismatch counter is calculated and the mismatch threshold is set to a maximum of six. To calculate the mismatch error for a protein sequence using the strict matching technique, a comparison between the B matrix and the IC matrix is required for each segment (e.g. the B matrix of Ψ1 with the IC matrix for Ψ1, and so on) (Equation 6). The mismatch error for Ψ1 is then calculated by summing over the columns j of all 20 amino acids of the Ψ1 segment, and the errors for Ψ2 and Ψ3 are calculated in the same way (Equation 7). The total mismatch error for a protein sequence is then taken across all three segments (Equation 8).

If only one sequence with the minimum error min <= 6 is found, it is considered the representative of that topology. If more than one sequence is found with the same min, a candidate list is generated, and one sequence from that list is selected as the representative of the topology. If no sequence falls within the threshold value, a loose matching approach is followed.

In the loose matching approach, the threshold value is increased to sixty (3 matrices x 20 amino acids = 60). To calculate the mismatch error for a protein sequence using the loose matching technique, a comparison between the B matrix and the IC matrix is again required for each segment (e.g. the B matrix of Ψ1 with the IC matrix for Ψ1, and so on) (Equation 9). The mismatch error for Ψ1 is then calculated by summing over the columns j of all 20 amino acids of the Ψ1 segment, and the errors for Ψ2 and Ψ3 are calculated in the same way (Equation 10). The total mismatch error for a protein sequence is then taken across all three segments (Equation 11). If only one sequence with the minimum error is found, it is considered the representative of that topology; if more than one sequence is found with the same min, a candidate list is generated.
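
The matching step can be illustrated with the Python sketch below. Because the mismatch equations themselves are not reproduced in the text, the sketch rests on an assumption: the mismatch error of a segment is taken to be the number of amino acid columns in which the sequence's occurrence count differs from the ideal count encoded in the IC matrix, which gives a maximum of 3 x 20 = 60 and so agrees with the loose threshold stated above. The function names, and the representation of each IC matrix as a dictionary of ideal counts, are hypothetical.

```python
AMINO_ACIDS = "GAPVLIMFYWSTCNQKHRDE"  # column order taken from Table 1


def split_into_segments(sequence):
    """Split a sequence into three parts; the last part absorbs the remainder."""
    base = len(sequence) // 3
    return sequence[:base], sequence[base:2 * base], sequence[2 * base:]


def mismatch_error(sequence, ideal_counts):
    """Total mismatch over the three segments (assumed definition).

    ideal_counts is a list of three dicts, one per segment, mapping an amino
    acid to its ideal number of occurrences (missing keys mean 0).
    """
    total = 0
    for segment, ideal in zip(split_into_segments(sequence), ideal_counts):
        # One mismatch per amino acid column whose count differs from the ideal.
        total += sum(1 for amino_acid in AMINO_ACIDS
                     if segment.count(amino_acid) != ideal.get(amino_acid, 0))
    return total


def select_candidates(sequences, ideal_counts, strict_threshold=6,
                      loose_threshold=60):
    """Return the sequences with the minimum mismatch error.

    Strict matching is tried first; if no sequence is within the strict
    threshold, the loose threshold is used instead.
    """
    errors = [(mismatch_error(s, ideal_counts), s) for s in sequences]
    best = min(error for error, _ in errors)
    threshold = strict_threshold if best <= strict_threshold else loose_threshold
    return [s for error, s in errors if error == best and error <= threshold]


toy_ideal = [{"A": 1}, {"G": 1}, {"P": 1}]
toy_candidates = select_candidates(["AGG", "AGP", "GGG"], toy_ideal)
```

If the returned list holds a single sequence it is the representative; otherwise the list plays the role of the candidate list that is refined further.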
This candidate list then has to be refined. The refining process is iterative: the whole process so far (calculating the IC matrices of a topology, followed by strict or loose matching) is repeated on the candidate list, with N now being the number of proteins in the candidate list. As long as the number of candidates in the list decreases from one iteration to the next, the refining iteration continues. If the number of candidates does not decrease over 2 consecutive iterations, the list is considered the final candidate list. If only one sequence remains in the candidate list, it is considered the representative of that topology; if there is more than one, one of them is selected as the representative of that topology.

3.2 Feature Extraction

According to the CATH database, the 1110 topologies fall into 4 divisions based on their structural types: Mainly Alpha (α), Mainly Beta (β), Alpha Beta (α-β) and Few Secondary Structures (fss). Because the representative sequences selected during the previous stage (Representative Protein Extraction) were originally collected from the CATH database, their corresponding structural information can also be obtained from the database itself. This helps in preparing the model for the SVM.

All possible combinations of four amino acids are constructed from the 20 different amino acids. So a total of
160,000 different quadramers are created (20 × 20 × 20 × 20). Taking all representatives of a particular structural type, a Current list of representatives is created, while the representatives of the other 3 structural types are stored in an Other list. The frequency of each quadramer in both lists is then calculated. Let fc_{i,j} be the frequency of the ith quadramer in the jth candidate sequence from the Current list and fo_{i,j} be the frequency of the ith quadramer in the jth candidate sequence from the Other list. A difference matrix Diff_i is computed from the absolute difference between the numbers of occurrences of each of the 160,000 possible quadramers of amino acids in the Current and Other lists:

$$\mathrm{Diff}_i = \left| \sum_j fc_{i,j} - \sum_j fo_{i,j} \right|$$

where Diff_i is the absolute difference of the occurrences of the ith quadramer. Diff_i is sorted in descending order and the first 4000 quadramers, those with the largest differences, are selected as the feature set for SVM training. To normalize the values of these features, the minimum and maximum frequencies of each of these top 4000 quadramers in both the Current and Other lists are taken into account, and a normalization parameter norm_i is obtained from them according to Equation (12).

3.3 SVM Training

SVM, a supervised machine-learning technique, has been used for problems in computational biology because it handles computationally expensive and noisy data, which occur very frequently in biology, in a very efficient way. It can also solve multi-class classification problems using the structural risk minimization principle. Given a training set in a vector space, an SVM finds the best decision hyperplane separating two classes. The quality of the decision hyperplane depends on the margin between the two hyperplanes defined by the SVM [13, 14]. For the SVM training of the proposed method, Libsvm (version 2.84) is used [9]. Libsvm is free, simple, easy-to-use and efficient software for SVM classification and regression.
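
Returning to the feature selection of Section 3.2, the quadramer ranking can be sketched in Python as follows. This is an illustrative sketch under assumptions, not the authors' code: the function names are hypothetical, quadramers are counted with overlapping windows, frequencies are summed over all sequences of a list before the absolute difference is taken, and ties in the ranking are broken alphabetically. Only quadramers that actually occur are stored, since the remaining ones have a difference of zero.

```python
from collections import Counter


def quadramer_counts(sequences, k=4):
    """Count every overlapping k-residue window over a list of sequences."""
    counts = Counter()
    for sequence in sequences:
        for start in range(len(sequence) - k + 1):
            counts[sequence[start:start + k]] += 1
    return counts


def select_features(current, other, top_n=4000):
    """Rank quadramers by |count in Current - count in Other|, largest first."""
    fc = quadramer_counts(current)
    fo = quadramer_counts(other)
    # A Counter returns 0 for quadramers that do not occur in a list.
    diff = {q: abs(fc[q] - fo[q]) for q in set(fc) | set(fo)}
    ranked = sorted(diff, key=lambda q: (-diff[q], q))
    return ranked[:top_n]


toy_current = ["AAAAA", "AAAAC"]
toy_other = ["CCCCC"]
toy_features = select_features(toy_current, toy_other, top_n=2)
```

In the toy example the quadramer AAAA occurs three times in the Current list and never in the Other list, so it ranks first, ahead of CCCC.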
Libsvm is also a fast and memory-efficient implementation of an SVM. The training is done using the 1110 representatives from the CATH database selected through the representative selection procedure discussed before. These representatives are classified into four secondary classes. The training and testing of the proposed method is done in two ways: one-against-one classification and one-against-others classification, or multi-class prediction [10].

Table 2 The training data set built by taking the representative protein sequences from each T-level fold family of the 1110 total topology families
Columns: Classes; Number of fold families; Range of chain length
Rows: α; β; α-β; fss

4 Result and Analysis

In the one-against-one classification, each classifier is trained on data from two classes [11]. A binary classifier is constructed which maps examples of one of the classes to +1 and those of the other to -1. The prediction accuracy of each one-against-one classification using the proposed method is shown in Table 3, next to the prediction accuracy of the best-known existing SVM method [12]. SVM parameters S=1 and T=2 are used for this test (presumably Libsvm's -s and -t options, where -s 1 selects nu-SVC and -t 2 selects the RBF kernel).

Table 3 Prediction accuracy of one-against-one classification of the proposed method vs. the existing method
Columns: Classifiers; Accuracy (%) of the Proposed Method; Accuracy (%) of the Existing Method [12]
Rows: α vs. β; α vs. α-β; α vs. fss; β vs. α-β; β vs. fss; α-β vs. fss; Average accuracy

Table 4 The binary classification accuracies of class folds (one-against-others) of the proposed method vs. the existing method
Columns: Classifier; Accuracies of dipeptide frequency (%) of the Proposed Method; of the Existing Method [12]
Rows: α vs. other (including β, α-β and fss); β vs. other (including α, α-β and fss); α-β vs. other (including α, β and fss); fss vs. other (including α, β and α-β); Average accuracies

Four one-against-others classifications are used in this proposal, and the prediction accuracy is promising for most of them. The results are presented in Table 4. Two pairs of optimized parameters are used for this test: S=1, T=2 for α vs. other and α-β vs. other, and S=0, T=2 for β vs. other and fss vs. other. From these two pairs, the best accuracy for the multi-class prediction is taken. The classifier fss vs. others gives the highest prediction accuracy, about 92%, and β vs. others also gives good accuracy, in the range of 82% to 83%, for the parameters (S=0, T=2). The classifier α vs. others gives only about 84% accuracy, and the classifier α-β vs. other gives around 88% accuracy for the parameters (S=0, T=2). Table 4 shows remarkable results for all four classes of protein.

In addition, 150 random protein sequences were collected and tested using the proposed model, which is generated
during the SVM training process. Each class was tested against the proposed model of that respective class. Table 5 shows the test results for the 150 random sequences using the proposed model.

Table 5 Test result for 150 random sequences using the proposed model
Columns: Classifier; Correctly Identified (out of 150); Accuracy (%)
Rows: Mainly α; Mainly β; α-β; fss; Average accuracy

5 Future Development

So far, for the extraction of the characteristics of each topology, we have divided each protein sequence of a topology into three equal parts. To improve the results in this area, each protein sequence could be divided into three parts of variable length. For building the candidate list for the representative of each topology, and for choosing the representative from that list, a more complex decision model could be created. For the feature extraction of each structural division, we created a list of all possible amino acid sequences of length four and compared against it. Using a length of five may improve the quality of the results.

6 Conclusion

The results presented in this paper are the representatives of the topologies. The result analysis shows that the representatives of these 1110 topologies represent the whole protein world more accurately. These representatives can be used as a benchmark like RS126 and CB513, so that each structure prediction method can be trained on them and the results compared with each other. For the CATH database with an SVM approach, no existing result or data was found regarding the prediction of random protein sequences, so it is not possible to compare the proposed prediction results with existing ones. Compared with other prediction methods, however, the findings so far indicate that the proposed method will be able to predict the secondary structure of a protein at a satisfactory level of accuracy (above the existing accuracy of about 60-70%). In addition, the computation time will be reduced.

References

[1] Web URL:
[2] Rost B. and Sander C. (1993) Prediction of Protein Secondary Structure at Better than 70% Accuracy. J. Mol. Biol., 232.
[3] Frishman D. and Argos P. (1996) Incorporation of nonlocal interactions in protein secondary structure prediction from the amino acid sequence. Protein Engineering, 2.
[4] Cuff J. A. and Barton G. J. (1999) Evaluation and Improvement of Multiple Sequence Methods for Protein Secondary Structure Prediction. Proteins, 34.
[5] Murzin A. G., Brenner S. E., Hubbard T. J. P. and Chothia C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247.
[6] Brenner S. E., Koehl P. and Levitt M. (2000) The ASTRAL Compendium for Protein Structure and Sequence Analysis. Nucleic Acids Res., 28.
[7] Frishman D. and Argos P. (1995) Knowledge-based protein secondary structure assignment. Proteins, 23.
[8] Orengo C. A., Michie A. D., Jones S., Jones D. T., Swindells M. B. and Thornton J. M. (1997) CATH - A Hierarchic Classification of Protein Domain Structures. Structure, 5.
[9] Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines. Software available at
[10] Nguyen M. N. and Rajapakse J. C. (2003) Prediction of Protein Secondary Structure with two-stage multi-class SVMs. Genome Informatics, 14.
[11] Kreßel U. (1999) Pairwise classification and support vector machines. In Advances in Kernel Methods - Support Vector Learning, Cambridge, MA: MIT Press.
[12] Sun X.-D. and Huang R.-B. (2006) Prediction of protein structural classes using support vector machines. Amino Acids, Volume 30, Number 4, June 2006.
[13] Rost B. and Sander C. (1993) Prediction of Protein Secondary Structure at Better than 70% Accuracy. J. Mol. Biol., 232.
[14] Frishman D. and Argos P. (1996) Incorporation of nonlocal interactions in protein secondary structure prediction from the amino acid sequence. Protein Engineering, 2.
[15] Liu Y., Carbonell J., Klein-Seetharaman J. and Gopalakrishnan V. (2004) Context Sensitive Vocabulary And its Application in Protein Secondary Structure Prediction. SIGIR 04, Sheffield, South Yorkshire, UK.
[16] Ward J. J., McGuffin L. J., Buxton B. F. and Jones D. T. (2003) Secondary structure prediction with support vector machines. Bioinformatics, 13.
