PREDICTION OF PROTEIN BINDING SITES BY COMBINING SEVERAL METHODS

Size: px

Start display at page:

Download "PREDICTION OF PROTEIN BINDING SITES BY COMBINING SEVERAL METHODS"

Horace Daniels
5 years ago
Views:

1 PREDICTION OF PROTEIN BINDING SITES BY COMBINING SEVERAL METHODS T. Z. SEN, A. KLOCZKOWSKI, R. L. JERNIGAN L.H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University Ames, IA C. YAN, V. HONAVAR Department of Computer Science, Iowa State University Ames, IA K. M. HO, C. Z. WANG, Y. IHM, H. CAO Department of Physics and Astronomy, Iowa State University Ames, IA D. L. DOBBS, X. GU Department of Genetics, Development and Cell Biology, Iowa State University Ames, IA We have combined four different methods for the prediction of binding sites in proteins. The methods used included: data mining approach by Support Vector Machines, threading through protein structures, prediction of conserved residues on the protein surface by analysis of phylogenetic trees, and the Conservatism of Conservatism method of Mirny and Shakhnovich. We have combined all these methods together by developing consensus predictions. Our results show that the combination of all methods improves the accuracy of prediction over the accuracies of prediction of individual methods. 1. Introduction Protein-protein interactions play a critical role in protein function. Completion of many genomes is being followed rapidly by major efforts to identify interacting protein pairs experimentally in order to decipher the networks of interacting, coordinated-in-action proteins. Identification of protein-protein interaction sites and detection of specific residues that contribute to the specificity and strength of protein interactions is an important problem 1-3 with broad applications ranging from rational drug design to the analysis of metabolic and signal transduction networks. The ability to make better predictions of protein binding sites can be used to enhance gene annotations by identifying specific protein binding sites within genes, i.e. at a finer level of 1

2 2 detail than whole protein identification. Experimental detection of residues in protein-protein interaction surfaces must come either from determination of the structure of protein-protein complexes or from functional assays, such as yeast two hybrid experiments. Computational efforts to identify interaction surfaces 4;5 have been limited to date, and need strengthening because experimental determination of protein structures in general, and protein-protein complexes in particular, lags behind the number of known protein sequences. Also, the relatively slow progress in high-throughput crystallography means that there is an increased urgency for improved computations in the protein structure field. In particular, computational methods for identifying residues that participate in protein-protein interactions will assume an increasingly important role 6;7 Based on the different characteristics of known protein-protein interaction sites, several methods have been proposed for predicting these sites using a combination of sequence and structural information. These include methods based on the presence of proline brackets, 8 patch analysis using a 6- parameter scoring function, 9 analysis of the hydrophobicity distribution around a target residue, 10 multiple sequence alignments, 11 structure-based multimeric threading, analysis of amino acid characteristics of spatial neighbors to a target residue using neural networks. Because structural information is not available for most protein complexes, development of reliable computational methods for identifying residues likely to participate in protein interactions, as is the intention here, - from amino acid sequence alone - is critical. By exploiting available databases of protein complexes, the data-driven discovery of sequence and structural correlates of protein-protein interactions will offer a potentially powerful approach. Our recent work has focused on the analysis of sequence neighbors of a target residue using SVM and Bayesian classifiers. None of the existing methods achieve predictions with accuracies higher than about 70%. There is an acute need for multi-faceted approaches that utilize available databases of protein sequences, structures, protein complexes, phylogenies, as well as other sources of information for the data-driven discovery of sequence and structural correlates of protein-protein interactions. 2. Methods 2.1. Initial dataset and definition of surface residues and interface residues. The initial dataset has been assembled from 155 unique proteins extracted from a set of 70 hetero-complexes in PDB. Two definitions of interface residues are considered in this research. The first definition is based on the reduction of

3 accessible surface area (ASA) of individual proteins upon complex formation, calculated using the DSSP program. The relative ASA of a residue is its ASA divided by its nominal maximum area as defined by Rost and Sander. 12 A residue is defined as a surface residue if its relative ASA is at least 25% of its nominal maximum area, except for cysteine whose threshold of the minimum relative ASA is lower. By this definition, 55% of the residues in a set of 115 proteins were surface residues, corresponding to a total of 12,676 surface residues. A surface residue is defined as an interface residue if its calculated ASA in the complex is less than that in the monomer by at least 1Å 2. Using these definitions, we obtained a dataset of 3,727 interface residues and 8,949 non-interface residues (i.e., surface residues that are not in the interaction sites). Thus, on average, interface residues represent 29% of surface residues, or 15% of total residues for proteins in our dataset Sequence and Structure Conservation. Amino acid sequences are conserved for many different reasons related to the structure and function of proteins: for stability, enzyme active sites, subunit interfaces, facilitating an essential motion (hinges), and binding sites. One of the most difficult problems is to develop methods to identify the reason for conservation for individual highly conserved residues. This is one of the reasons that a combination of approaches is more likely to permit identification of binding residues. Even identifying the conserved residues themselves is not completely straightforward. As will be seen (vide infra) different approaches will indicate individual residues being conserved to different extents. We intended to take advantage of this by using several methods for sequence and structure conservation. Our principal two methods for this purpose were based on phylogeny for sequence conservation and conservatism of conservatism for structure conservation that often identify different residues as being conserved Using Phylogeny to Identify Conserved Residues. Many computational tools have been developed for identifying amino acid residues that are important for protein function/structure, but there is no consensus regarding the best measure for evolutionary conservation. Evolutionary conservation can be decomposed into three components: (1) the overall selective constraints -- the number of changes observed at a site, (2) the pattern of amino acid substitutions the number of amino acid types observed at a site, and (3) the effect of amino acid usage. We established a reliable relationship between each measure and various aspects of structure. To explore the connection between sequence conservation and functional-structural importance, we proposed a new measure that can decompose the conservation into these three components. This measure is based on phylogenetic analysis. 3

4 4 The evolutionary rate at site k during lineage l from amino acids i to j (i,j=1, 20) can be expressed as λ kl (i,j) = c k a lk Q(i,j k), where c k accounts for the rate variation among sites, a lk for site-specific lineage (or subtree) effect caused by functional divergence, and the matrix Q(i,j k) is the (sitespecific) model for amino acid substitutions. The likelihood function for a given tree can be determined according to a Markov chain model. We have developed an integrated computer program that can map these predicted sites onto the protein surface to examine these relationships (DIVERGE at We used the solvent accessibility data from the DSSP to restrict the predicted conserved residues to those located on the protein surface Conservatism of Conservatism The phylogeny-based conservation of residues relies on sequence homology. It is well known, however, that many non-homologous proteins share similar folds. It is therefore highly desirable to study the conservation of residues in proteins based on the structural superimposition of non-homologous proteins. This is a different point of view regarding evolutionary conservatism from that in evolutionary considerations used for developing phylogenetic trees, but is complementary to phylogenetic studies, and the two together actually lead to better predictions of important protein binding residues. In order to obtain insight into the evolutionary conservation of residues in proteins we use the Conservatism of Conservatism method (CoC). The CoC method was developed by Mirny and Shakhnovich 13 for studying evolutionary conservation of residues in proteins with specific folds from the FSSP database. Proteins belonging to different biological families may have similar folds. Mirny and Shakhnovich 13 performed an analysis of conserved residues in several common folds. Residues (20 types) were subdivided into 6 different classes, in which, each class groups amino acids with similar chemical characteristics and frequencies of occurrence at different positions in multiple sequence alignments. The evolutionary conservatism within families of homologous proteins was measured through sequence entropy. Structural superimposition of different families of proteins with similar folds was used to calculate the so-called CoC for all positions of residues within a fold. We have applied this directly to our problem. We first calculated the sequence entropy at each position within a family of related 6 sequences from the HSSP database s() l = p ()log l p () l where p () l is the frequency of the class i of residues (for each of the six classes) at position l in sequence in the multiple sequence alignment. Then we used the FSSP database for the structural alignment. The structural superimposition of different families was used to calculate the conservatism-of-conservatism (CoC) i= 1 i i i

5 5 M 1 m Sl () = s() l where s m (l) is the intrafamily conservatism within the M m = 1 family m at position l. The CoC is the measure of the evolutionary conservation of the specific sites within the protein fold. The CoC method does not distinguish between residues at the protein surface evolutionarily conserved for functional reasons and residues inside the protein core that are conserved because of their importance to the folding process. We use the solvent accessibility data from the DSSP for the unbound molecules to eliminate conserved residues located inside the protein core Data Mining Approaches to Binding Site Identification. Recent advances in machine learning or data mining offer a promising approach to the data-driven discovery of complex relationships in computational biology. In essence, the data mining approach uses a representative data training set to extract complex a priori unknown relationships, e.g., sequence correlates of protein-protein interactions. Examination of the resulting classifiers can help generate specific hypotheses that can be pursued using molecular and biophysical methods. For example, a classifier that is able to identify binding residues on the basis of sequence or structural features can provide insights about sequence characteristics that contribute to important differences in function. The data mining approach to binding site identification consists of the following steps: Identify the surface residues from each protein. Label each residue from each protein as an interface residue or a noninterface residue based on appropriate criteria for defining residues in interaction sites. Use a machine learning algorithm, to train and evaluate a classifier to identify a target amino acid sequence residue as an interface or noninterface residue. Different types of information about the target residue (e.g., the identity and physico-chemical properties of its sequence neighbors, whether or not the target residue is a surface residue) can be supplied as input to the classifier. A variety of machine learning algorithms can be used for this purpose. Evaluate the classifier (typically using cross-validation) on independent test data (not used to train the classifier). Apply the classifier to identify putative interaction residues in a protein given its sequence (and possibly its structure, but not the structure of its interaction partner). Our studies have used the support vector machine (SVM) learning algorithm for this task because SVMs are well-suited for the data-driven construction of high-

6 6 dimensional patterns and are especially useful when the input is a real-valued pattern because SVMs are well-suited for the data-driven construction of highdimensional patterns and are especially useful when the input is real-valued. In addition, algorithms for constructing SVM classifiers effectively incorporate methods to avoid over-fitting the training data, thereby improving its generality, i.e., the performance of the resulting classifiers on test data. Support vector machine algorithms have proven effective in several applications, including text classification, gene expression analysis using microarray data, and predicting whether or not a pair of proteins is likely to interact Threading of Sequences through Structures of Surfaces. In methods , the properties of the protein-protein interface were deduced by concentrating on the sequence information contained in the protein pair under investigation. However, it is well accepted that the physical origin of the specificity of protein-protein interactions comes predominantly from their structures. Thus, in any thorough investigation of protein-protein interactions, it is essential to include information from structural studies. In this section, we emphasize predictions that are based on the structure of the protein-protein interface. It is known that the geometry of the interfaces can undergo substantial relaxation as a result of the binding process and protein docking studies using rigid templates are not fully successful. We adapted methods employed in protein structure recognition to the problem of protein-protein interfaces. In the first stage, structural models for protein-protein interfaces were generated from existing protein databank (PDB) structures by extracting portions of contacts between different protein chains. We found that if we define the interaction region by the criteria that backbone C α atoms on the two interacting chains are less that 15 Å apart, it gives reasonably connected fragments suitable for threading studies. In the second stage, having identified a set of candidate structures, threading studies are to be performed to examine the probability that a given model resembles the real interface. Through these studies, we identified parts of the two proteins that are in the interfacial region and predicted the most probable residue-residue contacts between the two proteins Ensemble Predictions for Combining Results from Multiple Methods. Different approaches to identify binding sites from amino acid sequence information yield different (sometimes contradictory, sometimes complementary) results. In such cases, approaches for combining results from multiple predictors have potential importance. The key idea is that results obtained using different approaches, which we shall call classifiers henceforth, may be correlated (or, more generally, statistically dependent) due to a variety of reasons including the use of common datasets for constructing or tuning classifiers, use of intermediate variables for encoding inputs to the classifiers,

7 and similarities between methods (e.g., SVM, neural networks). Regardless of the source of statistical dependency, the goal is to develop methods for weighting the output of each classifier appropriately for the purpose of producing more accurate predictions. Our method takes as input the binary (True/False) output of each classifier (e.g., SVM, CoC) and produces as output a probability that the residue under consideration is a binding residue, using the outputs produced by each of the classifiers. Algorithms for learning Bayesian (or Markov networks) can be then used to learn the network of dependences and the relevant conditional probabilities General Evaluation measures for assessing the performance of classifiers. Let TP denote the number of true positives -- residues predicted to be interface residues that are actually interface residues, TN the number of true negatives -- residues predicted not to be interface residues that are in fact not interface residues, FP the number false positives -- residues predicted to be interface residues that are not in fact interface residues, FN the number of false negatives -- residues predicted not to be interface residues that actually are interface residues. Let N = TP+TN+FP+FN. Sensitivity (recall) and specificity (precision) are defined for the positive (+) class as well as the negative (-) class. Sensitivity + = TP/(TP+FN), Sensitivity - = TN/(TN+FP), Specificity + = TP/(TP+FP), Specificity - =TN/(TN+FN). Overall sensitivity, specificity, and false alarm rate, correspond to expected values of the corresponding measures over both classes. The performance of the classifier is summarized by the Correlation Coefficient, which is given by ( TP TN FP FN ) /[( TP + FN )( TP + FP)( TN + FP)( TN + FN )] 1/2, ranges from -1 to 1 and is a measure of how predictions correlate with the actual data. 3. Results 7 For the results presented here we have used 7 proteins from the PDB, together with their sequence homologs. These proteins are 1acb_E (which abbreviates the chain E of the protein 1acb), 1fle_E, 1hia_A, 1avw_A, 2sic_E, 3sgb_E, and 4cpa. Figure 1 shows an example of Proteinase B from Streptomyces Griseus (3sgb). The protein chain 3sgb_E is shown in ribbons and the inhibitor 3sgb_I is shown at the top in wireframe. The residues of the chain 3sgb_E that are actual or predicted binding sites are marked with different colors: red - true positives, yellow - false negatives, dark blue - false positives.

8 8 Figure 1. Three-dimensional structure of Proteinase B from Streptomyces Griseus (3sgb) The target protein (proteinase 3sgb_E) is shown in ribbons, with residues color coded as: red - true positives, yellow - false negatives, dark blue - false positives. The inhibitor partner is shown at the top in wireframe Phylogeny We have used the ClustalX multiple sequence alignments of protein sequences to generate phylogenetic trees. We defined as the conserved residues those that were identical at a given position in the multiple sequence alignment in more than 85% alignments (i.e. only 15% mutations or gaps were allowed). We have found that the cutoff value 85% gives the optimal results. Because the phylogenetic trees don t always show much evolutionary diversity frequently there are many residues that satisfy this condition. Additionally we have filtered the results by removing conserved residues hidden inside the protein core, i.e., with low solvent accessibility. Table I show the results of the prediction of the binding sites for for Proteinase B from Streptomyces Griseus (3sgb_E) for five different methods:. 1 Phylogeny, 2 -CoC, 3 - SVM, 4 - Threading and 5 - Consensus. The first line contains information about the position of residues along the chain. The second line shows the amino acid sequence. The actual binding site residues are identified in the sequence line in reverse color on the basis of reduction in surface area upon binding. The next five lines show the predictions of binding sites by different methods. The predicted binding sites are marked by P (phylogeny) (not present for 3sgb_E and therefore the first line below the sequence line is always empty), C (for the CoC prediction), D (for the data mining prediction by SVM), T (for the prediction by threading) and E (for the consensus method). The results from Table I are summarized in Table II that shows the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), overall sensitivity, overall selectivity and the correlation coefficient for each of the methods. Because the number of TPs and FPs for the 3sgb_E are zero the correlation coefficient (that ranges from 1 to +1) for the phylogenetic trees prediction is zero. Table III shows overall average results for the correlation coefficient, sensitivity and selectivity

9 for seven protein complexes studied in this work.. Two kinds of averages are considered < > the numerical average over the 7 proteins in the dataset, and < > n average over the number of residues. TABLE I. Example results for Proteinase B from Streptomyces Griseus (3sgb_E) for five different methods:. 1 Phylogeny, 2 -CoC,, 3 - SVM, 4 - Threading and 5 - Consensus. The actual binding site residues are identified in the sequence line in reverse color. The predicted binding sites are marked by P (phylogeny) (not present for 3sgb_E), C (CoC), D (data mining by SVM), T (threading) and E (consensus) ISGGXXXXXXXXXDAIYSSXXXXTGRCSLGFNVTYYFLTAGHCTDXGATTWWAXXXXXXX TTTT C D E T XXNSARTTVLGTTSGSXSFXXXXPNNDYGIVRYTNTTIPKDGTVGGQDITSAANATVGMA D E C C C CC C D T T E VTRRGSTTXXXXXXXXXXXXGTHSGSVTALNATVNYGGGDVVYGMIRTNXXXXXVCAGDS CC C CCCCCCC CC D D D D T TTTTT TT T E EEEEE E E GGPLYSGXXXXTRAIGLTSGGSGNCSSGGTTFFQPVTEALAYGVSVY CCCCCCCCCC C CCC D D D TTTTTTT TT EEEEEEEE E E Table II. Summary results for Proteinase B from Streptomyces Griseus (3sgb_E) 3SGBE TP TN FP FN Sensitivity Selectivity Correlation 1. Phylog COC SVM Thread Cons

10 10 Table III. Overall average results for selectivity, sensitivity and the correlation coefficient averaged over the 7 proteins in the dataset. Averages over the number of residues are denoted by < > n. Method <Selec> <Selec> n <Sens> <Sens> n <Cor> <Cor> n 1.Phylog CoC SVM Thread Cons Although, phylogenetic predictions are not the best ones, as seen from Table III, utilizing the phylogenetic approach in combination with other methods can lead to significantly improved predictions Conservatism of Conservatism To detect structurally conserved residues that are possible binding sites we used the Conservatism of Conservatism method (CoC) developed by Mirny and Shakhnovich. 13 We used the FSSP (Fold classification based on Structure- Structure assignment of Proteins developed by Holm and Sander, 1993) alignments for the studied set of 7 protein complexes. For each family the HSSP alignments were used to calculate the sequence entropy at each position of the alignment, and the structural alignment (FSSP) was used to calculate the value of CoC. All residues in the protein chain were ranked according to the values of CoC at a given position in the sequence. We have defined as conserved the top 3/4 of the total number of residues ranked according to their value of CoC. We then have filtered the results of the CoC ranking by removing all structurally conserved residues located inside the protein core. The results of the prediction for 3sgb_E are shown in Table I. Each C indicates a predicted binding residue. The overall performance of the CoC method is summarized in the second row of Tables II and III. The CoC method alone is clearly not sufficient to give a successful prediction of the binding sites, and combining this method with other prediction techniques in the consensus method gives improved results Data Mining for Binding Residues We generated a classifier to determine whether or not a surface residue is located in the interaction site using the sequence neighbors of the target residue. An 11-residue window consisting of the residue and its 10 sequence neighbors (5 on each side) was empirically chosen. Each amino acid in the 11 residue window is represented using 20 values obtained from the HSSP profile

11 ( ) of the sequence. The HSSP profile is based on a multiple alignment of the sequence and its potential structural homologues. Each target residue is therefore associated with a 220 (11 20) element vector. The SVM learning algorithm is given a set of labeled examples of the form (X, Y) where X is the 220 element vector representing a target residue and Y is its corresponding class label, either an interface residue or a non-interface residue. The learning algorithm generates a classifier which takes as input a 220 element vector that encodes a target residue to be classified and outputs a class label. Our previous study reported results for classifiers constructed using a combined set of 115 proteins belonging to six different categories of complexes: antibodyantigen, protease-inhibitor, enzyme complexes, large protease complexes, G- proteins, cell cycle signaling proteins, signal transduction, and miscellaneous. Our more recent study trained separate classifiers for each major category of complexes (e.g., protease-inhibitor complexes). In the case of protease-inhibitor complexes, 19 leave-one-out cross-validation experiments were performed. In each experiment, an SVM classifier was trained using a training set of surface residues labeled as interface versus non-interface residues from 18 of the 19 proteins. The resulting classifier was used to classify the surface residues from the remaining target protein into interface residues and non-interface residues. The predicted binding sites for 3sgb_E are marked in Table I by D. The performance of the resulting classifiers for the current test set of complexes is summarized in row 3 of Tables II-III. We always find that the classifiers trained on each category of complexes outperform the classifiers trained on the combined dataset, which is justification for why we are proposing to create separate databases for different categories of proteins Threading of Sequences through Structures of Surfaces We have performed threading studies for the set of 7 protease-inhibitor complexes. From each complex structure, we first extract the interfacial region as described earlier. The residue-residue contacts in the interfacial region are described with contact matrices. Threading was performed with the same algorithm that we used previously for single proteins in the CASP5 competition. For each sequence, threadings were performed with other structures in the database and alignment results used only when the score exceeds a threshold. From the alignments, the residues which are involved in contacts at the interface are identified and a scale generated depending on the number of times the residue is indicated and the strength of the threading score. The predicted binding sites for 3sgb_E by the threading method are marked in Table I by T. Rows 4 of Tables II-III summarize the predictions. It shows that the threadingbased approach is somewhat more successful than other methods in some categories, but the best performance comes by combining it with other prediction methods in the consensus method (rows 5). 11

12 Consensus Results and Discussion Based on the results of the predictions with the four independent methods we have developed a simple consensus method to yield the best prediction. In the present consensus approach, a site is considered to be an interaction site if: at least three independent methods predict it, or any two methods predict the binding site (except the Phylogeny-Threading pair), or if only a single method predicts the binding site this method happens to be SVM. The results of the consensus are marked in Table I by E, while the last rows in Tables II-III summarize the performance of the consensus method. Consensus results are generally improved, and our data (not shown here because of space limitations) show that the improvements in the predictions can be even more remarkable when the individual predictions of all four methods are relatively weak. This shows that an excellent approach for the prediction of the binding sites in protein complexes is to combine diverse prediction methods. Each of the four prediction methods presented in this paper sheds a different light on the conservation and prediction of the binding sites, but none of the methods taken separately is as powerful as a combination of all methods together. The simple consensus method presented here could be significantly improved by using the ensemble predictions with more detailed probabilities. This problem in more detail will be studied by us in the future. Acknowledgments The financial support through the NIH grant 1R21GM is acknowledged by V. Honavar, D.L. Dobbs and R.L. Jernigan References 1. Chothia, C.; Janin, J. Nature 1975, 256, Yan C.; Dobbs, D.; Honavar, V. Springer-Verlag: Berlin, 2003; pp Yan, C., Dobbs, D., and Honavar, V. "Predicting protein-protein interaction sites from amino acid sequence," Iowa State University Benner, S. A., et al.. J.Mol.Biol. 1994, 235, Bock, J. R.; Gough, D. A. Bioinformatics 2001, 17, Teichmann, S. A.; Murzin, A. G.; Chothia, C. Curr.Opin.Struct.Biol. 2001, 11, Valencia, A.; Pazos, F. Curr.Opin.Struct.Biol. 2002, 12, Kini, R. M.; Evans, H. J. FEBS Lett 1996, 385, Jones, S.; Thornton, J. M. J Mol.Biol. 1997, 272, Gallet, X,.et al, R. J Mol.Biol. 2000, 302, Pazos, F.et al. J Mol.Biol. 1997, 271, Rost, B.; Sander, C. Proteins 1994, 20, Mirny, L. A. and Shakhnovich, E. I, J Mol.Biol. 1999, 291,

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

CISC 636 Computational Biology & Bioinformatics (Fall 2016) Predicting Protein-Protein Interactions CISC636, F16, Lec22, Liao 1 Background Proteins do not function as isolated entities. Protein-Protein