Automatic Epitope Recognition in Proteins Oriented to the System for Macromolecular Interaction Assessment MIAX


Genome Informatics 12: (2001) 113

Atsushi Yoshimori, Carlos A. Del Carpio
Toyohashi University of Technology, Ecological Engineering, 1-1 Tempaku, Toyohashi, Aichi, Japan

Abstract

In the present work we evaluate the performance of an algorithm for the automatic recognition of binding sites in proteins and in other macromolecules whose interactions are involved in many cellular and physiological processes. The algorithm combines an unsupervised learning algorithm based on Kohonen self organizing maps, which characterizes the properties of patches of the protein solvent accessible surface, with a filtering algorithm that establishes both the physical boundaries of the patches and the level of contribution of different and distant atoms involved in the interaction. We have found that the algorithm performs extremely well on a set of randomly selected protein complexes, for which the interaction interfaces are extracted and compared with the results of the algorithm. A statistical evaluation of the algorithm is additionally performed by analyzing the degree of hydrophobicity and hydrophilicity of the output patches and comparing it with that of the amino acids constituting the observed interface.

Keywords: epitopes, self organizing maps, hydrophobic cluster, protein interaction

1 Introduction

In recent years the system for assessment of macromolecular interaction MIAX has been under development in our laboratories [2, 3, 4]. The essential characteristics of the system are its potential function, based on the scaled particle theory, and a powerful configuration space search engine that allows flexible docking [4] of monomers that are known to interact but for which only structures in the isolated state are known.
However, in spite of the quality of the results output by the system, and of several modules developed to improve its processing times, such as its parallelization and execution on a parallel computer system, some obstacles still remain to be overcome in order to improve the performance of the system. One of these is the recognition of the putative interaction sites on the surface of the monomers, so as to reduce the high-dimensional configuration space searched by the hybrid algorithm. The present work aims at evaluating an algorithm developed in our laboratories to recognize epitopes in proteins for which the three dimensional structure is the only information available a priori. The algorithm combines an unsupervised learning algorithm, by means of which the characteristics of the surface of the protein (or of any other macromolecule) are identified, with a filtering algorithm based on a two dimensional Fourier transform, which assists in delimiting the clusters of surface atoms constituting the binding sites. The performance of the algorithm is evaluated by first selecting a set of thirty protein complex structures recorded in the PDB. Values expressing the activity as a function of the electrophysical properties of the atoms are then computed at each point of the solvent accessible surface of the molecule. These values are then processed with the learning algorithm in order to identify patterns of activity distributed on the surface. The filtering process is finally applied to delimit and differentiate those distribution patterns, as described in the Methodology section. The results output by the recognition algorithm are then compared with the actual interfaces of the monomers in the complexes. This evaluation is performed in two stages. In the first, two metrics introduced by Young et al. [10] are used to measure the overlap of the predicted and observed interfaces of the complex molecules. In the second, an additional evaluation is introduced in the present work, aimed at an automatic evaluation and corroboration of the results. Through this automatic analysis, the correctness or incorrectness of the results is evaluated taking into account the characteristics of the amino acids to which the atoms composing the patches, or surface clusters of atoms, belong. This analysis is also aimed at elucidating and establishing the underlying rules governing macromolecular interaction, and protein interaction in particular. In the present paper we focus on the identification of protein epitopes, considering fundamentally the hydrophobic characteristics of the protein surface regions.

Among previously reported studies on the identification of protein epitopes and active sites, we can mention the work of Korn and Burnett [7], who calculated hydropathies for subsets of atoms on the surface of the protein that were in contact surfaces and non-contact surfaces. Using a normalized consensus hydropathy scale to express the hydropathy values of atoms, they found that the hydropathy of the contact surfaces was higher than that of the exterior non-contact surfaces in multimeric proteins and 2 protein-protein complexes.
Further development of these ideas was done by Young and coworkers [10], who suggested that clusters of hydrophobic residues could be obtained by adding hydrophobic parameters of residues within a certain distance from a solvent-accessible grid point. Ranking the clusters of surface residues on the basis of the hydrophobicity of their constituent amino acids leads to a list of candidate interaction sites, the most likely being those ranked as highly hydrophobic. By representing the solvent-accessible surface area by dots, Lijnzaad and co-workers [8] developed a method that identifies regions of a surface of arbitrary size and shape consisting solely of carbon and sulfur atoms. Only those patches large enough to exceed random expectation are taken into account as meaningful. The advantage of this method over other techniques is that the size of the region is not constrained by the methodology used to locate it. In fact, application of their technique to eight multimeric proteins shows that, in general, only the largest patches coincided with regions of known biological relevance. Several algorithms have also been reported that predict protein-protein interaction sites using multiple characteristics of surface patches; a comprehensive review is given in [5]. The algorithm presented here, as already mentioned, deals with the identification of hydrophobic clusters on the surface of the molecule that are involved in the expression of the function of the protein.

2 Methodology

Most cellular and physiological processes, from signaling to transcription, are performed and regulated by the interaction of macromolecules. Among them, protein interaction plays the most relevant role in the molecular and cellular recognition involved in the majority of biochemical and biophysical processes in living organisms.
Revealing the principles underlying this type of interaction is important not only for understanding the mechanisms of the biochemical and physiological processes of living organisms, but also for the development of proteins with novel functions directed to several medical and biotechnology fields. Docking protein molecules is a way of assessing protein-protein interactions, and computing the interaction energies offers a quantitative means of measuring the strength of the interaction and the stability of the resulting complex. The multidimensional configuration space, however, poses some difficulties for a rapid search of the optimal complex conformation. A preprocessing of the structures is therefore necessary in order to identify the sites that enhance interaction of the macromolecules, i.e. those parts of the molecules possessing high activity and/or affinity as compared with others. An algorithm to identify these sites, or epitopes, has been developed in our laboratories and described elsewhere [4]. Here we evaluate its performance on a series of protein complexes in order to establish the conditions under which particular properties must be used instead of others to achieve that goal.

2.1 The Algorithm

The algorithm proposed here combines an unsupervised learning algorithm based on a self organizing map (SOM) [6] with a filtering methodology based on a two dimensional fast Fourier transform. The module starts with the input of the molecular 3D structure for which prediction of the binding sites is required. The points of the molecular solvent accessible surface (SAS) are then computed, followed by the calculation of the electrophysical potential (related to activity) at every computed point of the SAS. Here we used the hydrophobic potential on the surface of the molecule, which is computed by the expression [1]:

$$\mathrm{MHP} = \sum_{i} E_{tr_i}\, e^{(r_i - d_i)} \qquad (1)$$

where $E_{tr_i}$ is the transfer energy of atom $i$, tabulated for several atom types, $r_i$ is the radius of atom $i$, and $d_i$ is the distance between atom $i$ and an arbitrary point M. The next step consists in the application of the SOM algorithm to the set of position vectors of the points of the molecular solvent accessible surface. The two dimensional Fourier transform is then applied to the set of neurons, each of which has previously been assigned a value of the physicochemical characteristic computed from the molecular surface points it includes (the average over all the points belonging to that particular neuron) [4].
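As an illustration of Eq. (1), the following minimal sketch sums the contribution of every atom at a given surface point. The function name `mhp_at_point`, the tuple layout of the atoms, and all numerical values are assumptions made for this example only; they are not taken from [1] or from the MIAX implementation.

```python
import math


def mhp_at_point(point, atoms):
    """Hydrophobic potential at `point`, following the form of Eq. (1).

    `atoms` is an iterable of (x, y, z, radius, transfer_energy) tuples.
    This tuple layout is a hypothetical choice for the sketch.
    """
    total = 0.0
    for x, y, z, radius, e_tr in atoms:
        d = math.dist(point, (x, y, z))       # distance d_i from atom i to the point
        total += e_tr * math.exp(radius - d)  # E_tr_i * exp(r_i - d_i)
    return total


# Toy input: two atoms with made-up radii and transfer energies.
atoms = [
    (0.0, 0.0, 0.0, 1.7, 0.5),   # contributes positively (illustrative value)
    (3.0, 0.0, 0.0, 1.5, -0.4),  # contributes negatively (illustrative value)
]
surface_points = [(0.0, 0.0, 1.7), (3.0, 0.0, 1.5)]
potentials = [mhp_at_point(p, atoms) for p in surface_points]
```

Because the contribution of each atom decays exponentially with distance, the value at a surface point is dominated by the atoms closest to it, which is what allows the potential to be mapped patch by patch.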
Hydrophobic clusters are obtained by taking the inverse Fourier transform of the spectrum up to a certain frequency level; components at higher frequencies are eliminated as noise, as is common in signal processing. Finally, the binding regions are displayed for visual inspection, and their constituents at the atomic and amino acid level are extracted from the computed results. The flow of these calculations is shown in Fig. 1, taking as an example the receptor and ligand polypeptides constituting the complex with PDB code 1CHO. The epitope computation process is shown for each monomer composing the complex. In the first step, the surface of each molecular system is shown by plotting all the points constituting the SAS of the molecule. The second step is the projection of the three dimensional position vectors of the SAS points by means of the SOM method. The third step shows the filtering process using the spectral method described, while the fourth illustrates the characterization of a cluster of atoms as a pattern on the SAS, which is the epitope or active site of the molecule. The composition of each cluster is easily extracted from the composition of each patch automatically computed by the algorithm, as shown in the same figure as a final step.

2.2 The Evaluation Procedure

The main objective of the present paper is to evaluate the validity of the algorithm described above, and for this purpose we use an exhaustive statistical analysis of the clusters predicted by our methodology. This analysis reveals many characteristics of the interaction sites in proteins as well as of the potential underlying them. The analysis is performed using two metrics similar to the ones suggested by Young et al. [10]. The fundamental difference is that the areas used in their analysis are replaced in this report by numbers of points on the solvent accessible surface, the two quantities being proportional to each other.

Figure 1: Hydrophobic Cluster Analysis Protocol: (a) Receptor (1CHO). (b) Ligand (1CHO). Each panel shows Step 1 to Step 5 and a comparison with the actual binding site.
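The filtering step of the protocol (the third step in Fig. 1) can be illustrated with a minimal sketch of a two dimensional low-pass filter applied to a grid of per-neuron values. For self-containment the sketch uses a naive discrete Fourier transform written with the standard library rather than a fast Fourier transform, and the names `dft2`, `folded` and `low_pass` as well as the toy grid are assumptions of this example, not part of the original implementation.

```python
import cmath


def dft2(grid, inverse=False):
    """Naive 2D discrete Fourier transform (O(N^4), adequate for small maps)."""
    rows, cols = len(grid), len(grid[0])
    sign = 1 if inverse else -1
    out = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            acc = 0j
            for m in range(rows):
                for n in range(cols):
                    angle = 2 * cmath.pi * (u * m / rows + v * n / cols)
                    acc += grid[m][n] * cmath.exp(sign * 1j * angle)
            # Only the inverse transform is normalized by the grid size.
            out[u][v] = acc / (rows * cols) if inverse else acc
    return out


def folded(k, size):
    """Absolute frequency of DFT index k (indices above size/2 are negative)."""
    return min(k, size - k)


def low_pass(grid, cutoff):
    """Keep only components whose frequency is lower than `cutoff` on both axes."""
    rows, cols = len(grid), len(grid[0])
    spectrum = dft2(grid)
    for u in range(rows):
        for v in range(cols):
            if folded(u, rows) >= cutoff or folded(v, cols) >= cutoff:
                spectrum[u][v] = 0j  # discard high-frequency content as noise
    restored = dft2(spectrum, inverse=True)
    return [[z.real for z in row] for row in restored]


# Toy 4x4 map: a constant level of 1 plus a checkerboard (highest frequency).
noisy = [[1.0 + (-1) ** (m + n) for n in range(4)] for m in range(4)]
smooth = low_pass(noisy, cutoff=2)  # the checkerboard term is removed
```

In this toy case the checkerboard pattern lies at the highest representable frequency, so the filter returns the constant background. How the filtered map is then segmented into clusters is not specified here and would be a separate design choice.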

Accordingly, to perform this statistical analysis of the performance of the method, the following metrics have been adopted and modified to compare calculated and experimental binding site regions:

$$M_1 = \frac{A_{actual} \cap A_{classX}}{A_{actual}} \qquad (2)$$

$$M_2 = \frac{A_{actual} \cap A_{classX}}{A_{classX}} \qquad (3)$$

where $A_{actual}$ stands for the number of points in the actual intermolecular contact surface (region colored in red in Fig. 1, step 5), $A_{classX}$ is the number of points belonging to one of the predicted clusters (regions shown with different colors in Fig. 1, step 5), X is the number identifying that particular cluster, and $A_{actual} \cap A_{classX}$ is the number of points common to both. These values are calculated for all the complexes processed here. Results for the isolated receptors are illustrated in Table 1(a), while those for the isolated ligands are shown in Table 1(b), where results at different levels of frequency for the inverse transform, and for every class obtained, are presented. Since large values of M1 together with low values of M2 represent predicted clusters larger than the actual binding site, while large values of M2 with low values of M1 imply experimental interfaces larger than the predicted clusters, an ideal result is one with comparable and high values for both M1 and M2. A further evaluation procedure is applied here to compare the properties of the hydrophobic clusters identified by the algorithm with those of the amino acids that compose them. The proposed semiquantitative evaluation consists in examining the degree of agreement between the hydrophobicity of the predicted clusters and that of the amino acids involved in the clusters, a common hydrophobicity scale being used for the amino acids. When a hydrophobic cluster is identified as the epitope of a given molecule, the hydrophobic properties of the amino acids constituting the cluster are analyzed.
Determining whether a cluster identified as hydrophobic is constituted mainly of hydrophobic or of hydrophilic amino acids serves not only to ascertain the correctness of the answer in the case of coincidence, but also to endorse the answer when only a weak hydrophobic cluster, or none at all, has been found.

3 Results

The self organizing algorithm was applied to project the multidimensional distribution of the hydrophobic potential onto a two dimensional space, using 2000 iterations for each set of points constituting the surfaces of both monomers (which, for clarity, we call receptor and substrate or ligand in what follows). Data are presented for the six largest clusters obtained with an inverse Fourier transform at a frequency level of 5, that is, when only frequencies lower than 5 are considered in the analysis of the cluster boundaries. As stated previously, the algorithm was tested with a set of 30 protein complexes shown in Table 1, where each entry shows the number of clusters identified (Num class), the values of the M1 and M2 metrics for the six most significant clusters, the area of each of the six classes (A class), the area of the experimental interface (A act), and the SAS of the protein (A all). Table 2 summarizes the analysis of each cluster, recording the name of the complex, the total number of points composing the SAS, the number of points in the actual interface, and the amino acids constituting each observed interface. The M1 and M2 metrics are classified into three levels of magnitude: high, middle, and low, according to the degree of overlap with the actual interface. The high level stands for M1 or M2 values higher than 50%, the middle level for values between 20% and 50%, and the low level for values lower than 20%. Table 3 summarizes the results for the thirty structures processed by the algorithm.
In this table it can be seen that the first hydrophobic cluster for 8 of the receptors and 12 of the substrates coincides with the actual interface with a high degree of overlap.
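The overlap metrics of Eqs. (2) and (3) and the three-level classification described above can be sketched as follows, treating the interface and each cluster as sets of surface point identifiers. The function names and the toy point sets are assumptions of this example; the treatment of values exactly equal to 20% or 50% is also an assumption, since the text does not state it.

```python
def overlap_metrics(actual_points, cluster_points):
    """Return (M1, M2) for one predicted cluster, as fractions in [0, 1]."""
    actual, cluster = set(actual_points), set(cluster_points)
    common = len(actual & cluster)  # points shared by interface and cluster
    m1 = common / len(actual) if actual else 0.0    # Eq. (2)
    m2 = common / len(cluster) if cluster else 0.0  # Eq. (3)
    return m1, m2


def level(value):
    """Classify a metric as 'high' (> 50%), 'middle' (> 20%) or 'low'.

    Boundary values (exactly 0.5 or 0.2) fall into the lower level here,
    which is a choice made for this sketch.
    """
    if value > 0.5:
        return "high"
    if value > 0.2:
        return "middle"
    return "low"


# Toy data: 10 interface points, and a cluster of 8 points sharing 6 of them.
actual = range(0, 10)
cluster = range(4, 12)
m1, m2 = overlap_metrics(actual, cluster)
levels = (level(m1), level(m2))
```

A cluster much larger than the interface would give a high M1 and a low M2, and the reverse for a cluster much smaller than the interface, which is why both values are reported together.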

Table 1: SOM-FT Results for Complexes in Table 2: (a) Receptor. (b) Ligand.

When predictions at the middle level are also counted as correct, the number of correctly predicted clusters increases to 14 for the receptors and 21 for the substrates, representing 46.7% of the total processed receptor structures and 100% for the substrates. These figures can be observed in the histograms of Fig. 2. Table 4 summarizes the percentage of amino acids detected in each interface for both the receptor and the substrate structures. Careful observation of this table reveals some tendencies in the use of particular amino acids in protein complex interfaces. Thus Arg is a preferred amino acid in the interfaces of both receptors and ligands, in spite of its high hydrophilicity, while Leu, a mildly hydrophobic amino acid, is also present with high frequency in both the receptor and ligand interfaces. In addition, Gly, Leu, Phe, Ser, and Tyr are the most frequently used amino acids in receptor interfaces, while Arg, Leu, Lys and Pro are present in the ligand interface with a frequency of more than 8%. A more exhaustive statistical analysis was therefore carried out here, considering the hydrophobic characteristics of the amino acids of the epitopes identified by the algorithm described so far. This analysis consists in ascertaining the nature of the identified clusters as well as of the actual interface, taking into account the hydrophobicity of the amino acids involved in both clusters of atoms. We exemplify the analysis for 4 randomly selected receptors and 4 ligands in Table 2. The histograms in Fig. 3 show the percentage of hydrophobic and hydrophilic amino acids in the actual interfaces of 4 randomly selected receptors and ligands. It is easy to see that the automatic recognition algorithm presented here correctly, and with a high degree of reliability, identifies the hydrophobic region as the interaction site of the protein.
This is most evident when the interface is composed of highly hydrophobic amino acids, as for the receptors 1PPF and 1HNE, for which the recognized cluster contains a high percentage of hydrophobic amino acids. The hydrophobic amino acid content for 1TEC is only six points higher than the hydrophilic content, while the content of hydrophobic amino acids for 1FC2 is lower than the hydrophilic content. Accordingly, in these two cases the reliability of the prediction by the algorithm is proportional to the content of hydrophobic amino acids. A relationship can therefore be established between the content of hydrophobic amino acids and the clusters output by the algorithm: the reliability of the prediction correlates with the percentage of hydrophobic amino acids in the cluster. Whenever the property used to recognize the epitopes is the hydrophobicity and the content of hydrophobic amino acids is also high, the predicted epitope coincides with very high probability with the actual interface. In contrast, when the content is relatively low, the coincidence with the actual interface is only partial. This leads to the conclusion that the algorithm presented here can be used to recognize epitopes based on several properties, allowing the reliability of the calculation to be determined by a simple computation of the property at the amino acid level. Moreover, an automatic algorithm for the evaluation of each cluster recognized by the methodology is straightforward.

Table 2: Protein Complexes Used in the Evaluation of the SOM-FT Algorithm.

Figure 2: Overall prediction rates expressed as a function of M1 and M2 for each of the six clusters obtained using analysis.

Table 3: Summary of the SOM-FT Prediction results: (a) Receptor. (b) Ligand.
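The amino acid level check discussed above, in which the reliability of a hydrophobic cluster is judged from the share of hydrophobic residues it contains, could be sketched as below. The set `HYDROPHOBIC` is a hypothetical binary classification chosen for this example; the text only states that a common hydrophobicity scale is used and does not list which residues count as hydrophobic. The residue list in the usage lines is likewise made up.

```python
# Hypothetical classification: residues treated as hydrophobic in this sketch.
# The paper does not specify the scale, so this set is an assumption.
HYDROPHOBIC = {"ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO", "CYS"}


def hydrophobic_percentage(residues):
    """Percentage of residues in a cluster that belong to HYDROPHOBIC."""
    if not residues:
        return 0.0
    count = sum(1 for name in residues if name.upper() in HYDROPHOBIC)
    return 100.0 * count / len(residues)


def dominant_character(residues):
    """Label a cluster by whichever class of residues is in the majority."""
    pct = hydrophobic_percentage(residues)
    return "hydrophobic" if pct > 50.0 else "hydrophilic or mixed"


# Made-up cluster composition, three-letter residue codes.
cluster_residues = ["LEU", "PHE", "ARG", "SER", "LEU", "ILE"]
share = hydrophobic_percentage(cluster_residues)
label = dominant_character(cluster_residues)
```

With such a function, a cluster predicted from the hydrophobic potential can be flagged automatically when its residue composition does not support the prediction.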

Figure 3: Content of hydrophobic amino acids in actual protein complex interfaces: (a) Receptor. (b) Ligand.

4 Conclusion

We propose a new methodology to identify binding regions on proteins. Although the description so far has concentrated on hydrophobic cluster identification, the methodology can be used with any other characteristic of electrophysical nature known to drive and affect protein-protein interactions, or any other interaction among macromolecules as well as small organic compounds. The results of this analysis indicate that the proposed SOM-FT methodology predicts the binding sites in proteins with high reliability. This is also confirmed by comparing our results with those of the statistical analysis performed by Lo Conte et al. [9], since the tendencies in the characteristics of the components of the clusters are very similar to those reported by them, especially in the protease-inhibitor class, in which many of the proteins used in the validation of our methodology are classified. For all the isolated monomers composing the complexes analyzed, the binding interface corresponds to surface regions of relatively strong hydrophobicity. The analysis does not use any a priori knowledge of the phenomenon besides the three dimensional structure of the proteins involved in the interaction. The methodology has wide applicability, not only within the system for assessment of macromolecular interaction MIAX that we have been developing, but especially in the fields of molecular recognition and drug design, where the information obtained by the methodology can give relevant clues on molecular specificity, selectivity, and reactivity.
Furthermore, the biochemical mechanisms of several life sustaining processes in organisms may be explained as a function of the binding regions identified in this way, giving clues for the identification of defects in protein structure that lead to abnormal expression of the function of these bio-macromolecules.

Table 4: Occurrence of amino acids at observed interaction interfaces.

Acknowledgements

This work was conducted with the support of a Grant-in-Aid for Scientific Research on Priority Areas (C) Genome Information Science from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) (No.: ) and with the cooperation of Klimers Corp, Nagoya, Japan, in the person of its President, Mr. Masaki Kobayashi.

References

[1] Brasseur, R., Differentiation of lipid-associating helices by use of three-dimensional molecular hydrophobicity potential calculations, J. Biol. Chem., 266: ,
[2] Del Carpio, C.A. and Yoshimori, A., MIAX: a novel system for assessment of macromolecular interaction in condensed phases. 1) description of the interaction model and simulation algorithm, Genome Informatics, 10:3-12,
[3] Del Carpio, C.A. and Yoshimori, A., MIAX: a system for assessment of macromolecular interaction. 3) a parallel hybrid GA for flexible protein docking, Genome Informatics, 11: ,
[4] Del Carpio, C.A. and Yoshimori, A., MIAX: a novel paradigm for modeling bio-macromolecular interactions and complex formation in condensed phases, in press.
[5] Kleanthous, C., Protein-Protein Recognition, Oxford University Press,
[6] Kohonen, T., The self-organizing map, Proc. IEEE, 78: ,
[7] Korn, A.P. and Burnett, R.M., Distribution and complementarity of hydropathy in multisubunit proteins, Proteins, 9:37-55,
[8] Lijnzaad, P. and Argos, P., Hydrophobic patches on protein subunit interfaces: characteristics and prediction, Proteins, 28: ,
[9] Lo Conte, L., Chothia, C., and Janin, J., The atomic structure of protein-protein recognition sites, J. Mol. Biol., 285: ,
[10] Young, L., Jernigan, R.L., and Covell, D.G., A role for surface hydrophobicity in protein-protein recognition, Protein Science, 3: , 1994.
