BIOINFORMATICS ORIGINAL PAPER

Vol. 21 no , pages doi: /bioinformatics/bti522

Structural bioinformatics

Prediction of protein-protein interactions using distant conservation of sequence patterns and structure relationships

Jordi Espadaler 1, Oriol Romero-Isart 2, Richard M. Jackson 2 and Baldo Oliva 1

1 Grup de Bioinformàtica Estructural (GRIB-IMIM), Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, Barcelona 08003, Catalonia, Spain and 2 School of Biochemistry and Microbiology, University of Leeds, Leeds LS2 9JT, UK

Received on May 16, 2005; accepted on June 13, 2005
Advance Access publication June 16, 2005

ABSTRACT

Motivation: Given that association and dissociation of protein molecules are crucial in most biological processes, several in silico methods have recently been developed to predict protein-protein interactions. Structural evidence has shown that interacting pairs of close homologs (interologs) usually interact physically in the same way. Moreover, conservation of an interaction depends on the conservation of the interface between interacting partners. In this article we use both the structural similarities among domains of known interacting proteins found in the Database of Interacting Proteins (DIP) and the conservation of pairs of sequence patches involved in protein-protein interfaces to predict putative protein interaction pairs.

Results: We have obtained a large number of putative protein-protein interactions ( ). The list is independent of other techniques, both experimental and theoretical. We separated the list of predictions into three sets according to their relationship with known interacting proteins found in DIP. For each set, only a small fraction of the predicted protein pairs could be independently validated by cross-checking with the Human Protein Reference Database (HPRD). The fraction of validated protein pairs was always larger than that expected from random protein pairs.
Furthermore, a correlation map of interacting protein pairs was calculated with respect to molecular function, as defined in the Gene Ontology database. It shows good consistency of the predicted interactions with data in the HPRD database. The intersection between the lists of interactions of other methods and ours produces a network of potentially high-confidence interactions.

Contact: boliva@imim.es

Supplementary information: BioinformaticsO5_1/Supplementary_material.pdf

To whom correspondence should be addressed.
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Present address: Institut de Física d'Altes Energies, Universitat Autònoma de Barcelona, Bellaterra (Barcelona) 08193, Catalonia, Spain

INTRODUCTION

On the importance of protein-protein interactions

While the amount of genome sequence information increases exponentially, the annotation of protein sequences appears to be lagging behind, both in terms of quality and quantity. Multi-pronged, high-throughput functional genomics approaches are needed to bridge the gap between raw sequence information and the relevant biochemical and medical information. A wide repertoire of techniques must be used, from proteomics, bioinformatics, nucleotide chemistry and cell biology to model organisms, and the same techniques are needed to find targets for modern drug discovery.

Most, if not all, biological processes are regulated through association and dissociation of protein molecules. Moreover, functional units of cells are often complex assemblies of several macromolecules, where proteins play a pivotal role. Clearly, at the molecular level, the function of a protein is determined by the set of molecules it interacts with and the result of the interaction (e.g. chemical reaction, signal transduction, etc.). Therefore, protein-protein interaction networks play an outstanding role in the organization of life (Bornberg-Bauer et al., 2005).
The lower bound on binary protein-protein interactions and functional links in yeast has been estimated to be in the range of (Von Mering et al., 2002), which corresponds to about nine partners per protein. However, it has been estimated that most protein-protein interactions in nature conform to one of about types (Aloy and Russell, 2004).

Current methods to find interactions

A major goal of functional genomics is to determine protein interaction networks for whole organisms. Experimental methods that can tackle the problem globally have been developed, such as the yeast two-hybrid system (Uetz et al., 2000) and affinity purification followed by mass spectrometry (Gavin et al., 2002). These high-throughput methods have led to the creation of databases containing large sets of protein interactions, such as the Database of Interacting Proteins (DIP) (Salwinski et al., 2004), MIPS (Mewes et al., 2004) and the Human Protein Reference Database (HPRD) (Peri et al., 2004). In addition, several in silico methods have been developed to predict protein-protein interactions based on features such as gene context. These include gene fusion (Marcotte et al., 1999), gene neighborhood (Dandekar et al., 1998) and phylogenetic profiles (Pellegrini et al., 1999).

Although all of these methods can be used to predict interactions, their goals are different. Yeast two-hybrid aims to detect direct binary physical binding, while affinity purification aims to detect physical binding in the form of protein complexes. On the other hand, many in silico methods seek to predict functional association, which often implies but is not restricted to physical binding.

The Author Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org

Use of protein structure to predict interactions

An emerging approach in the protein interactions field is to take advantage of structural information to predict physical binding (Aloy and Russell, 2003; Lu et al., 2002). Although the total number of complexes of known structure is relatively small, it is possible to expand this set by considering homologous proteins. It has been shown that in the majority of cases close homologs (>30% sequence identity) physically interact with each other in the same way (Aloy et al., 2003). However, conservation of a particular interaction depends on the conservation of the interface between interacting partners. Studies indicate that the compositions of contacting residues are unique, and that incorporating evolutionary and predicted structural information improves the prediction of protein-protein interactions (Keskin et al., 2004). In general, it has been shown that residues located at the interface tend to be structurally conserved (Ma et al., 2003). A number of studies on a few protein-protein interfaces have addressed the question of which residues are critical at protein-binding sites and which types of sequence motif should be used for protein-protein interaction predictions (Li et al., 2004b).

Working hypothesis

Based on the availability of complexes of known structure, short stretches of contiguous residues (hereafter called residue patches) involved in the interface can be easily determined by analysis of residue contacts. These patches can be converted into profiles by allowing amino acid substitutions based on conservation of chemical properties. The probability of a short stretch of residues in a protein sequence matching a patch profile by chance can be greatly reduced by requiring a simultaneous match of two or more patches.
Therefore, pairs of patches co-occurring in the same protein interface can be used to search for proteins that could display similar interactions. We further combine this local sequence similarity-based method with a domain structure similarity approach to narrow the list of putative new interacting proteins (Nye et al., 2005). The interactions in this list therefore share both a common domain structure and a common interface sequence pattern with the original interaction pairs. The predicted interface is therefore suggestive of a structural model for the putative protein-protein interface.

METHODS

We have used seven databases for our analysis: (1) the Swiss-Prot database for protein sequences (release 45.1) (Apweiler et al., 2004); (2) the SCOP database for the classification of protein structures (release 1.65) (Andreeva et al., 2004); (3) the DIP database for experimentally identified protein-protein interactions (release of January 2004); (4) the SPIN-PP database of protein complexes with known three-dimensional structure, with 855 interface entries with <80% sequence identity; (5) the STRING database, only for predicted protein-protein relationships/interactions (Von Mering et al., 2003); (6) the HPRD database of human protein-protein interactions manually curated by expert biologists through a critical reading of the published literature; and (7) the Gene Ontology (GO) database of protein functions (Hill et al., 2002).

Predicting new interactions by Sequence Search of Interface Patterns (SSIP)

A set of pairs of non-identical interacting proteins (or peptides longer than 20 residues) was extracted from the SPIN-PP database. This is hereafter called the seeding set of protein complexes.
The protein complexes of the seeding set were grouped as follows: (1) electron transfer; (2) hydrolases; (3) immune system; (4) isomerases; (5) kinases and phosphorylases; (6) lectins; (7) ligases; (8) lyases; (9) membrane spanning; (10) oxido-reductases; (11) oxygen transport; (12) proteases; (13) toxins; (14) transferases; (15) viral derived interactions; and (16) unclassified. Protein complexes were allowed to belong to more than one group.

For a protein complex, we defined the distance between a pair of residues, one from each interacting protein, as the minimum distance between the pairs of atoms of the two residues. This definition determines the contact region or interface between two proteins in a complex composed of at least two chains (Tsai et al., 1996). The interface of the interaction of two proteins under a specific cut-off was defined as the set of residues of the two proteins with a distance below that cut-off. This necessarily yields unordered fragments, as well as isolated residues. We defined a patch of residues of one protein in an interaction as a set of more than five contiguous residues, most of them belonging to the interface of the interaction. Clearly, a patch of residues depends on the cut-off used to define the interface. Some residues may not belong to the interface and still be sequentially surrounded by residues that belong to the interface. One or two residues (located at i and i + 1 in the protein sequence) belong to a patch if they are surrounded by two residues (located at i − 1 and i + 1 or i + 2 in the protein sequence) that belong to the same interface and the total length of the patch is larger than five residues. A characteristic interface for each complex of the seeding set was obtained by choosing the minimum atom-atom cut-off distance (between 2 and 5 Å) that produced at least two separated patches of residues in each interacting protein.
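The interface and patch definitions above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the input format (a dictionary from residue number to a list of atom coordinates), the function names, the majority test used for "most belonging to the interface" and the 0.5 Å scanning step are all assumptions made for illustration.

```python
import math
from itertools import product


def min_distance(atoms_a, atoms_b):
    """Minimum atom-atom distance between two residues (lists of xyz tuples)."""
    return min(math.dist(p, q) for p, q in product(atoms_a, atoms_b))


def interface_residues(chain_a, chain_b, cutoff):
    """Residues of chain_a closer than `cutoff` to at least one residue of chain_b."""
    return {
        i for i, atoms_i in chain_a.items()
        if any(min_distance(atoms_i, atoms_j) < cutoff for atoms_j in chain_b.values())
    }


def patches(interface, min_len=6, max_gap=2):
    """Group interface residues into patches of more than five contiguous positions.

    One or two non-interface residues flanked by interface residues are bridged
    (max_gap=2). Requiring that more than half of the span is interface is an
    assumed reading of "most belonging to the interface".
    """
    runs, current = [], []
    for pos in sorted(interface):
        if current and pos - current[-1] - 1 > max_gap:
            runs.append(current)
            current = []
        current.append(pos)
    if current:
        runs.append(current)
    result = []
    for run in runs:
        span = run[-1] - run[0] + 1
        if span >= min_len and 2 * len(run) > span:
            result.append((run[0], run[-1]))
    return result


def characteristic_interface(chain_a, chain_b, cutoffs=(2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0)):
    """Smallest cut-off (assumed 0.5 A steps) giving >= 2 patches in each protein."""
    for cutoff in cutoffs:
        patches_a = patches(interface_residues(chain_a, chain_b, cutoff))
        patches_b = patches(interface_residues(chain_b, chain_a, cutoff))
        if len(patches_a) >= 2 and len(patches_b) >= 2:
            return cutoff, patches_a, patches_b
    return None
```

For example, `patches({3, 4, 5, 7, 8, 20, 21, 30, 31, 32, 33, 34, 35})` bridges the single missing residue at position 6 and discards the short run at 20-21, leaving two patches.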
For each patch of the characteristic interface, a profile was built by multiple alignment of 100 artificial sequences obtained by random sequence substitutions. These substitutions were not constrained by substitution matrix weights, but by rules based on the chemical properties of the residue side chains. The groups of amino acids interchangeable for substitutions were as follows: (1) negative charge: Glu and Asp; (2) positive charge: Lys, Arg and His; (3) polar (hydrogen bonding): Ser, Thr, Asn and Gln; (4) non-polar (aliphatic): Ala, Val, Leu, Ile and Met; and (5) non-polar (aromatic): Phe, Trp and Tyr. Hidden Markov models (HMMs), hereafter referred to as HMM-patches, were built from the artificial multiple sequence alignment of each patch. The program HMMER was used to build the HMM-patches (Eddy, 1998).

The set of HMM-patches of a protein in one complex (i.e. formed by proteins A and B) was used to search sequences from Swiss-Prot. Sequences were required to match at least two HMM-patches of a protein under a threshold P-value of . Therefore, the P-value to find a particular protein with at least two patches was < . The pairs formed by proteins A′ and B′, where A′ and B′ were found using the HMM-patches of A and B, respectively, constituted a set of potentially interacting protein pairs derived from the seeding complex formed by A and B. In order to further refine the prediction, only those biologically relevant pairs formed by proteins from the same species were considered, and the rest of the pairs were disregarded.

For each new potential interaction, the sequences matching the alignment with the HMM-patches were used to generate new artificial multiple alignments. These were used to build new HMM-patches and repeat the search. The procedure was iterated until no new pairs of proteins were added to the set of potentially interacting protein pairs derived from a seeding couple, AB.
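As an illustration of how the artificial sequences for one patch might be generated, the following Python sketch applies the five substitution groups listed above. The function name, the fixed random seed and the choice to leave residues outside the five groups (Gly, Pro, Cys) unchanged are assumptions; the text does not say how such residues were treated.

```python
import random

# Substitution groups from the text, in one-letter code.
GROUPS = [
    "ED",     # negative charge: Glu, Asp
    "KRH",    # positive charge: Lys, Arg, His
    "STNQ",   # polar (hydrogen bonding): Ser, Thr, Asn, Gln
    "AVLIM",  # non-polar (aliphatic): Ala, Val, Leu, Ile, Met
    "FWY",    # non-polar (aromatic): Phe, Trp, Tyr
]
SUBSTITUTES = {aa: group for group in GROUPS for aa in group}


def artificial_alignment(patch, n_sequences=100, seed=0):
    """Return n_sequences variants of `patch`, each position drawn from its group.

    Residues that belong to no group are kept as they are (an assumption).
    All variants have the same length, so they form an ungapped alignment.
    """
    rng = random.Random(seed)
    return [
        "".join(rng.choice(SUBSTITUTES.get(aa, aa)) for aa in patch)
        for _ in range(n_sequences)
    ]


if __name__ == "__main__":
    for variant in artificial_alignment("DKSAFG", n_sequences=5):
        print(variant)
```

The resulting equal-length sequences could then be written to an alignment file and passed to a profile HMM builder such as HMMER; that step is omitted here.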
Table 1. Total number of predicted interactions by SSIP

Columns: Functional group; Seed-pairs; SPIN-PP pairs; Homodimers: Total (Entries, DIP), Homolog (Entries, DIP), No homolog (Entries, DIP); Heteromers: Total (Entries, STRING, DIP), Homolog (Entries, STRING, DIP), No homolog (Entries, STRING, DIP)

Rows (functional groups): Electron transfer; Hydrolases; Immune; Isomerases; Kinases phosphorylases; Lectins; Ligases; Lyases; Membrane spanning; Oxido reductases; Oxygen transport; Proteases; Toxins; Transferase; Unclassified; Viral derived

The results are shown in each row for the groups from SPIN-PP. The number of interactions used as seed is indicated in the first column. The total number of predictions obtained by SSIP is indicated in the columns labeled Entries. For homomers and heteromers the total number of coincidences with DIP interactions is indicated, while only for heteromers is it possible to corroborate the prediction in STRING.

Predicting new interactions by Structure Relationship (SR)

We use the hypothesis that homologous sequences share similar interactions and, therefore, the set of interacting partners of a given protein is enriched by its homologs (Espadaler et al., 2005). A protein interaction network can be represented by a graph with proteins as nodes and protein interactions as edges. In such a graph, the set of proteins connected to protein X (i.e. physically interacting with X) is named the partners of X. Further, we define successive levels of partnership: the set of partners of X is named partners of X at level 1, the set of partners of the partners of X at level 1 forms the set of partners at level 2, and so on. Given the commutative relation of the interactions (i.e. if B is found in the set of partners of A, then A is found in the set of partners of B), protein X should be in the set of partners of itself at level 2 (see Supplementary Material). Therefore, given that homologous proteins perform similar functions associated with similar interaction partners, the sets of partners of protein X at even levels contain more sequences homologous to protein X than a randomly selected set of sequences of the same size. Similarly, the set of partners of protein X at odd levels should contain proteins that would potentially interact with X.

Pairs of potential orthologs of known interacting protein partners from a given organism are identified as potentially conserved interactions, or interologs, in a second organism (Matthews et al., 2001). We extended this assumption by considering all possible relatives of two proteins, where we defined as relatives those proteins that share similar fold and function. We used the SCOP database to assign a fold to as many sequences as possible in DIP. Fold, superfamily and family domain codes of SCOP were assigned to a total of 4324 proteins in DIP that could be matched by BLAST to a protein in SCOP, covering one-sixth of all proteins in DIP (i.e. group DIP-SCOP). More precisely, one or more domain codes were assigned to a protein sequence in DIP when the alignment between the two sequences had an E-value < 10^−8 over at least 75% of the residues in the SCOP domain. All proteins sharing at least one domain family code with another protein X are defined as homologs of X.

The algorithm to obtain potential new interactions on the basis of structure involves four steps. First, we search for all the relatives (using SCOP codes for family) of a pair of interacting proteins, named A and B, extracted from DIP. Second, we consider that the pair of proteins A′ and B′, relatives of A and B, respectively, is a potential interaction.
Third, we increase the number of potential interactions by considering all protein pairs formed by the combination of the relatives of A (including A′) with all partners of A at odd levels (including B) and their relatives (including B′). Similarly, the partners of B at odd levels (including A), and all their relatives (including A′), can potentially interact with any of the relatives of B (including B′). And fourth, redundant couples (independent of the order of the proteins forming the pair) were removed. In order to avoid a high number of false positives we only analyzed the odd levels 1 and 3 of structure-based partnership.

Combined potential interactions

The intersection of the two sets of potential interactions, those predicted by SSIP and those predicted by SR, was separated into three lists of pairs. These are designated as follows: (I1) pairs formed by two interacting proteins from DIP; (I2) pairs formed by two proteins, A and B, each one sharing at least one domain of the same family with one of the proteins from a pair of interacting proteins in DIP, C and D (i.e. A and C, and also B and D, have a domain of the same family, respectively); and (I3) pairs formed by proteins that cannot be related to a pair of interacting proteins in DIP through a domain of the same family. The first set (I1) corresponds to the most probable interactions. The second set (I2) corresponds to potential interactions found by using the first structure-based partnership level, and the third set (I3) to potential interactions that could only be found through the third structure-based partnership level. All sets were found by SSIP.

Validation of the interactions by HPRD database and prediction of interologs

To further evaluate the method we analyzed the sets of interacting pairs formed by two human proteins of groups I1, I2 and I3. These sets were compared with the database of manually confirmed interactions of human proteins from the HPRD.
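The partnership levels and the family-based expansion used by the SR method can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' code: the function names, the input formats (a list of interacting pairs and a dictionary from protein to its set of SCOP family codes) and the exclusion of self-pairs are assumptions.

```python
from collections import defaultdict


def build_graph(pairs):
    """Undirected interaction graph: protein -> set of direct partners."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def partners_by_level(graph, protein, max_level=3):
    """Level 1 = partners of X; level k+1 = partners of the level-k partners.

    Because interactions are symmetric, X itself reappears at level 2.
    """
    levels, current = {}, {protein}
    for level in range(1, max_level + 1):
        current = set().union(*(graph.get(p, set()) for p in current))
        levels[level] = current
    return levels


def relatives(protein, families):
    """Proteins sharing at least one family code with `protein`, plus itself."""
    codes = families.get(protein, set())
    return {p for p, c in families.items() if c & codes} | {protein}


def candidate_pairs(graph, families, protein, odd_levels=(1, 3)):
    """Relatives of `protein` paired with its odd-level partners and their relatives.

    Pairs are unordered (frozenset), which removes redundant couples.
    Self-pairs are skipped here for simplicity (an assumption).
    """
    levels = partners_by_level(graph, protein, max(odd_levels))
    partners = set().union(*(levels[k] for k in odd_levels))
    partner_side = set().union(*(relatives(p, families) for p in partners))
    return {
        frozenset((x, y))
        for x in relatives(protein, families)
        for y in partner_side
        if x != y
    }
```

With the toy network X-A, A-B, B-C, protein X has partners {A} at level 1, {X, B} at level 2 and {A, C} at level 3, so only levels 1 and 3 contribute candidate partners for X.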
Finally, the successfully predicted interactions in human proteins can be expanded to the rest of the orthologous genes in other species, increasing the number of predictions by putative interologs. It has to be noted, however, that we cannot calculate the accuracy of the prediction in this evaluation, as a large number of predicted interactions have not been tested and should not be considered false. Therefore, we have compared the molecular function of the pairs of interacting proteins in the HPRD database and in the predicted sets as a measure of accuracy (Von Mering et al., 2003). The function of proteins was defined at level two of the GO molecular function ontology.

RESULTS

Protein-protein interactions predicted by sequence search

A total of putative interactions were obtained with sequences after searching with HMM-patches using the SSIP method described above. Of the sequences a total of 8552 are also defined as nodes in the DIP database. Figure 1 shows the normalized number of heteromer and homodimer interactions predicted for each functional group defined in SPIN-PP. The hydrolases, proteases and transferases show a larger normalized number of non-homologous heteromer interactions than the other groups. This is not necessarily related to the number of SPIN-PP or seed pairs (cf. immune, oxido-reductases, toxins and viral derived). The predictions obtained from the immune system group formed the largest set corroborated by DIP (11 out of 748 predictions). Also, a total of 1603 out of 7939 predicted non-homologous heteromer interactions from the group of transferases were independently predicted by STRING. Table 1 shows the distribution of predicted pairs according to each group of the seeding set. The SSIP method predicts 46 interactions that are also described in DIP.
Most of these interactions are found between pairs of proteins homologous to an interacting couple in the seeding set (39 out of 46), showing that the method found mostly putative interologs, which are pairs of potential orthologs of known interacting protein partners with conserved interactions. The method predicts 3885 interactions that are also found in the STRING database (Von Mering et al., 2003); however, in this case 3760 belonged to pairs without homology to any of the pairs in the seeding set. These protein-protein interactions can be considered to have been arrived at independently of the gene context methods used by STRING.

It is not possible to perform the iterative method of prediction (SSIP) for all protein pairs from SPIN-PP, because the dataset of protein sequences is limited, and the method did not necessarily find new putative patterns for all binding sites. Even though dimers are removed from the seeding set, it is still possible to predict homodimers. This implies that the sequences of the proteins of a predicted dimer were matched by remote similarity to the HMM-patches of a pair of interacting proteins in the seeding set that were heteromers.

Protein-protein interactions predicted by structure

For the method of predicting new interactions by Structure Relationships (see Methods section), SCOP codes could be assigned to proteins in DIP. This DIP-SCOP subset covers one-sixth of all proteins in DIP. Each pair of proteins from DIP was expanded to interactions, based on the known structure of at least one of the proteins of the pair. If the structure of one of the proteins from the interacting pair was not known, its expansion was not performed. Consequently, we obtain interacting pairs formed by: (i) the relation between two sets of proteins (i.e. using two SCOP family codes, with N and M proteins, respectively) that produced N × M putative interacting pairs; (ii) the relation between one single protein and one set of proteins (i.e. with N proteins of the same family) that produced N putative interacting pairs; or (iii) the relation between two single proteins (i.e. the expansion could not be performed for any of the proteins that form the pair). This produced a total of 1220 putative interacting protein-sets (formed by pairs of protein families, or one protein family and a single protein, or two single proteins), with a confirmed protein-protein interaction in DIP.

The intersection of protein pairs predicted by the SR and SSIP methods was computed and separated into three sets (see Methods section): (I1) 41 pairs formed by two interacting proteins from DIP (see Supplementary Material); (I2) 1220 protein-sets (given above) that yield protein pairs formed by two proteins, each one sharing at least one domain of the same family with one of the proteins from a pair of interacting proteins in DIP; and (I3) protein-sets yielding protein pairs formed by proteins that cannot be related to a pair of interacting proteins in DIP through a domain of the same family. The interactions of set I1 are interactions corroborated experimentally, with a known family relationship in SCOP and sequence patterns that match the requirements of a seeding interaction from SPIN-PP. The predicted interactions set I2 contains a larger number of interactions than I1; however, few have been confirmed experimentally. Although set I1 does not make any new prediction, it can be used as a reference for the method. More precisely, we can check the examples by modeling the protein complex to confirm their validity.

Fig. 1. Distribution of predicted interactions of heteromers (a) and homomers (b) by SSIP. The total number of interactions is divided by the number of original interactions from the SPIN-PP database. The normalized number of predictions is shown in bars for each group of seeding pairs from SPIN-PP, for interacting pairs of proteins homologous to the seeding pair (black bar) and for predictions of non-homolog interactions (gray bar).
An example of a predicted interaction

The predicted complex between Actin-2 and Profilin in Drosophila (in set I1) can be modeled on that of the structurally characterized orthologs in Bos taurus. Of the three binding sites of B. taurus, two are recognized in the sequences of Actin-2 and Profilin in Drosophila (Fig. 2). The third binding site is half lost with a gap; however, our sequence search was able to detect the location of the binding sites. The putative structure of Actin-2 and Profilin was predicted by means of homology using PSI-BLAST (Altschul et al., 1997), with the corresponding coincidence of fold and family. Also, the interaction was experimentally found by yeast two-hybrid in Drosophila melanogaster. Therefore, this interaction is found in set I1 with our method.

Fig. 2. Structure of the interaction between Actin-2 () and Profilin (PROF_DROME) from Drosophila. Sequence alignment of Actin-2 (a) and Profilin (b) and the HMM profiles from chains A and P of 2btf (β-actin-profilin complex from Bos taurus), respectively. The aligned sequences of the binding sites of Actin-2 and Profilin are indicated in background colors according to the interaction: cyan, green and yellow. (c) Superposition of the X-ray structure of the β-actin-profilin complex (actin in orange, profilin in cyan) and the model of the complex between actin (green) and profilin (red) from Drosophila, obtained with the program MODELLER (Eswar et al., 2003). The superposition was obtained with the program STAMP (Russell and Barton, 1992).
Owing to the large evolutionary distance between bovine and fly, the sequences of the binding sites have undergone significant changes. Still, we can use our method to corroborate this interaction, as we have used loose restrictions on the E-value and required the presence of at least two binding sites. Consequently, the structure of the complex of Actin-2 and Profilin from Drosophila is modeled using the structure of the complex in B. taurus (Fig. 2). The final complex structure is obtained by superimposition of the two models on their respective template structures in the complex (Fig. 2c). Although further analysis is required to corroborate the binding energy, we can readily accept the interaction and the model, as these have been experimentally confirmed. The numbers of interactions in sets I2 and I3 are too large to be treated similarly. Therefore, we need other methods to validate the putative interactions. Here, we compare the results with additional databases and check the consistency of the predictions.

Validation of the interactions by HPRD database and prediction of interologs

In order to validate the sets of predicted interactions, the results were corroborated with the set of non-dimer protein-protein interactions of the HPRD database. The percentages of validated interactions found in I1, I2 and I3 are indicated in Table 2. Each percentage is compared with the result of: (i) using only sequence patterns (SSIP) for the prediction, (ii) pairs predicted by SSIP that are also found in the DIP database (SSIP-DIP) and (iii) pairs predicted by SSIP that are also found in the STRING database (SSIP-STRING). We found 2636 human proteins for which interactions were predicted by SSIP. A total of 3044 interactions in HPRD involve a pair of proteins from this set. Only 0.17% or 127 out of interactions were validated in HPRD, with <5% coverage.
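The figures quoted for SSIP can be checked with a short calculation. The Python sketch below uses only the counts given in the text (2636 human proteins, 3044 HPRD interactions among them, 127 validated predictions and the stated 0.17% validation rate). It assumes that the random baseline is taken over all unordered pairs of distinct proteins in that set; the variable names are illustrative.

```python
from math import comb

n_proteins = 2636        # human proteins with SSIP predictions (from the text)
hprd_interactions = 3044  # HPRD interactions between proteins of this set (from the text)
validated = 127          # SSIP predictions confirmed in HPRD (from the text)
ssip_rate = 0.0017       # 0.17% of SSIP predictions validated (from the text)

# Assumption: random pairs are drawn from all unordered pairs of distinct proteins.
possible_pairs = comb(n_proteins, 2)

# Chance that a randomly chosen pair is a known HPRD interaction.
random_rate = hprd_interactions / possible_pairs

# Fraction of the known HPRD interactions recovered by SSIP.
coverage = validated / hprd_interactions

# How much more often an SSIP prediction is validated than a random pair.
enrichment = ssip_rate / random_rate

print(f"possible pairs: {possible_pairs}")
print(f"random rate:    {100 * random_rate:.2f}%")
print(f"coverage:       {100 * coverage:.1f}%")
print(f"enrichment:     {enrichment:.1f}x")
```

Under these assumptions the random rate rounds to 0.09% and the coverage is a little over 4%, consistent with the "<5% coverage" stated above, so SSIP predictions are validated roughly twice as often as random pairs.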
However, the probability of finding the 3044 interactions of HPRD in possible pairs (i.e /2) is 0.09%, while the probability by SSIP was increased to 0.17%. These results show the improvement in coverage with respect to a random selection of interactions. Nevertheless, the best-validated result is obtained with the combination I1, which gives the worst coverage. An intermediate solution is obtained with set I2 (0.53% of interactions validated, with a coverage of 1.8%).

Since there is no definitive measure for validating predictions, as in previous experimental and theoretical studies of protein interactions, we used the tendency for interacting proteins to belong to the same GO functional class as a measure of reliability (Von Mering et al., 2003). Our results were compared to the distribution of pairs in the HPRD database of manually confirmed interactions of human proteins. The correlation map of interacting pairs was calculated with respect to molecular function as defined in GO for each protein. The distributions in Figure 3 show the consistency of the prediction (in sets I2, I3 and SSIP-STRING) with the data in the HPRD database. The number of pairs containing proteins with enzymatic activity is large, because the seeding set used to obtain the prediction contained a large number of enzymes. Nevertheless, the proportion of pairs containing at least one protein with catalytic activity is similarly distributed in the HPRD database, with few exceptions (i.e. the interactions between enzymes and structural proteins and signal transduction activity, which have a larger number of predicted interactions than the number of known interactions in the HPRD database). In addition, the set of interactions correctly predicted among human protein-protein interactions is expanded to other species.

Table 2. Comparative results of predictions of interactions

Columns: Percentage of validated in HPRD; Percentage coverage of HPRD; Interactions found in HPRD; Predicted interologs; Total of interactions; Total of interactions of human proteins; Total sequences; Total human sequences of HPRD

Rows: I1; I2; I3; SSIP-STRING; SSIP-DIP; SSIP

The percentage of interactions validated by the database of known human interactions, HPRD, and its coverage are indicated in the first two columns. This is calculated with the corresponding number of pairs of interacting human proteins of the human sequences from HPRD found in each set.

Fig. 3. Density of interactions according to molecular function. Density of protein interactions in the HPRD database (a), SSIP-STRING set of interactions (b), I2 (c) and I3 (d). The distribution of protein-protein interactions is calculated as the ratio of interaction pairs in the square over the total number of protein pairs possibly formed by combination of the proteins in the square. Each square compares sets of proteins with molecular functions defined as in GO: motor activity (M), catalytic activity (C), signal-transduction (ST), structural molecule (S), transporter (T), enzyme-regulator (ER), transcription (TC), translation (TL) and unknown activity (U). The scale of grays shown on the left indicates the intervals of protein-protein interactions over possible pairs. The total number of interactions in each set is shown in Table 2, while the HPRD database contains protein-protein interactions.
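The density shown in each square of the functional map, defined as the number of interacting pairs divided by the number of protein pairs that could be formed, can be sketched in Python as follows. The function name and input formats are hypothetical, and the sketch assumes that every protein carries exactly one functional class and that the interaction list contains unique, unordered pairs of distinct proteins. In practice a protein may be annotated with several GO functions, which this simplification ignores.

```python
from collections import Counter
from itertools import combinations_with_replacement


def interaction_density(interactions, function_of):
    """Density of interactions for every pair of functional classes.

    interactions: iterable of (protein_a, protein_b) pairs.
    function_of:  dict mapping each protein to one functional class label.
    Returns a dict keyed by a sorted (class_1, class_2) tuple.
    """
    class_sizes = Counter(function_of.values())

    # Count observed interactions per unordered pair of classes.
    observed = Counter()
    for a, b in interactions:
        key = tuple(sorted((function_of[a], function_of[b])))
        observed[key] += 1

    density = {}
    for f, g in combinations_with_replacement(sorted(class_sizes), 2):
        if f == g:
            # Pairs within one class: n choose 2.
            possible = class_sizes[f] * (class_sizes[f] - 1) // 2
        else:
            # Pairs across two classes: n times m.
            possible = class_sizes[f] * class_sizes[g]
        density[(f, g)] = observed[(f, g)] / possible if possible else 0.0
    return density


if __name__ == "__main__":
    functions = {"p1": "C", "p2": "C", "p3": "ST", "p4": "ST", "p5": "C"}
    pairs = [("p1", "p2"), ("p1", "p3"), ("p4", "p2")]
    for square, value in interaction_density(pairs, functions).items():
        print(square, round(value, 3))
```

Comparing the resulting maps for a predicted set and for a reference database is one way to see whether the predictions are distributed across functional classes in a similar manner.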
As only minimal progress has been made in mapping the human proteome using high-throughput screens, the transfer of interaction information within and across species has become increasingly important. This transfer is obtained by assigning pairs of interactions of orthologous genes: if two human proteins (A and B) interact, the products of their orthologous genes in another species (A′ and B′) may also interact (this is known as an interolog). According to this hypothesis, we searched for all protein pairs in sets I1, I2 and I3 formed by interologs of a pair of interacting proteins in HPRD. This yields 8, 184 and 134 interactions in I1, I2 and I3, respectively (Table 2). These predictions are annotated as predicted interologs, and they are predicted both by our method and by the interologs approach. The total number of interactions found in HPRD and predicted in I1, I2 and I3 is 108, while the total predicted in SSIP-DIP and SSIP-STRING and found in HPRD is only 10. The coverage of the prediction in HPRD joining the sets I1, I2 and I3 is 10 times larger than that joining SSIP-DIP and SSIP-STRING (Table 2). Also, the number of interologs predicted using sets I1, I2 and I3 is almost 10 times larger than that predicted when using SSIP-DIP and SSIP-STRING.

DISCUSSION

A widely adopted methodology is to use the knowledge of the location of binding sites to discover protein-protein interactions. Recent studies have been devoted to the characterization and extraction of sequence motifs involved in binding sites (Li et al., 2004a) or of physical properties involved in the area of interaction (Li et al., 2004b). In the present work, we have extracted sequence motifs involved in the interaction of known complexes. We have transformed these motifs into HMM-patches, according to the conservation of the hydrophobic/hydrophilic relationships between residues. We have used these profiles to search for sequences with remote homology that contain two features: (1) more than one interface sequence motif and (2) a degree of structural similarity to the original proteins involved in the interaction.

In the present work, we have obtained a large number of putative protein-protein interactions. The lists obtained are independent of other techniques, experimental and theoretical. Consequently, the intersection between the lists of interactions of these methods and ours produces a network of high-confidence interactions. We cannot independently corroborate whether these predictions are correct or not, except in some specific cases, such as those predicted for set I1 or where independent experimental confirmation exists for a protein interaction in the literature.
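Cross-checking a predicted set against an independent database, as done with HPRD for Table 2, reduces to a few ratios over the same overlap. The sketch below is a hedged illustration, not the authors' procedure: it assumes that only predictions whose two proteins both occur in the reference database are counted, and that the random expectation is the number of reference interactions divided by all possible pairs of reference proteins. The function names and toy identifiers are hypothetical.

```python
def normalize(pairs):
    """Turn (a, b) tuples into unordered pairs, dropping self-pairs."""
    return {frozenset(p) for p in pairs if len(set(p)) == 2}


def validation_and_coverage(predicted, reference):
    """Return (validated %, coverage %, random expectation %).

    validated %: share of comparable predictions present in the reference.
    coverage %:  share of reference interactions recovered by the predictions.
    random %:    share of all possible reference-protein pairs that interact.
    """
    ref = normalize(reference)
    ref_proteins = set().union(*ref) if ref else set()

    # Assumption: only predictions between proteins known to the
    # reference database can be validated, so the rest are left out.
    pred = {p for p in normalize(predicted) if p <= ref_proteins}
    found = pred & ref

    validated_pct = 100.0 * len(found) / len(pred) if pred else 0.0
    coverage_pct = 100.0 * len(found) / len(ref) if ref else 0.0

    n = len(ref_proteins)
    possible = n * (n - 1) // 2
    random_pct = 100.0 * len(ref) / possible if possible else 0.0
    return validated_pct, coverage_pct, random_pct


if __name__ == "__main__":
    toy_reference = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
    toy_predicted = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("A", "Z")]
    print(validation_and_coverage(toy_predicted, toy_reference))
```

A prediction set is informative when its validated percentage exceeds the random expectation; validated percentage and coverage typically trade off against each other, as seen for sets I1 and I2.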
As the sets of interactions are predicted according to a seeding complex of known structure, we can further develop a test based on the comparative modeling of the proteins involved in the interaction and the construction of the putative interface. This was done with an example for which the interaction has been experimentally proven. The result helps in understanding not only the difficulties but also the advantages of this final corroboration. Unfortunately, this cannot be done for the sets I2 and I3 with more than predictions, unless an automatic procedure for construction and verification is performed. Therefore, we adopted a different approach to test the quality of the predictions and to increase the confidence in some of the predicted interactions: we analyzed the total number of predictions by comparison with other databases of protein-protein interactions. This comparison was performed with respect to the HPRD database. Our predictions, derived from the DIP database, are obtained by a different method from the interactions collected in HPRD. Therefore, the number of common interactions in our predicted sets and in HPRD was understandably small. On the other hand, it is expected that most interactions are performed by proteins with similar function and/or location, as many topological studies of the interactome graph have shown (Bader et al., 2004; Deng et al., 2004). Indeed, we also obtained a similar correlation map for protein molecular function, as described in GO, with the interactions of HPRD and our predicted interactions.

Yet another method of prediction is based on the transfer of the interaction to other species by homology (interologs). We expect that the rate of false interactions produced in the transfer from human to another species will be reduced if the interaction is also predicted in the sets I1, I2 or I3. We obtained a total of 326 protein-protein interactions with high expectations of their being true.
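The interolog filter described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the ortholog table is assumed to be a plain dictionary from a human protein to (species, ortholog) tuples, and the names `transfer_interologs` and `supported_interologs` are hypothetical.

```python
def transfer_interologs(reference_pairs, orthologs):
    """Map interacting human pairs (A, B) to candidate pairs (A', B').

    orthologs: dict human_protein -> iterable of (species, ortholog_id).
    Both orthologs must come from the same species to form an interolog.
    """
    interologs = set()
    for a, b in reference_pairs:
        for species_a, a2 in orthologs.get(a, ()):
            for species_b, b2 in orthologs.get(b, ()):
                if species_a == species_b and a2 != b2:
                    interologs.add(frozenset((a2, b2)))
    return interologs


def supported_interologs(reference_pairs, orthologs, predicted_pairs):
    """Keep only interologs that an independent prediction set also contains."""
    predicted = {frozenset(p) for p in predicted_pairs}
    return transfer_interologs(reference_pairs, orthologs) & predicted


if __name__ == "__main__":
    toy_reference = [("H1", "H2"), ("H2", "H3")]
    toy_orthologs = {
        "H1": [("mouse", "M1"), ("yeast", "Y1")],
        "H2": [("mouse", "M2")],
        "H3": [("yeast", "Y3")],
    }
    toy_predicted = [("M2", "M1"), ("Y1", "Y3")]
    print(supported_interologs(toy_reference, toy_orthologs, toy_predicted))
```

Taking the intersection of two independent evidence sources is what is expected to lower the false-positive rate of the transfer: a pair is kept only when homology-based transfer and the structure-based prediction agree.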
The predicted protein-protein interactions described here have to be corroborated either by experiments or by additional prediction methods for protein-protein interactions. In conclusion, further analysis of the structure of these complexes needs to be performed in order to validate some of these interactions. Work is in progress towards an automatic procedure that could discriminate true interactions from false.

ACKNOWLEDGEMENTS

O.R.I. acknowledges funding from the Strategic Research Fund, Faculty of Biological Sciences, University of Leeds. J.E. acknowledges student fellowships of Departament d'Universitats, Recerca i Societat de la Informació de la Generalitat de Catalunya (DURSI). This work was supported by grants from Fundación Ramón Areces and Spanish Ministerio de Ciencia y Tecnología (McyT, BIO ).

Conflict of Interest: none declared.

REFERENCES

Aloy,P. and Russell,R.B. (2003) InterPreTS: protein interaction prediction through tertiary structure. Bioinformatics, 19,
Aloy,P. and Russell,R.B. (2004) Ten thousand interactions for the molecular biologist. Nat. Biotechnol., 22,
Aloy,P. et al. (2003) The relationship between sequence and interaction divergence in proteins. J. Mol. Biol., 332,
Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25,
Andreeva,A. et al. (2004) SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res., 32, D226-D229.
Apweiler,R. et al. (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Res., 32 (Database issue), D115-D119.
Bader,J.S. et al. (2004) Gaining confidence in high-throughput protein interaction networks. Nat. Biotechnol., 22,
Bornberg-Bauer,E. et al. (2005) The evolution of domain arrangements in proteins and interaction networks. Cell Mol. Life Sci., 62,
Dandekar,T. et al. (1998) Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci., 23,
Deng,M. et al. (2004) Mapping Gene Ontology to proteins based on protein-protein interaction data. Bioinformatics, 20,
Eddy,S.R. (1998) Profile hidden Markov models. Bioinformatics, 14,
Espadaler,J. et al. (2005) Detecting remotely related proteins by their interactions and sequence similarity. Proc. Natl Acad. Sci. USA, 102,
Eswar,N. et al. (2003) Tools for comparative protein structure modeling and analysis. Nucleic Acids Res., 31,
Gavin,A.C. et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415,
Hill,D.P. et al. (2002) Extension and integration of the Gene Ontology (GO): combining GO vocabularies with external vocabularies. Genome Res., 12,
Keskin,O. et al. (2004) A new, structurally nonredundant, diverse data set of protein-protein interfaces and its implications. Protein Sci., 13,
Li,H. et al. (2004a) Discovery of binding motif pairs from protein complex structural data and protein interaction sequence data. Pac. Symp. Biocomput.,
Li,X. et al. (2004b) Protein-protein interactions: hot spots and structurally conserved residues often locate in complemented pockets that pre-organized in the unbound states: implications for docking. J. Mol. Biol., 344,
Lu,L. et al. (2002) Multiprospector: an algorithm for the prediction of protein-protein interactions by multimeric threading. Proteins, 49,
Ma,B. et al. (2003) Protein-protein interactions: structurally conserved residues distinguish between binding sites and exposed protein surfaces. Proc. Natl Acad. Sci. USA, 100,
Marcotte,E. et al. (1999) Detecting protein function and protein-protein interactions from genome sequences. Science, 285,
Matthews,L. et al. (2001) Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or interologs. Genome Res., 11,
Mering,C.V. et al. (2003) STRING: a database of predicted functional associations between proteins. Nucleic Acids Res., 31,
Mewes,H.W. et al. (2004) MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res., 32 (Database issue), D41-D44.
Nye,T.M. et al. (2005) Statistical analysis of domains in interacting protein pairs. Bioinformatics, 21,
Pellegrini,M. et al. (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl Acad. Sci. USA, 96,
Peri,S. et al. (2004) Human protein reference database as a discovery resource for proteomics. Nucleic Acids Res., 32 (Database issue), D497-D501.
Russell,R. and Barton,G. (1992) Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins, 14,
Salwinski,L. et al. (2004) The Database of Interacting Proteins: 2004 update. Nucleic Acids Res., 32 (Database issue), D449-D451.
Tsai,C.J. et al. (1996) A dataset of protein-protein interfaces generated with a sequence-order-independent comparison technique. J. Mol. Biol., 260,
Uetz,P. et al. (2000) A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403,
Von Mering,C. et al. (2002) Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417,
Von Mering,C. et al. (2003) Genome evolution reveals biochemical networks and functional modules. Proc. Natl Acad. Sci. USA, 100,
