Nature Methods: doi: /nmeth Supplementary Figure 1


Supplementary Figure 1 Estimating FDR of PPI predictions. (a-b) We used the approach of D'haeseleer and Church 1 to estimate FDR. This approach calculates the FDR of a PPI dataset, D, by analyzing intersections among three PPI datasets, D, R, and D', where R is a reference set of trusted PPIs and D' is a set of PPIs from a method similar to that of D. It is assumed that the overlap of any two datasets contains largely true positive PPIs. The number of nonoverlapping true positives, IV, is calculated from the numbers of shared PPIs: IV = (II × III) / I. Then, the number of false positives, V, and the FDR are calculated. The FDR tends to be low if D has a high overlap with either D' or R. (c) To calculate the FDR of FpClass, we initially set D to our top 35,000 proteome-wide predictions, excluding any PPIs used in training (we subsequently calculated FDR for larger sets of FpClass predictions; panels d-g). We defined R as a set of experimentally detected interactions and D' as the union of high-confidence predictions from previous studies by Rhodes et al. 2, Scott et al. 3, Elefsinioti et al. 4, and Zhang et al. 5. Using a similar approach, we calculated FDRs for high-confidence predictions from these previous studies. For example, to calculate the FDR for Rhodes et al. 2, we defined D as high-confidence predictions from that study, and D' as the union of top FpClass predictions and high-confidence predictions from the three remaining previous studies. To ensure that estimated FDRs were not due to biases of a particular reference set, we repeated FDR calculations using 6 reference sets. We calculated FDRs using each reference set, except when the intersection of datasets D, D', and R comprised fewer than 5 PPIs. In such cases the FDR is indicated as NA. (d-g) Using the approach of D'haeseleer and Church 1, we estimated FDRs of predicted networks of various sizes from FpClass and four previous prediction methods.
The approach of D'haeseleer and Church 1 requires a trusted reference set of PPIs. We tried four ways of defining this set: (d) using the six reference sets (panel c) individually and then calculating the median of the six resulting FDR estimates, (e) using the union of PPIs from methods that detect direct interactions (Y2H and LUMIER reference sets), (f) using the union of our six reference sets, and (g) using the union of Y2H reference sets.

1 D'haeseleer, P. & Church, G. M. Estimating and improving protein interaction error rates. Proc IEEE Comput Syst Bioinform Conf (2004).
2 Rhodes, D. R. et al. Probabilistic model of the human protein-protein interaction network. Nat Biotechnol 23, (2005).
3 Scott, M. S. & Barton, G. J. Probabilistic prediction and ranking of human protein-protein interactions. BMC Bioinformatics 8, 239 (2007).
4 Elefsinioti, A. et al. Large-scale de novo prediction of physical protein-protein association. Mol Cell Proteomics 10, M (2011).
5 Zhang, Q. C. et al. Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature 490, (2012).
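The overlap-based estimate described in panels a-b can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and arguments are hypothetical, the assignment of regions II and III to particular pairwise overlaps is an assumption (the formula for IV is symmetric in the two), and V is assumed to be the PPIs unique to D that are not accounted for by IV, with FDR = V / |D|.

```python
def estimate_fdr(size_d, n_i, n_ii, n_iii, min_triple_overlap=5):
    """Estimate the FDR of dataset D from its overlaps with D' and R.

    size_d : total number of PPIs in D
    n_i    : PPIs shared by D, D' and R (region I)
    n_ii   : PPIs shared by D and R only (region II, assumed labelling)
    n_iii  : PPIs shared by D and D' only (region III, assumed labelling)

    Returns None (reported as NA) when region I holds fewer than
    min_triple_overlap PPIs, mirroring the rule stated in the caption.
    """
    if n_i < min_triple_overlap:
        return None
    # Non-overlapping true positives: IV = (II x III) / I
    n_iv = n_ii * n_iii / n_i
    # PPIs found only in D are either true positives (IV) or false positives (V)
    unique_to_d = size_d - n_i - n_ii - n_iii
    n_v = max(unique_to_d - n_iv, 0.0)
    return n_v / size_d


print(estimate_fdr(size_d=1000, n_i=50, n_ii=100, n_iii=150))  # 0.4
```

In this toy case IV = 300, so 400 of the 700 PPIs unique to D are counted as false positives. Larger overlaps with D' or R shrink the unique region and lower the estimate, consistent with the remark in the caption.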


Supplementary Figure 2 Experimental validation of PPI predictions. (a) Predicted interactions tested by Co-IP assays. (b-c) Predicted interactions tested by GST pull-down assays. (d) Predicted interaction partners of p53 include some of its known partners and d0 proteins. The x-axis indicates the number of top-ranked predicted partners considered. The y-axis indicates the number of known partners and d0 proteins among those top predicted partners.

Supplementary Figure 3 Top Gene Ontology (GO) categories among d0 genes. (a-c) GO analysis includes genes without GO annotations. (d-f) GO analysis excludes genes without GO annotations. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR.
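Several captions report hypergeometric P-values adjusted for multiple testing using FDR. The sketch below shows one way such values could be computed with the Python standard library. The function names are hypothetical, and the Benjamini-Hochberg procedure is an assumption: the captions say only "FDR" and do not name the adjustment method.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, observed):
    """Right-tail probability P(X >= observed) for a hypergeometric X.

    population : total number of genes considered
    annotated  : genes in the population carrying the annotation
    sample     : size of the gene set being tested (e.g., d0 genes)
    observed   : genes in the sample carrying the annotation
    """
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(observed, upper + 1)
    )
    return favourable / comb(population, sample)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted
```

A deficiency test, as used in some of the supplementary tables, would sum the left tail (k from 0 to the observed count) instead of the right tail.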

Supplementary Figure 4 Percentages of d0 proteins in drug-target classes and structural properties of d0 proteins. (a) Main drug target classes and (b) receptor drug target classes, as defined by Imming et al. 6. Dashed lines indicate the percentage of d0 proteins in the proteome. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR. (c) SCOP 7 structural classes. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR. (d) Protein lengths from UniProt 8 and (e) protein disorder, predicted with DISOPRED 9. P-values for protein length and disorder were calculated by two-sided Mann-Whitney U tests.

6 Imming, P., Sinning, C. & Meyer, A. Drugs, their targets and the nature and number of drug targets. Nat Rev Drug Discov 5, (2006).
7 Andreeva, A. et al. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res 36, D (2008).
8 The UniProt Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 38, D142-8 (2010).
9 Ward, J. J., Sodhi, J. S., McGuffin, L. J., Buxton, B. F. & Jones, D. T. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 337, (2004).

Supplementary Figure 5 Median and maximum expression of d0- and dk-encoding genes. P-values were calculated by two-sided Mann-Whitney U tests. (a-d) Median expression of d0 and dk genes in healthy human tissues. Gene expression data were taken from (a) Su et al. 10, (b) Roth et al. 11, (c) Wang et al. 12, and (d) Krupp et al. 13. (e-h) Maximum expression of d0 and dk genes in the same datasets.

10 Su, A. I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 101, (2004).
11 Roth, R. B. et al. Gene expression analyses reveal molecular relationships among 20 regions of the human CNS. Neurogenetics 7, (2006).
12 Wang, E. T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, (2008).
13 Krupp, M. et al. RNA-Seq Atlas--a reference database for gene expression profiling in normal tissue by next-generation sequencing. Bioinformatics 28, (2012).

Reference Set Name | PPI Detection Method | Number of PPIs | PPI Sources
LUMIER | LUMIER | 756 | Barrios-Rodiles et al., Miller et al.
MS | HT mass spectrometry | ,740 | Behrends et al., Bouwmeester et al., Glatter et al., Hutchins et al., Jeronimo et al., Jorgensen et al., Sowa et al., Xiao et al.
Small-scale screens | Various | 4,547 | compiled in I2D 14
Y2H Bandyopadhyay10 | yeast 2-hybrid | 1,781 | Bandyopadhyay et al.
Y2H CCSB HI | yeast 2-hybrid | 12,227 | The Center for Cancer Systems Biology (CCSB) at the Dana-Farber Cancer Institute harvard.edu/index.php?page= login&lg=/h_sapiens/index. php?page=newrelease
Y2H Wang11 | yeast 2-hybrid | 3,160 | Wang et al.

Supplementary Table 1 Reference sets used for evaluating PPI prediction methods. Six reference sets were used to evaluate FpClass and previous prediction methods by Rhodes et al. 2, Scott et al. 3, Elefsinioti et al. 4, and Zhang et al. 5. The number of PPIs in a reference set is the size of the union of PPIs from the reference set's sources, excluding any PPIs used in the training of prediction methods. MS data (protein complexes) were converted to binary interactions using a spoke model, where the bait is assumed to interact with all members of the complex.

Protein | Symbol | d0 | Score | Detected
Receptor-interacting serine/threonine-protein kinase 1 | RIPK1 | no | | yes
Caspase-8 | CASP8 | no | | no
Serine/threonine-protein kinase PAK 1 | PAK1 | no | | yes
Bcl-2-like protein 1 | BCL2L1 | no | | no
Induced myeloid leukemia cell differentiation protein Mcl-1 | MCL1 | no | | no

Supplementary Table 2 Five predicted PYCARD (PYD And CARD Domain Containing) interactions were tested by Co-IP assays and 2 were confirmed. The first 2 columns show the protein name and gene symbol of predicted interaction partners of PYCARD. Column 3 indicates whether the predicted partner is a d0 protein, column 4 shows the score of the interaction, and the last column shows whether binding to PYCARD was detected by Co-IP assays.

I2D version | Human PPIs | Human proteins
I2D | ,713 | 9,799
I2D ver (used in analysis) | 114,906 | 14,109
I2D ver | ,831 | 14,565

Supplementary Table 3 Numbers of experimentally detected human PPIs and proteins in I2D 14. PPIs predicted from interacting orthologs in model organisms are not included.
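The spoke-model conversion mentioned in the caption of Supplementary Table 1 can be illustrated with a short sketch. The function names and the input layout (a bait plus the list of proteins detected with it) are assumptions made for illustration. The only rule taken from the text is that the bait is paired with every complex member, and members are not paired with each other.

```python
def spoke_model(bait, complex_members):
    """Binary interactions implied by one purification under a spoke model.

    Each interaction is an unordered pair (frozenset) of bait and member.
    A bait listed among its own complex members is skipped.
    """
    return {frozenset((bait, member)) for member in complex_members if member != bait}


def complexes_to_ppis(purifications):
    """Union of spoke-model interactions over (bait, members) records."""
    ppis = set()
    for bait, members in purifications:
        ppis |= spoke_model(bait, members)
    return ppis


# Hypothetical purification: bait A pulled down with B and C
print(sorted(sorted(p) for p in spoke_model("A", ["B", "C"])))
# [['A', 'B'], ['A', 'C']]
```

Note that no B-C interaction is produced; a matrix model would add it, which is why spoke-model evidence is treated as indirect elsewhere in this document.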

Domain | InterPro ID | Description | P-Value
Olfactory receptor | IPR | olfactory receptor | 4.84e-152
GPCR, rhodopsin-like, 7TM | IPR | hormone, neurotransmitter and light receptors | 3.55e-76
Krueppel-associated box | IPR | nucleic acid binding | 4.16e-20
Keratin, high sulphur B2 protein | IPR | synthesized during differentiation of hair matrix cells | 2.71e-16
Mammalian taste receptor | IPR | taste receptor | 1.06e-09
Zinc finger, C2H2-type/integrase, DNA-binding | IPR | nucleic acid binding | 1.48e-09
Major facilitator superfamily, general substrate transporter | IPR | secondary membrane transport | 9.20e-06
Zinc finger, C2H2-like | IPR | DNA-binding motif in eukaryotic transcription factors | 6.03e-05
Cadherin, N-terminal | IPR | Ca2+-dependent cell-cell adhesion | 2.20e-4
GAGE | IPR | exclusively in humans; unknown function; implicated in cancers | 3.63e-4
UDP-glucuronosyl/UDP-glucosyltransferase | IPR | transferase activity, transfer of hexosyl groups | 1.70e-3
Major facilitator superfamily MFS-1 | IPR | transmembrane transport | 3.80e-3
Peptidase M12B, ADAM-TS | IPR | metallopeptidase activity; zinc ion binding | 5.23e-3
ADAM-TS Spacer | IPR | metalloendopeptidase activity; implicated in some cancers and inflammatory diseases | 1.01e-2
Cytochrome P450 | IPR | oxidation-reduction | 1.84e-2
Sulfotransferase domain | IPR | sulfotransferase activity | 2.44e-2

Supplementary Table 4 InterPro annotations enriched among interactome orphans. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR.

PTM | Total proteins with PTM | D0 proteins with PTM | Deficiency P-value
Acetylation | | | e-219
Dephosphorylation | | | e-20
Disulfide Bridge | | | e-22
Glycosylation | | | e-54
Methylation | | | e-15
Myristoylation | | | e-08
Palmitoylation | | | e-09
Phosphorylation | | | e-308
Prenylation | | | e-308
Proteolytic Cleavage | | | e-61
S-Nitrosylation | | | e-15
Sumoylation | | | e-308
Ubiquitination | | | e-308

Supplementary Table 5 D0 proteins are annotated with fewer post-translational modifications (PTMs) than other proteins. Shown are the 13 most frequent PTMs of human proteins (column 1), the numbers of human proteins with PTMs (column 2), the corresponding numbers of d0 proteins (column 3), and P-values characterizing the deficiency of PTMs among d0 proteins. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR.

Feature of d0 proteins | d0, d0 Rual05, d0 CCSB HI
Protein Age: < 20M yrs | 2.56e e e-01
Protein Age: 20M - 90M yrs | 6.91e e e-01
Protein Age: 90M - 320M yrs | 1.97e e e-01
Protein Age: 320M - 800M yrs | 2.71e e e-04
GO CC: extracellular region | 5.62e e e-19
GO MF: signal transducer | 8.69e e e-12
GO MF: receptor | 2.35e e e-29
Remaining rows (P-values not recoverable): GO MF: nucleic acid binding; GO MF: transporter; GO BP: lipid metabolism; GO BP: transmembrane transport; GO BP: carbohydrate metabolism; SCOP structural class: membrane; High tissue specificity; Short protein length; Low protein disorder

Supplementary Table 6 Interaction detection methods are partly responsible for biases of d0 proteins. To investigate the link between detection methods and d0 biases, we analyzed two high-throughput (HT) human Y2H screens: Rual et al. 15 and the Center for Cancer Systems Biology Human Interactome (CCSB HI). For each screen we defined a set of degree 0 proteins comprising proteins that were tested in the screen but for which no interactions were detected. We then tested whether features enriched in d0 proteins (column 2) were also enriched in degree 0 proteins from Rual et al. 15 (column 3) and from CCSB HI (column 4). For most features (e.g., Protein Age: < 20M yrs), P-values were calculated using hypergeometric probability. For three features (high tissue specificity, short protein length, and low protein disorder), P-values were calculated using the Mann-Whitney U test and indicate whether degree 0 proteins had significantly lower (higher) values than proteins with degrees > 0 in the same screen(s) (e.g., column 3 compares proteins with degrees = 0 to proteins with degrees > 0 in Rual et al. 15). P-values were adjusted for multiple testing using FDR.
Feature of d0 proteins | d0, d0 Direct, d0 Indirect
Protein Age: < 20M yrs | 2.56e e e-06
Protein Age: 20M - 90M yrs | 6.91e e e-16
Protein Age: 90M - 320M yrs | 1.97e e e-81
Protein Age: 320M - 800M yrs | 2.71e e e-17
GO CC: extracellular region | 5.62e e e-31
GO MF: signal transducer | 8.69e e e-33
GO MF: receptor | 2.35e e e-42
GO MF: nucleic acid binding | 1.06e e E+00
GO MF: transporter | 2.27e e e-07
GO BP: lipid metabolism | 7.49e e e-05
GO BP: transmembrane transport | 5.58e e e-05
GO BP: carbohydrate metabolism | 2.52e e e-03
SCOP structural class: membrane | 3.62e e e-54
High tissue specificity | 1.75e e e-31
Short protein length | 1.78e e e-18
Low protein disorder | 1.29e e e-70

Supplementary Table 7 Biases of d0 proteins remain unchanged if the known human interactome is restricted to PPIs from methods that detect direct binding (e.g., Y2H) or to PPIs from methods that detect protein complexes (e.g., Co-IP). Our study defined d0 proteins as proteins absent from the known interactome, represented by the I2D database. PPI databases such as I2D include interactions from methods that detect direct binding or protein complexes. Complexes are represented as binary interactions using a spoke model, which assumes that the bait interacts with all complex members. Thus, our d0 proteins had neither direct nor indirect (i.e., spoke model) evidence for interaction. Biases of d0 proteins, along with corresponding P-values, are shown in columns 1 and 2. We investigated whether these biases would remain unchanged if the known interactome comprised a subset of PPIs: ones with evidence for direct binding, or ones based on spoke models of detected complexes. Column 3 shows P-values of proteins, D0 Direct, absent from the set of directly binding PPIs. Column 4 shows P-values of proteins, D0 Indirect, absent from the set of PPIs based on spoke models. Most biases remain significant when the definition of the known human interactome is altered. P-values were calculated by hypergeometric and Mann-Whitney U tests, and adjusted for multiple testing using FDR.

Model organism | Human genes without 1:1 orthologs | Human d0 genes without 1:1 orthologs | Deficiency P-value
Yeast (S. cerevisiae) | (78%) | 6446 (89%) | 1.26e-174
Worm (C. elegans) | (57%) | 5363 (74%) | 4.20e-305
Fly (D. melanogaster) | (53%) | 5228 (72%) | 1.00e-308
Mouse (M. musculus) | 3007 (15%) | 2241 (31%) | 1.00e-308
Rat (R. norvegicus) | 3739 (19%) | 2452 (34%) | 1.00e-308

Supplementary Table 8 D0 proteins are less likely to have orthologs in model organisms than other human proteins. The above table shows five model organisms (column 1), the total number of human proteins without 1:1 orthologs in these organisms (column 2), corresponding numbers of d0 proteins (column 3), and P-values characterizing the deficiency of d0 proteins with model organism orthologs (column 4). P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR.

Feature of d0 proteins | d0, d1-d4, d5-d15
Protein Age: < 20M yrs | 2.56e e e-01
Protein Age: 20M - 90M yrs | 6.91e e e-01
Protein Age: 90M - 320M yrs | 1.97e e e-01
Protein Age: 320M - 800M yrs | 2.71e e e-01
GO CC: extracellular region | 5.62e e e-01
GO MF: signal transducer | 8.69e e e+00
GO MF: receptor | 2.35e e e+00
GO MF: nucleic acid binding | 1.06e e e+00
GO MF: transporter | 2.27e e e+00
GO BP: lipid metabolism | 7.49e e e+00
GO BP: transmembrane transport | 5.58e e e+00
GO BP: carbohydrate metabolism | 2.52e e e+00
SCOP structural class: membrane | 3.62e e e-01
High tissue specificity | 8.75e e e-01
Short protein length | 1.78e e e-01
Low protein disorder | 6.47e e e+00

Supplementary Table 9 Human proteins with few known interactions (degrees 1 through 4) have most of the same biases as d0 proteins, while proteins with higher degrees have few such biases. The above table shows features enriched in d0 proteins (column 1), and enrichment P-values of these features in d0 proteins (column 2), proteins with degrees 1 through 4 (column 3), proteins with degrees 5 through 15 (column 4), and proteins with degrees 16 through 1651 (column 5). Most features (18/22) enriched in d0 proteins are also enriched in proteins with degrees 1 through 4; only 1 feature is enriched in proteins with higher degrees. P-values were calculated by hypergeometric and Mann-Whitney U tests, and adjusted for multiple testing using FDR.

Feature of d0 proteins | d0, d0 Fp60
Protein Age: < 20M yrs | 2.56e e-01
Protein Age: 20M - 90M yrs | 6.91e e-01
Protein Age: 90M - 320M yrs | 1.97e e-10
Protein Age: 320M - 800M yrs | 2.71e e-07
GO CC: extracellular region | 5.62e e-04
GO MF: signal transducer | 8.69e e+00
GO MF: receptor | 2.35e e-01
GO MF: nucleic acid binding | 1.06e e+00
GO MF: transporter | 2.27e e-03
GO BP: lipid metabolism | 7.49e e-01
GO BP: transmembrane transport | 5.58e e-01
GO BP: carbohydrate metabolism | 2.52e e-01
SCOP structural class: membrane | 3.62e e-04
High tissue specificity | 8.75e e-01
Short protein length | 1.78e e-01
Low protein disorder | 6.47e e-12

Supplementary Table 10 D0 proteins with predicted high-confidence interactions have many of the same biases as other d0 proteins. The above table shows features enriched in d0 proteins (column 1), enrichment P-values among all d0 proteins (column 2), and enrichment P-values among d0 proteins in the Fp60 network (i.e., d0 proteins with high-confidence predicted interactions) (column 3).

Supplementary Note 1

1. Features of individual proteins: description and sources

Features of individual proteins comprised protein domains, post-translational modifications (PTMs), structural-chemical protein features, and Gene Ontology annotations (cellular component, molecular function, and biological process). Domains were obtained from InterPro 28, version 21.0, and UniProt 8, release. Post-translational modifications were obtained from UniProt 8, release 15.0, and the Human Protein Reference Database (HPRD) 29, release 8.0. The domains and PTMs of a protein were represented by a binary vector; each entry in the vector indicated whether a given domain or PTM was present in the protein.

Structural-chemical features were determined from protein sequence by two programs: PSIPRED 30,31, version 26, and the pepstats application from the European Molecular Biology Open Software Suite 32. PSIPRED 30,31 was used to predict the fraction of a protein's residues in disordered regions, alpha helices, beta sheets and coils. Pepstats was used to calculate 11 chemical features: charge, isoelectric point, and the molar percent of each physico-chemical class of amino acid. Each feature calculated by PSIPRED 30,31 and pepstats was discretized into 7 intervals. An interval represented a range of percentiles; for example, the first interval contained values between the 0th and 2.5th percentiles. The 7 intervals were defined as follows: [0%,2.5%], [0%,10%], [0%,40%], [40%,60%], [60%,100%], [90%,100%], [97.5%,100%]. The intervals were overlapping, with the goal of capturing different levels of low and high values as well as intermediate values.

Human Gene Ontology annotations were downloaded from the Gene Ontology 33 website on Apr. 21. Proteins were annotated with the cellular component, molecular function, and biological process terms specified in the downloaded file, as well as ancestors of these terms.

2. Calculating interaction scores from features of individual proteins

To predict PPIs from features of individual proteins, we identified pairs of features enriched among known interacting protein pairs (such that each protein in an interacting pair has one of the features), and used these feature pairs as rules for predicting interactions. This approach was originally proposed by Sprinzak et al. 34 and has been used to predict PPIs based on pairs of domains 2,3,34 and pairs of post-translational modifications 3 enriched among known interacting protein pairs. The main aspects of the approach have remained similar across different studies: a single feature type is chosen (e.g., protein domains), a pair of features of this type (one feature on each protein) provides evidence of interaction, and the strength of the evidence is proportional to the enrichment of the feature pair among known interactions.

To make predictions more comprehensive, less biased and ultimately more accurate, we extended this approach by considering three possibilities: (1) an interaction could require the presence of several features in a protein, (2) the required features could be of the same or of different types, and (3) the presence of particular features in a pair of proteins could provide evidence for or against interaction. To take these possibilities into account, we made 3 changes to the original approach:
1. we identified sets of features, of the same or different types, which co-occur in proteins;
2. we identified pairs of feature sets that were either enriched or deficient among known PPIs;
3. we filtered resulting feature set pairs to reduce redundancy and improve prediction accuracy.

2.1. Identifying sets of co-occurring features

Interactions in which a protein participates may require the presence of several features (e.g., multiple domains); we assume that a group of such features is likely to co-occur in proteins more frequently than expected by chance. To identify such feature sets we implemented a data-mining algorithm based on frequent pattern growth 35, a method that finds all sets of features that occur at least k times in a dataset, where k is a user-specified threshold. Input to a frequent pattern growth algorithm consists of records, where each record is a set of features. Our input comprised records for 19,698 proteins; each record contained features of a single protein, such as domains and PTMs. Output from frequent pattern growth consists of feature sets that are subsets of at least k records. For example, with a setting of k = 3, we identified the feature set {myb DNA binding domain, Homeodomain-related}, which meant that these two domains occurred together in at least 3 proteins. Feature sets that have at least k occurrences are referred to as frequent feature sets, and their number of occurrences is referred to as support.

To reduce the search time and output of frequent pattern growth, we made several modifications to the algorithm:

(1) A feature set was discarded if its support was similar to that of its subsets, i.e., at least 80% of a subset's support. Such feature sets were not recorded and their supersets were not considered. This was done because a set with similar support to a subset would provide similar information about interaction: the two sets would be present on most of the same proteins, and likely in many of the same interacting protein pairs.

(2) Whenever a feature f was considered as an extension for feature set FS, the support of FS ∪ f had to be substantially higher than expected by chance. The expected support of FS ∪ f was calculated as

\mathrm{sup}_{exp}(FS \cup f) = \frac{\mathrm{sup}(FS) \cdot \mathrm{sup}(f)}{N},

where N was the number of records in the dataset. To retain feature set FS ∪ f and expand it further, the following criteria had to be met: (1)

(2) (3) When feature f was considered as an extension for set FS, the probability of sup(FS ∪ f) being equal to or greater than its observed value had to be less than a minimum threshold. This probability was calculated with the hypergeometric distribution, H(N, M, n, m), using the following parameters: N = total records (i.e., number of proteins), M = sup(f), n = sup(FS), and m = sup(FS ∪ f). If the cumulative probability was > 0.05, feature set FS ∪ f was not recorded and its supersets were not considered.

After identifying feature sets we annotated all proteins with both their original features and with feature sets. No distinction was made between the two types of annotations: original features were considered as feature sets of length 1. We refer to the feature sets of a protein simply as features.

2.2. Identifying pairs of enriched and deficient feature sets

For all annotated proteins we determined the support of features and feature pairs among training cases (interacting and non-interacting protein pairs). For each feature, i, we determined support values possup_i and negsup_i among positive and negative training cases, respectively. Support was defined as the number of protein pairs where at least one of the proteins was annotated with the feature. Similarly, for each pair of features, (i, j), we calculated support values possup_ij and negsup_ij among positive and negative cases, respectively. In this case, support was defined as the number of protein pairs where one of the proteins had feature i and the other had feature j.

For each feature pair, (i, j), we used support values to calculate several measures quantifying enrichment or deficiency of the pair among positive training cases. Two measures, rpos and ppos, represented enrichment or deficiency relative to the expected occurrences of (i, j) among positive cases. We defined the expected occurrences of (i, j) among positive cases as:

\mathrm{exppos}_{ij} = \frac{\mathrm{possup}_i \cdot \mathrm{possup}_j}{n_{pos}}, (3)

where npos is the number of positive training cases.
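The support definitions above can be sketched in a few lines. The function names, the annotation dictionary, and the toy data are hypothetical. The expected count uses the usual independence form (product of the two single-feature supports divided by the number of cases), which is an assumption consistent with the hypergeometric parameters given in this section rather than a formula quoted from the source.

```python
def feature_support(pairs, annotations, feature):
    """Number of protein pairs where at least one protein has the feature."""
    return sum(
        1
        for a, b in pairs
        if feature in annotations.get(a, set()) or feature in annotations.get(b, set())
    )


def pair_support(pairs, annotations, feat_i, feat_j):
    """Number of protein pairs where one protein has feat_i and the other feat_j."""
    count = 0
    for a, b in pairs:
        fa, fb = annotations.get(a, set()), annotations.get(b, set())
        if (feat_i in fa and feat_j in fb) or (feat_j in fa and feat_i in fb):
            count += 1
    return count


def expected_pair_support(sup_i, sup_j, n_cases):
    """Expected co-occurrence under independence (assumed form)."""
    return sup_i * sup_j / n_cases


# Toy positive training cases with hypothetical features "x" and "y"
annotations = {"P1": {"x"}, "P2": {"y"}, "P3": {"x", "y"}, "P4": set()}
positives = [("P1", "P2"), ("P3", "P4"), ("P1", "P4"), ("P2", "P4")]

possup_x = feature_support(positives, annotations, "x")
possup_y = feature_support(positives, annotations, "y")
possup_xy = pair_support(positives, annotations, "x", "y")
print(possup_x, possup_y, possup_xy)  # 3 3 1
print(expected_pair_support(possup_x, possup_y, len(positives)))  # 2.25
```

Here the pair (x, y) is observed once against an expectation of 2.25, so it would count as deficient among these toy positives.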
rpos was the ratio between the observed support and the expected support defined above:

r_{pos} = \frac{\mathrm{possup}_{ij}}{\mathrm{exppos}_{ij}}. (4)

If the value of rpos was greater than 1, ppos was the probability of possup_ij being greater than or equal to its observed value, given the values of possup_i and possup_j. If rpos was less than 1, ppos was the probability of possup_ij being less than or equal to its observed value. ppos was calculated by the cumulative hypergeometric distribution with the settings N = npos, M = possup_i, n = possup_j, m = possup_ij.

Two other measures, rall and pall, represented enrichment or deficiency of the feature pair (i, j) among positive cases, relative to the expected occurrences of (i, j) among negative cases. The number of expected occurrences among negative cases was defined as:

(5)

where nneg is the number of negative training cases. rall was defined as: (6)

rall > 1 indicated enrichment of (i, j) among positive cases, while rall < 1 indicated deficiency. pall was defined as the cumulative hypergeometric probability with parameters N = nall, M = npos, n = possup_ij + negsup_ij, m = possup_ij, where nall is the total number of training cases. If rall was greater than 1, the right tail of the distribution was used; otherwise, the left tail was used. Feature pairs were considered to be enriched or deficient among positive cases if their values of ppos and pall were less than a set threshold.

2.3. Calculating interaction scores from feature pairs

A set, S, of enriched and deficient feature pairs was used to determine interaction scores. To calculate an interaction score for a protein pair (P_a, P_b), feature pairs (i, j) from S were selected such that i was a feature of one protein and j was a feature of the other. Among the selected feature pairs, the pairs with the highest and lowest rall values were identified. These pairs, fp_max and fp_min, provided the strongest evidence for and against interaction, respectively. Their rall values, rall_max and rall_min, were used to set the interaction score as follows: (7)

2.4. Filtering feature sets

Using feature sets of length 1 resulted in a large number of redundant feature pairs: multiple feature pairs present in largely the same protein pairs. The presence of such feature pairs lowered prediction accuracy. This happened when feature pairs predicted the same true positive cases, but different false positive cases. For example, feature pairs (i, j) and (k, l) could each be fairly accurate, predicting ntp true positives and nfp false positives with nfp = 1/5 ntp. However, if they predict the same true positives but different false positives, then using them together results in ntp true positives and 2 nfp false positives. Combining more such feature pairs would give a linear increase in the number of false positives.
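The arithmetic in this example can be made concrete. The sketch below (function name hypothetical) computes the precision obtained when several feature pairs predict the same ntp true positives but each contributes nfp distinct false positives, using the ratio nfp = 1/5 ntp from the text.

```python
from fractions import Fraction


def precision_with_redundant_pairs(n_tp, n_fp, n_pairs):
    """Precision when n_pairs redundant feature pairs are used together.

    All pairs predict the same n_tp true positives; each adds n_fp
    false positives that the others do not predict.
    """
    return Fraction(n_tp, n_tp + n_pairs * n_fp)


for n_pairs in (1, 2, 5):
    print(n_pairs, precision_with_redundant_pairs(n_tp=5, n_fp=1, n_pairs=n_pairs))
# 1 5/6
# 2 5/7
# 5 1/2
```

True positives stay fixed while false positives grow linearly with the number of redundant pairs, so precision falls from 5/6 for a single pair to 1/2 for five.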
To reduce this problem we ensured that only one feature pair could gain support from a positive training case. This was implemented with a greedy set cover algorithm. All feature pairs from S with rall > 1 were placed in a set P. The feature pair fp_max with the highest rall value was identified and moved from P to a set Q. Positive training cases where fp_max was present were identified. If a feature pair in P was present in k of these cases, its possup value was lowered by k. Its rall value was recalculated, and if the rall value reached 1, the feature pair was removed from P. These steps were repeated until set P was empty. Interaction scores for test cases were then determined from feature pairs in Q and feature pairs in S that had rall values less than 1.

3. Features of protein pairs

Features of protein pairs consisted of information about interacting orthologs, paralogs, gene co-expression and network topology.

Information about interacting orthologs was taken from I2D 14, version 1.95, on Apr. 3. For a given protein pair, there were 5 interaction scores based on orthology information. If human proteins (i, j) had interacting orthologs (i', j') in a model organism, sequence identities were determined between i and i', and between j and j'. The lower of these sequence identities was used as a score for proteins (i, j). Such scores were determined based on 5 model organisms: mouse, rat, fly, worm, and yeast. Sequence identities were obtained from Ensembl BioMart (Ensembl Genes 63).

A score based on paralogy data was determined in a similar way: if proteins (i, j) had interacting paralogs (i', j') in the training data, the lower of the two sequence identities, I(i, i') and I(j, j'), was used as a score. Sequence identities were obtained from Ensembl BioMart (Ensembl Genes 63).

Gene co-expression information was based on 10 gene expression datasets from the Gene Expression Omnibus 36: GDS596, GDS1221, GDS1289, GDS1329, GDS1618, GDS1730, GDS2250, GDS2545, GDS2780, and GDS2842. Each of these datasets contained over 8,000 genes measured in at least 15 samples. Each dataset was processed by the MAS 5.0 algorithm, using the affy package (version ) in R (version 2.8) 37. Expression levels of each sample were mean-centered and levels from multiple probe sets for the same gene were averaged. Using each dataset, Pearson correlation coefficients were calculated for all available gene pairs; these correlations were used as interaction scores.

Topology information was based on a PPI network comprising positive training cases. For a given protein pair (a, b), three interaction scores were calculated based on the known interactors of proteins a and b.
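The orthology-based scores described earlier in this section reduce to taking the lower of two sequence identities per model organism. A minimal sketch follows; the function name and input layout are hypothetical, representing a missing score as None is an assumption, and the case where a human pair has more than one interacting ortholog pair in the same organism is not addressed here because the text does not specify it.

```python
MODEL_ORGANISMS = ("mouse", "rat", "fly", "worm", "yeast")


def orthology_scores(identities):
    """Per-organism orthology scores for one human protein pair (i, j).

    identities maps an organism name to a tuple
    (identity between i and i', identity between j and j')
    for an interacting ortholog pair (i', j') in that organism.
    Organisms without interacting orthologs are simply absent.
    """
    scores = {}
    for organism in MODEL_ORGANISMS:
        pair = identities.get(organism)
        # The lower of the two identities is used as the score
        scores[organism] = min(pair) if pair is not None else None
    return scores


print(orthology_scores({"mouse": (92.0, 85.5), "yeast": (30.0, 41.0)}))
# {'mouse': 85.5, 'rat': None, 'fly': None, 'worm': None, 'yeast': 30.0}
```

The paralogy score follows the same pattern, with identities to interacting paralogs in the training data in place of orthologs.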
The first score was simply the number of interactors shared by the two proteins, |I_c|, where I_c is the set of shared interactors of proteins a and b. The second score, from Scott et al. 3, adjusted the number of shared interactors by the degrees of proteins a and b; in its definition, E_c is the set of edges from proteins a and b to their shared interactors, E_a is the set of edges of protein a, and E_a \ E_c is the set E_a minus the set E_c. Proteins a and b received a high score if most of their interactors were the same. The third score, S_pshared, estimated the probability that proteins a and b would share at least their observed k common neighbors. This estimate depended on 4 variables: the degrees of a and b, the number of shared neighbors, and the degrees of the shared neighbors. To derive this estimate we started with the simplifying assumption that all neighbors of a and b had equal degrees. We defined a such that deg(a) deg(b) and estimated the probability of exactly k shared neighbors as follows: (8) (9), (10) where P( ) is the probability of an interaction between a and a neighbor of b, n_b. We defined P( ) as:

[equation (11) lost in extraction]

If degrees are not equal, the probability of k shared neighbors depends on the degrees of these neighbors. Since we were interested in the probability of the observed shared neighbors, $N_{ba}$, we defined our required probability, P, as follows: the probability of a and b sharing at least k neighbors with degrees similar to or lower than those in the set $N_{ba}$. Before calculating this probability we defined three sets containing neighbors of b: set $N_b$, containing all neighbors of b; set […], containing neighbors not shared with protein a; and the previously mentioned set $N_{ba}$, containing neighbors shared with protein a. We estimated P for exactly k shared neighbors as follows: [equation (12) lost in extraction], where […] is the maximum degree of neighbors in the set $N_{ba}$, and [equation (13) lost in extraction], where […] is the probability of an interaction between a and the j-th neighbor in set […].

4. Calculating a probability of interaction from interaction scores

For a given protein pair (i, j), we calculated a single probability of interaction based on interaction scores. This was done in three steps: first, each score was used to calculate a probability of interaction; second, these probabilities were combined into a single probability using a noisy-OR model; and lastly, a final probability of interaction was calculated, taking into account the distribution of noisy-OR probabilities among training cases and the frequency of interactions among human protein pairs.

In the first step, for each score, $s_{(i,j),k}$, a probability of interaction was calculated as [equation (14) lost in extraction].

In the second step, probabilities from all scores were integrated into a single probability using a noisy-OR model 38,39:

$$P_{(i,j),\text{noisy-OR}} = 1 - \prod_{k=1}^{n} \left(1 - P_{(i,j),k}\right), \quad (15)$$

where n is the number of scores and $P_{(i,j),k}$ is written here for the probability obtained from score $s_{(i,j),k}$ in the first step.

In the third step, a final probability of interaction was calculated as [equation (16) lost in extraction].
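As a hedged illustration of two pieces of this calculation, the Python sketch below implements the standard noisy-OR combination used in the second step, together with a re-weighting of a posterior probability from one prior to another in odds form, of the kind the text applies next (training ratio 1:100, assumed proteome-wide ratio 1:600). The function names `noisy_or` and `adjust_prior` and their parameters are hypothetical. The per-score probabilities of the first step and the final probability of the third step are not reproduced, because their equations are not available here.

```python
def noisy_or(probabilities):
    """Combine per-score probabilities with a noisy-OR: 1 - prod(1 - p_k)."""
    remaining = 1.0  # probability that no score "fires"
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        remaining *= 1.0 - p
    return 1.0 - remaining


def adjust_prior(p_old, old_ratio=100.0, new_ratio=600.0):
    """Re-express a posterior computed under prior odds 1:old_ratio
    as a posterior under prior odds 1:new_ratio (Bayes' rule in odds form).

    The defaults correspond to priors of 1/101 and 1/601.
    """
    if not 0.0 <= p_old <= 1.0:
        raise ValueError("p_old must lie in [0, 1]")
    if p_old == 1.0:
        return 1.0  # certainty is unaffected by the prior
    # Likelihood ratio = posterior odds / prior odds, with prior odds 1/old_ratio.
    likelihood_ratio = (p_old / (1.0 - p_old)) * old_ratio
    new_odds = likelihood_ratio / new_ratio
    return new_odds / (1.0 + new_odds)
```

For example, `noisy_or([0.5, 0.5])` gives 0.75, and a probability of 0.5 under the 1:100 prior maps to 1/7 under the 1:600 prior, which shows how strongly the lower assumed interaction frequency reduces the final estimate.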

Lastly, this probability was adjusted to account for the fact that the frequency of positive cases in the training data, 1:100, is likely higher than the frequency of interactions among human protein pairs, which we assumed to be 1:600. We viewed this adjustment as recalculating the posterior probability using a prior of 1/601 rather than 1/101; we refer to the unadjusted probability as $P_{100}$ and to the adjusted probability as $P_{600}$. First, we calculated a likelihood ratio, LR, based on $P_{100}$. Next, we calculated $P_{600}$ based on LR and a prior probability of 1/601. The displayed equations are missing here; in odds form, Bayes' rule gives

$$LR = \frac{P_{100} / (1 - P_{100})}{(1/101) / (100/101)} = \frac{100\,P_{100}}{1 - P_{100}},$$

$$P_{600} = \frac{LR \cdot (1/600)}{1 + LR \cdot (1/600)} = \frac{LR}{LR + 600}.$$

References

1. D'Haeseleer, P. & Church, G. M. Estimating and improving protein interaction error rates. Proc IEEE Comput Syst Bioinform Conf (2004).
2. Rhodes, D. R. et al. Probabilistic model of the human protein-protein interaction network. Nat Biotechnol 23, (2005).
3. Scott, M. S. & Barton, G. J. Probabilistic prediction and ranking of human protein-protein interactions. BMC Bioinformatics 8, 239 (2007).
4. Elefsinioti, A. et al. Large-scale de novo prediction of physical protein-protein association. Mol Cell Proteomics 10, M (2011).
5. Zhang, Q. C. et al. Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature 490, (2012).
6. Imming, P., Sinning, C. & Meyer, A. Drugs, their targets and the nature and number of drug targets. Nat Rev Drug Discov 5, (2006).
7. Andreeva, A. et al. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res 36, D (2008).
8. Consortium, T. U. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 38, D142-8 (2010).

9. Ward, J. J., Sodhi, J. S., McGuffin, L. J., Buxton, B. F. & Jones, D. T. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 337, (2004).
10. Su, A. I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 101, (2004).
11. Roth, R. B. et al. Gene expression analyses reveal molecular relationships among 20 regions of the human CNS. Neurogenetics 7, (2006).
12. Wang, E. T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, (2008).
13. Krupp, M. et al. RNA-Seq Atlas--a reference database for gene expression profiling in normal tissue by next-generation sequencing. Bioinformatics 28, (2012).
14. Brown, K. R. & Jurisica, I. Unequal evolutionary conservation of human protein interactions in interologous networks. Genome Biol 8, R95 (2007).
15. Rual, J. F. et al. Towards a proteome-scale map of the human protein-protein interaction network. Nature 437, (2005).
16. Barrios-Rodiles, M. et al. High-throughput mapping of a dynamic signaling network in mammalian cells. Science 307, (2005).
17. Zhu, H. et al. Global analysis of protein activities using proteome chips. Science 293, (2001).
18. Behrends, C., Sowa, M. E., Gygi, S. P. & Harper, J. W. Network organization of the human autophagy system. Nature 466, (2010).
19. Bouwmeester, T. et al. A physical and functional map of the human TNF-alpha/NF-kappa B signal transduction pathway. Nat Cell Biol 6, (2004).
20. Glatter, T., Wepf, A., Aebersold, R. & Gstaiger, M. An integrated workflow for charting the human interaction proteome: insights into the PP2A system. Mol Syst Biol 5, 237 (2009).
21. Hutchins, J. R. et al. Systematic analysis of human protein complexes identifies chromosome segregation proteins. Science 328, (2010).
22. Jeronimo, C. et al. Systematic analysis of the protein interaction network for the human transcription machinery reveals the identity of the 7SK capping enzyme. Mol Cell 27, (2007).
23. Jorgensen, C. et al. Cell-specific information processing in segregating populations of Eph receptor ephrin-expressing cells. Science 326, (2009).
24. Sowa, M. E., Bennett, E. J., Gygi, S. P. & Harper, J. W. Defining the human deubiquitinating enzyme interaction landscape. Cell 138, (2009).

25. Xiao, K. et al. Functional specialization of beta-arrestin interactions revealed by proteomic analysis. Proc Natl Acad Sci U S A 104, (2007).
26. Bandyopadhyay, S. et al. A human MAP kinase interactome. Nat Methods 7, (2010).
27. Wang, J. et al. Toward an understanding of the protein interaction network of the human liver. Mol Syst Biol 7, 536 (2011).
28. Hunter, S. et al. InterPro: the integrative protein signature database. Nucleic Acids Res 37, D211-5 (2009).
29. Keshava Prasad, T. S. et al. Human Protein Reference Database update. Nucleic Acids Res 37, D (2009).
30. Bryson, K. et al. Protein structure prediction servers at University College London. Nucleic Acids Res 33, W36-8 (2005).
31. Jones, D. T. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292, (1999).
32. Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 16, (2000).
33. Consortium, G. O. Creating the gene ontology resource: design and implementation. Genome Res 11, (2001).
34. Sprinzak, E. & Margalit, H. Correlated sequence-signatures as markers of protein-protein interaction. J Mol Biol 311, (2001).
35. Han, J., Pei, J. & Yin, Y. Mining frequent patterns without candidate generation (2000).
36. Barrett, T. et al. NCBI GEO: mining millions of expression profiles--database and tools. Nucleic Acids Res 33, D562-6 (2005).
37. Team, R. D. C. R: A Language and Environment for Statistical Computing. (2008).
38. Kim, J. H. & Pearl, J. A computational model for causal and diagnostic reasoning in inference system. IJCAI (1983).
39. Parsons, S. & Bigham, J. Possibility theory and the generalised Noisy OR model. in Proc. 6th Int. Conf. Inf. Process. Manag. Uncertain (1996).
