Nature Methods: doi: /nmeth Supplementary Figure 1


Supplementary Figure 1 Estimating FDR of PPI predictions.

(a-b) We used the approach of D'haeseleer and Church 1 to estimate FDR. This approach calculates the FDR of a PPI dataset, D, by analyzing intersections among three PPI datasets, D, R, and D', where R is a reference set of trusted PPIs and D' is a set of PPIs from a method similar to that of D. It is assumed that the overlap of any two datasets contains largely true positive PPIs. The number of nonoverlapping true positives, IV, is calculated from the numbers of shared PPIs: IV = (II × III) / I. Then, the number of false positives, V, and the FDR are calculated. The FDR tends to be low if D has a high overlap with either D' or R.

(c) To calculate the FDR of FpClass, we initially set D to our top 35,000 proteome-wide predictions, excluding any PPIs used in training (we subsequently calculated FDRs for larger sets of FpClass predictions; see panels d-g). We defined R as a set of experimentally detected interactions and D' as the union of high-confidence predictions from previous studies by Rhodes et al. 2, Scott et al. 3, Elefsinioti et al. 4, and Zhang et al. 5. Using a similar approach, we calculated FDRs for high-confidence predictions from these previous studies. For example, to calculate the FDR for Rhodes et al. 2, we defined D as high-confidence predictions from that study, and D' as the union of top FpClass predictions and high-confidence predictions from the three remaining previous studies. To ensure that estimated FDRs were not due to biases of a particular reference set, we repeated FDR calculations using 6 reference sets. We calculated FDRs using each reference set, except when the intersection of datasets D, D', and R comprised fewer than 5 PPIs. In such cases the FDR is indicated as NA.

(d-g) Using the approach of D'haeseleer and Church 1, we estimated FDRs of predicted networks of various sizes from FpClass and four previous prediction methods. This approach requires a trusted reference set of PPIs. We tried four ways of defining this set: (d) using six reference sets (panel c) individually, and then calculating the median of the six resulting FDR estimates, (e) using the union of PPIs from methods that detect direct interactions (Y2H and LUMIER reference sets), (f) using the union of our six reference sets, and (g) using the union of Y2H reference sets.

1 D'haeseleer, P. & Church, G. M. Estimating and improving protein interaction error rates. Proc IEEE Comput Syst Bioinform Conf (2004).
2 Rhodes, D. R. et al. Probabilistic model of the human protein-protein interaction network. Nat Biotechnol 23, (2005).
3 Scott, M. S. & Barton, G. J. Probabilistic prediction and ranking of human protein-protein interactions. BMC Bioinformatics 8, 239 (2007).
4 Elefsinioti, A. et al. Large-scale de novo prediction of physical protein-protein association. Mol Cell Proteomics 10, M (2011).
5 Zhang, Q. C. et al. Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature 490, (2012).
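The overlap-based FDR estimate described in the Supplementary Figure 1 caption can be illustrated with a minimal Python sketch. Several details are assumptions made for illustration and are not stated in the caption: region I is taken to be the PPIs shared by all three datasets, regions II and III the PPIs that D shares with only R or only D', the FDR is taken as V divided by the size of D, and V is clamped at zero. The function name and the toy integer IDs are hypothetical.

```python
def estimate_fdr(d, d_prime, r, min_triple_overlap=5):
    """Estimate the FDR of dataset d from its overlaps with d_prime and r.

    Returns None (shown as NA in the figure) when fewer than
    min_triple_overlap PPIs are shared by all three datasets.
    """
    d, d_prime, r = set(d), set(d_prime), set(r)
    n_i = len(d & d_prime & r)          # region I: shared by all three datasets
    if n_i < min_triple_overlap:
        return None
    n_ii = len((d & r) - d_prime)       # region II (assumed): shared with R only
    n_iii = len((d & d_prime) - r)      # region III (assumed): shared with D' only
    n_unshared = len(d - d_prime - r)   # PPIs found only in D (regions IV + V)
    n_iv = n_ii * n_iii / n_i           # IV = (II x III) / I
    n_v = max(n_unshared - n_iv, 0.0)   # V, clamped at zero (assumption)
    return n_v / len(d)                 # assumed definition: FDR = V / |D|


# Toy example: integer IDs stand in for PPIs.
d = set(range(20))
d_prime = set(range(0, 5)) | set(range(9, 14)) | {100, 101}
r = set(range(0, 9)) | {200}
fdr = estimate_fdr(d, d_prime, r)
```

In the toy data, 5 PPIs fall in region I, 4 in region II, 5 in region III, and 6 are found only in D, so IV = 4, V = 2, and the estimate is 2/20.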

Supplementary Figure 2 Experimental validation of PPI predictions. (a) Predicted interactions tested by Co-IP assays. (b-c) Predicted interactions tested by GST pull-down assays. (d) Predicted interaction partners of p53 include some of its known partners and d0 proteins. The x-axis indicates the number of top predicted partners, ranked from 1 to The y-axis indicates the number of known partners and d0 proteins, among the top predicted partners.

Supplementary Figure 3 Top Gene Ontology (GO) categories among d0 genes. (a-c) GO analysis includes genes without GO annotations. (d-f) GO analysis excludes genes without GO annotations. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR.

Supplementary Figure 4 Percentages of d0 proteins in drug-target classes and structural properties of d0 proteins. (a) Main drug target classes and (b) receptor drug target classes, as defined by Imming et al. 6. Dashed lines indicate the percentage of d0 proteins in the proteome. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR. (c) SCOP 7 structural classes. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR. (d) Protein lengths from UniProt 8 and (e) protein disorder, predicted with DISOPRED 9. P-values for protein length and disorder were calculated by two-sided Mann-Whitney U tests.

6 Imming, P., Sinning, C. & Meyer, A. Drugs, their targets and the nature and number of drug targets. Nat Rev Drug Discov 5, (2006).
7 Andreeva, A. et al. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res 36, D (2008).
8 The UniProt Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 38, D142-8 (2010).
9 Ward, J. J., Sodhi, J. S., McGuffin, L. J., Buxton, B. F. & Jones, D. T. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 337, (2004).

Supplementary Figure 5 Median and maximum expression of d0- and dk-encoding genes. P-values were calculated by two-sided Mann-Whitney U tests. (a-d) Median expression of d0 and dk genes in healthy human tissues. Gene expression data were taken from (a) Su et al. 10, (b) Roth et al. 11, (c) Wang et al. 12, and (d) Krupp et al. 13. (e-h) Maximum expression of d0 and dk genes in the same datasets.

10 Su, A. I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 101, (2004).
11 Roth, R. B. et al. Gene expression analyses reveal molecular relationships among 20 regions of the human CNS. Neurogenetics 7, (2006).
12 Wang, E. T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, (2008).
13 Krupp, M. et al. RNA-Seq Atlas--a reference database for gene expression profiling in normal tissue by next-generation sequencing. Bioinformatics 28, (2012).

Columns: Reference Set Name; PPI Detection Method; Number of PPIs; PPI Sources
LUMIER LUMIER 756 Barrios-Rodiles et al.
MS HT mass spectrometry PPI Detection Miller et al., ,740 Behrends et al., Bouwmeester et al., Glatter et al., Hutchins et al., Jeronimo et al., Jorgensen et al., Sowa et al., Xiao et al.
Small-scale screens Various 4,547 compiled in I2D 14
Y2H Bandyopadhyay10 yeast 2-hybrid 1,781 Bandyopadhyay et al.
Y2H CCSB HI yeast 2-hybrid 12,227 The Center for Cancer Systems Biology (CCSB) at the Dana-Farber Cancer Institute harvard.edu/index.php?page= login&lg=/h_sapiens/index. php?page=newrelease
Y2H Wang11 yeast 2-hybrid 3,160 Wang et al.

Supplementary Table 1 Reference sets used for evaluating PPI prediction methods. Six reference sets were used to evaluate FpClass and previous prediction methods by Rhodes et al. 2, Scott et al. 3, Elefsinioti et al. 4, and Zhang et al. 5. The number of PPIs in a reference set is the size of the union of PPIs from the reference set's sources, excluding any PPIs used in the training of prediction methods. MS data (protein complexes) were converted to binary interactions using a spoke model, where the bait is assumed to interact with all members of the complex.

Columns: Protein; Symbol; d0; Score; Detected
Receptor-interacting serine/threonine-protein kinase 1 RIPK1 no yes
Caspase-8 CASP8 no no
Serine/threonine-protein kinase PAK 1 PAK1 no yes
Bcl-2-like protein 1 BCL2L1 no no
Induced myeloid leukemia cell differentiation protein Mcl-1 MCL1 no no

Supplementary Table 2 Five predicted PYCARD (PYD And CARD Domain Containing) interactions were tested by Co-IP assays and 2 were confirmed. The first 2 columns show the protein name and gene symbol of predicted interaction partners of PYCARD. Column 3 indicates whether the predicted partner is a d0 protein, column 4 shows the score of the interaction, and the last column shows whether binding to PYCARD was detected by Co-IP assays.

Columns: I2D version; Human PPIs; Human proteins
I2D ,713 9,799
I2D ver (used in analysis) 114,906 14,109
I2D ver ,831 14,565

Supplementary Table 3 Numbers of experimentally detected human PPIs and proteins in I2D 14. PPIs predicted from interacting orthologs in model organisms are not included.
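The spoke-model conversion mentioned in the Supplementary Table 1 caption can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' pipeline: the function name, the protein labels, and the choice to represent a PPI as an unordered pair are all hypothetical.

```python
def spoke_pairs(bait, complex_members):
    """Convert one purified complex into binary PPIs with a spoke model.

    The bait is paired with every other member of the complex; no pairs
    are created among the non-bait members.
    """
    pairs = set()
    for member in complex_members:
        if member != bait:                        # skip the bait itself
            pairs.add(frozenset((bait, member)))  # unordered pair
    return pairs


# One purification: the bait plus three co-purified proteins.
pulldown = spoke_pairs("BAIT", ["BAIT", "PREY1", "PREY2", "PREY3"])

# A reference set as the union of PPIs from two sources; the A-B pair
# reported by both sources is counted once.
reference = spoke_pairs("A", ["B", "C"]) | spoke_pairs("B", ["A", "D"])
```

Storing each pair as a frozenset makes A-B and B-A identical, so taking the union across sources removes duplicates without extra bookkeeping.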

Columns: Domain; InterPro ID; Description; P-Value
Olfactory receptor IPR olfactory receptor 4.84e-152
GPCR, rhodopsin-like, 7TM IPR hormone, neurotransmitter and light receptors 3.55e-76
Krueppel-associated box IPR nucleic acid binding 4.16e-20
Keratin, high sulphur B2 protein IPR synthesized during differentiation of hair matrix cells 2.71e-16
Mammalian taste receptor IPR taste receptor 1.06e-09
Zinc finger, C2H2-type/integrase, DNA-binding IPR nucleic acid binding 1.48e-09
Major facilitator superfamily, general substrate transporter IPR secondary membrane transport 9.20e-06
Zinc finger, C2H2-like IPR DNA-binding motif in eukaryotic transcription factors 6.03e-05
Cadherin, N-terminal IPR Ca2+-dependent cell-cell adhesion 2.20e-4
GAGE IPR exclusively in humans; unknown function; implicated in cancers 3.63e-4
UDP-glucuronosyl/UDP-glucosyltransferase IPR transferase activity, transfer of hexosyl groups 1.70e-3
Major facilitator superfamily MFS-1 IPR transmembrane transport 3.80e-3
Peptidase M12B, ADAM-TS IPR metallopeptidase activity; zinc ion binding 5.23e-3
ADAM-TS Spacer IPR metalloendopeptidase activity; implicated in some cancers and inflammatory diseases 1.01e-2
Cytochrome P450 IPR oxidation-reduction 1.84e-2
Sulfotransferase domain IPR sulfotransferase activity 2.44e-2

Supplementary Table 4 InterPro annotations enriched among interactome orphans. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR.

Columns: PTM; Total proteins with PTM; D0 proteins with PTM; Deficiency P-value
Acetylation e-219
Dephosphorylation e-20
Disulfide Bridge e-22
Glycosylation e-54
Methylation e-15
Myristoylation e-08
Palmitoylation e-09
Phosphorylation e-308
Prenylation e-308
Proteolytic Cleavage e-61
S-Nitrosylation e-15
Sumoylation e-308
Ubiquitination e-308

Supplementary Table 5 D0 proteins are annotated with fewer post-translational modifications (PTMs) than other proteins. Shown are the 13 most frequent PTMs of human proteins (column 1), the numbers of human proteins with PTMs (column 2), the corresponding numbers of d0 proteins (column 3), and P-values characterizing the deficiency of PTMs among d0 proteins. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR.
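A deficiency P-value of the kind reported in Supplementary Table 5 can be sketched as the lower tail of a hypergeometric distribution, followed by an FDR adjustment. The sketch below uses only the Python standard library. The tables do not say which FDR procedure was used, so the Benjamini-Hochberg step is an assumption, and the function names and example numbers are hypothetical.

```python
from math import comb


def hypergeom_pmf(k, N, M, n):
    """P(X = k) when n items are drawn from N, of which M carry the annotation."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)


def deficiency_pvalue(k_obs, N, M, n):
    """Lower-tail probability P(X <= k_obs), i.e. a test for deficiency.

    N: all proteins, M: proteins with the annotation (e.g. a PTM),
    n: d0 proteins, k_obs: d0 proteins with the annotation.
    """
    k_min = max(0, n - (N - M))   # smallest overlap that is possible
    return sum(hypergeom_pmf(k, N, M, n) for k in range(k_min, k_obs + 1))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):              # from largest to smallest P-value
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min           # enforces monotonicity
    return adjusted


p_small = deficiency_pvalue(0, 10, 5, 5)      # toy numbers, not from the tables
adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
```

For enrichment rather than deficiency, the upper tail P(X >= k_obs) would be summed instead. With proteome-scale counts, very small tail probabilities may underflow to zero in floating point.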

Columns: Feature of d0 proteins; d0; d0 Rual05; d0 CCSB HI
Protein Age: < 20M yrs 2.56e e e-01
Protein Age: 20M - 90M yrs 6.91e e e-01
Protein Age: 90M - 320M yrs 1.97e e e-01
Protein Age: 320M - 800M yrs 2.71e e e-04
GO CC: extracellular region 5.62e e e-19
GO MF: signal transducer 8.69e e e-12
GO MF: receptor 2.35e e e-29
GO MF: nucleic acid binding GO MF: transporter GO BP: lipid metabolism GO BP: transmembrane transport GO BP: carbohydrate metabolism SCOP structural class: membrane High tissue specificity Short protein length Low protein disorder 1.06e e e e e e e e e e e e e e e e e e e e e e e e e e e-27

Supplementary Table 6 Interaction detection methods are partly responsible for biases of d0 proteins. To investigate the link between detection methods and d0 biases, we analyzed two high-throughput (HT) human Y2H screens: Rual et al. 15 and the Center for Cancer Systems Biology Human Interactome (CCSB HI). For each screen we defined a set of degree 0 proteins comprising proteins that were tested in the screen but for which no interactions were detected. We then tested whether features enriched in d0 proteins (column 2) were also enriched in degree 0 proteins from Rual et al. 15 (column 3) and from CCSB HI (column 4). For most features (e.g., Protein Age: < 20M yrs), P-values were calculated using hypergeometric probability. For three features (high tissue specificity, short protein length, and low protein disorder), P-values were calculated using the Mann-Whitney U test and indicate whether degree 0 proteins had significantly lower (higher) values than proteins with degrees > 0 in the same screen(s) (e.g., column 3 compares proteins with degrees = 0 to proteins with degrees > 0 in Rual et al. 15). P-values were adjusted for multiple testing using FDR.
Columns: Feature of d0 proteins; d0; d0 Direct; d0 Indirect
Protein Age: < 20M yrs 2.56e e e-06
Protein Age: 20M - 90M yrs 6.91e e e-16
Protein Age: 90M - 320M yrs 1.97e e e-81
Protein Age: 320M - 800M yrs 2.71e e e-17
GO CC: extracellular region 5.62e e e-31
GO MF: signal transducer 8.69e e e-33
GO MF: receptor 2.35e e e-42
GO MF: nucleic acid binding 1.06e e E+00
GO MF: transporter 2.27e e e-07
GO BP: lipid metabolism 7.49e e e-05
GO BP: transmembrane transport 5.58e e e-05
GO BP: carbohydrate metabolism 2.52e e e-03
SCOP structural class: membrane 3.62e e e-54
High tissue specificity 1.75e e e-31
Short protein length 1.78e e e-18
Low protein disorder 1.29e e e-70

Supplementary Table 7 Biases of d0 proteins remain unchanged if the known human interactome is restricted to PPIs from methods that detect direct binding (e.g., Y2H) or to PPIs from methods that detect protein complexes (e.g., Co-IP). Our study defined d0 proteins as proteins absent from the known interactome, represented by the I2D database. PPI databases such as I2D include interactions from methods that detect direct binding or protein complexes. Complexes are represented as binary interactions using a spoke model, which assumes that the bait interacts with all complex members. Thus, our d0 proteins had neither direct nor indirect (i.e., spoke model) evidence for interaction. Biases of d0 proteins, along with corresponding P-values, are shown in columns 1 and 2. We investigated whether these biases would remain unchanged if the known interactome comprised a subset of PPIs: ones with evidence for direct binding, or ones based on spoke models of detected complexes. Column 3 shows P-values of proteins, D0 Direct, absent from the set of directly binding PPIs. Column 4 shows P-values of proteins, D0 Indirect, absent from the set of PPIs based on spoke models. Most biases remain significant when the definition of the known human interactome is altered. P-values were calculated by hypergeometric and Mann-Whitney U tests, and adjusted for multiple testing using FDR.

Columns: Model organism; Human genes without 1:1 orthologs; Human d0 genes without 1:1 orthologs; Deficiency P-value
Yeast (S. cerevisiae) (78%) 6446 (89%) 1.26e-174
Worm (C. elegans) (57%) 5363 (74%) 4.20e-305
Fly (D. melanogaster) (53%) 5228 (72%) 1.00e-308
Mouse (M. musculus) 3007 (15%) 2241 (31%) 1.00e-308
Rat (R. norvegicus) 3739 (19%) 2452 (34%) 1.00e-308

Supplementary Table 8 D0 proteins are less likely to have orthologs in model organisms than other human proteins. The above table shows five model organisms (column 1), the total number of human proteins without 1:1 orthologs in these organisms (column 2), corresponding numbers of d0 proteins, and P-values characterizing the deficiency of d0 proteins with model organism orthologs. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR.

Columns: Feature of d0 proteins; d0; d1-d4; d5-d15
Protein Age: < 20M yrs 2.56e e e-01
Protein Age: 20M - 90M yrs 6.91e e e-01
Protein Age: 90M - 320M yrs 1.97e e e-01
Protein Age: 320M - 800M yrs 2.71e e e-01
GO CC: extracellular region 5.62e e e-01
GO MF: signal transducer 8.69e e e+00
GO MF: receptor 2.35e e e+00
GO MF: nucleic acid binding 1.06e e e+00
GO MF: transporter 2.27e e e+00
GO BP: lipid metabolism 7.49e e e+00
GO BP: transmembrane transport 5.58e e e+00
GO BP: carbohydrate metabolism 2.52e e e+00
SCOP structural class: membrane 3.62e e e-01
High tissue specificity 8.75e e e-01
Short protein length 1.78e e e-01
Low protein disorder 6.47e e e+00

Supplementary Table 9 Human proteins with few known interactions (degrees 1 through 4) have most of the same biases as d0 proteins, while proteins with higher degrees have few such biases. The above table shows features enriched in d0 proteins (column 1), and enrichment P-values of these features in d0 proteins (column 2), proteins with degrees 1 through 4 (column 3), proteins with degrees 5 through 15 (column 4), and proteins with degrees 16 through 1651 (column 5). Most features (18/22) enriched in d0 proteins are also enriched in proteins with degrees 1 through 4; only 1 feature is enriched in proteins with higher degrees. P-values were calculated by hypergeometric and Mann-Whitney U tests, and adjusted for multiple testing using FDR.

Columns: Feature of d0 proteins; d0; d0 Fp60
Protein Age: < 20M yrs 2.56e e-01
Protein Age: 20M - 90M yrs 6.91e e-01
Protein Age: 90M - 320M yrs 1.97e e-10
Protein Age: 320M - 800M yrs 2.71e e-07
GO CC: extracellular region 5.62e e-04
GO MF: signal transducer 8.69e e+00
GO MF: receptor 2.35e e-01
GO MF: nucleic acid binding 1.06e e+00
GO MF: transporter 2.27e e-03
GO BP: lipid metabolism 7.49e e-01
GO BP: transmembrane transport 5.58e e-01
GO BP: carbohydrate metabolism 2.52e e-01
SCOP structural class: membrane 3.62e e-04
High tissue specificity 8.75e e-01
Short protein length 1.78e e-01
Low protein disorder 6.47e e-12

Supplementary Table 10 D0 proteins with predicted high-confidence interactions have many of the same biases as other d0 proteins. The above table shows features enriched in d0 proteins (column 1), enrichment P-values among all d0 proteins (column 2), and enrichment P-values among d0 proteins in the Fp60 network (i.e., d0 proteins with high-confidence predicted interactions).

Supplementary Note 1

1. Features of individual proteins: description and sources

Features of individual proteins comprised protein domains, post-translational modifications (PTMs), structural-chemical protein features, and Gene Ontology annotations (cellular component, molecular function, and biological process). Domains were obtained from InterPro 28, version 21.0, and UniProt 8, release. Post-translational modifications were obtained from UniProt 8, release 15.0, and the Human Protein Reference Database (HPRD) 29, release 8.0. The domains and PTMs of a protein were represented by a binary vector; each entry in the vector indicated whether a given domain or PTM was present in the protein.

Structural-chemical features were determined from protein sequence by two programs: PSIPRED 30,31, version 26, and the pepstats application from the European Molecular Biology Open Software Suite 32. PSIPRED 30,31 was used to predict the fraction of a protein's residues in disordered regions, alpha helices, beta sheets and coils. Pepstats was used to calculate 11 chemical features: charge, isoelectric point, and the molar percent of each physico-chemical class of amino acid. Each feature calculated by PSIPRED 30,31 and pepstats was discretized into 7 intervals. An interval represented a range of percentiles; for example, the first interval contained values between the 0th and 2.5th percentiles. The 7 intervals were defined as follows: [0%,2.5%], [0%,10%], [0%,40%], [40%,60%], [60%,100%], [90%,100%], [97.5%,100%]. The intervals overlapped so that a value could fall into more than one of them, with the goal of capturing different levels of low and high values as well as intermediate values.

Human Gene Ontology annotations were downloaded from the Gene Ontology 33 website on Apr. 21. Proteins were annotated with cellular component, molecular function, and biological process terms specified in the downloaded file, as well as ancestors of these terms.

2. Calculating interaction scores from features of individual proteins

To predict PPIs from features of individual proteins, we identified pairs of features enriched among known interacting protein pairs (such that each protein in an interacting pair has one of the features), and used these feature pairs as rules for predicting interactions. This approach was originally proposed by Sprinzak et al. 34 and has been used to predict PPIs based on pairs of domains 2,3,34 and pairs of post-translational modifications 3 enriched among known interacting protein pairs. The main aspects of the approach have remained similar across different studies: a single feature type is chosen (e.g., protein domains), a pair of features of this type (one feature on each protein) provides evidence of interaction, and the strength of the evidence is proportional to the enrichment of the feature pair among known interactions.

To make predictions more comprehensive, less biased and ultimately more accurate, we extended this approach by considering three possibilities: (1) an interaction could require the presence of several features in a protein, (2) the required features could be of the same or of different types, and (3) the presence of particular features in a pair of proteins could provide evidence for or against interaction. To take these possibilities into account, we made 3 changes to the original approach:
1. we identified sets of features, of the same or different types, which co-occur in proteins;
2. we identified pairs of feature sets that were either enriched or deficient among known PPIs;
3. we filtered resulting feature set pairs to reduce redundancy and improve prediction accuracy.

2.1. Identifying sets of co-occurring features

Interactions in which a protein participates may require the presence of several features (e.g., multiple domains); we assume that a group of such features is likely to co-occur in proteins more frequently than expected by chance. To identify such feature sets we implemented a data-mining algorithm based on frequent pattern growth 35, a method that finds all sets of features that occur at least k times in a dataset, where k is a user-specified threshold. Input to a frequent pattern growth algorithm consists of records, where each record is a set of features. Our input comprised records for 19,698 proteins; each record contained features of a single protein, such as domains and PTMs. Output from frequent pattern growth consists of feature sets that are subsets of at least k records. For example, with a setting of k = 3, we identified the feature set {Myb DNA binding domain, Homeodomain-related}, which meant that these two domains occurred together in at least 3 proteins. Feature sets that have at least k occurrences are referred to as frequent feature sets, and their number of occurrences is referred to as support.

To reduce the search time and output of frequent pattern growth, we made several modifications to the algorithm:

(1) A feature set was discarded if its support was similar to that of its subsets, i.e., at least 80% of a subset's support. Such feature sets were not recorded and their supersets were not considered. This was done because a set with similar support to a subset would provide similar information about interaction: the two sets would be present on most of the same proteins, and likely in many of the same interacting protein pairs.

(2) Whenever a feature f was considered as an extension for feature set FS, the support of FS ∪ f had to be substantially higher than expected by chance. The expected support of FS ∪ f was calculated as: expected sup(FS ∪ f) = sup(FS) × sup(f) / N, where N was the number of records in the dataset. To retain feature set FS ∪ f and expand it further, the following criteria had to be met: (1)

(2) (3) When feature f was considered as an extension for set FS, the probability of sup(FS ∪ f) being equal to or greater than its observed value had to be less than a minimum threshold. This probability was calculated with the hypergeometric distribution, H(N, M, n, m), using the following parameters: N = total records (i.e., number of proteins), M = sup(f), n = sup(FS), and m = sup(FS ∪ f). If the cumulative probability was > 0.05, feature set FS ∪ f was not recorded and its supersets were not considered.

After identifying feature sets we annotated all proteins with both their original features and with feature sets. No distinction was made between the two types of annotations: original features were considered as feature sets of length 1. We refer to the feature sets of a protein simply as features.

2.2. Identifying pairs of enriched and deficient feature sets

For all annotated proteins we determined the support of features and feature pairs among training cases (interacting and non-interacting protein pairs). For each feature, i, we determined support values possup_i and negsup_i among positive and negative training cases, respectively. Support was defined as the number of protein pairs where at least one of the proteins was annotated with the feature. Similarly, for each pair of features, (i, j), we calculated support values possup_ij and negsup_ij among positive and negative cases, respectively. In this case, support was defined as the number of protein pairs where one of the proteins had feature i and the other had feature j.

For each feature pair, (i, j), we used support values to calculate several measures quantifying enrichment or deficiency of the pair among positive training cases. Two measures, rpos and ppos, represented enrichment or deficiency relative to the expected occurrences of (i, j) among positive cases. We defined the expected occurrences of (i, j) among positive cases as:
expected possup_ij = (possup_i × possup_j) / npos, (3)
where npos is the number of positive training cases. rpos was the ratio between the observed and the expected support:
rpos = possup_ij / expected possup_ij. (4)
If the value of rpos was greater than 1, ppos was the probability of possup_ij being greater than or equal to its observed value, given the values of possup_i and possup_j. If rpos was less than 1, ppos was the probability of possup_ij being less than or equal to its observed value. ppos was calculated by the cumulative hypergeometric distribution with the settings N = npos, M = possup_i, n = possup_j, m = possup_ij.

Two other measures, rall and pall, represented enrichment or deficiency of the feature pair (i, j) among positive cases, relative to the expected occurrences of (i, j) among negative cases. The number of expected occurrences among negative cases was defined as:

(5)
where nneg is the number of negative training cases. rall was defined as:
(6)
rall > 1 indicated enrichment of (i, j) among positive cases, while rall < 1 indicated deficiency. pall was defined as the cumulative hypergeometric probability with parameters N = nall, M = npos, n = possup_ij + negsup_ij, m = possup_ij, where nall is the total number of training cases. If rall was greater than 1 the right tail of the distribution was used; otherwise, the left tail was used. Feature pairs were considered to be enriched or deficient among positive cases if their values of ppos and pall were less than

2.3. Calculating interaction scores from feature pairs

A set, S, of enriched and deficient feature pairs was used to determine interaction scores. To calculate an interaction score for a protein pair (P_a, P_b), feature pairs (i, j) from S were selected such that i was a feature of one protein, and j was a feature of the other. Among the selected feature pairs, the pairs with the highest and lowest rall values were identified. These pairs, fp_max and fp_min, provided the strongest evidence for and against interaction, respectively. Their rall values, rall_max and rall_min, were used to set the interaction score as follows:
(7)

2.4. Filtering feature sets

Using feature sets of length 1 resulted in a large number of redundant feature pairs: multiple feature pairs present in largely the same protein pairs. The presence of such feature pairs lowered prediction accuracy. This happened when feature pairs predicted the same true positive cases, but different false positive cases. For example, feature pairs (i, j) and (k, l) could each be fairly accurate, predicting ntp true positives and nfp false positives with nfp = 1/5 ntp. However, if they predict the same true positives but different false positives, then using them together results in ntp true positives and 2 × nfp false positives. Combining more such feature pairs would give a linear increase in the number of false positives.

To reduce this problem we ensured that only one feature pair could gain support from a positive training case. This was implemented with a greedy set cover algorithm. All feature pairs from S with rall > 1 were placed in a set P. The feature pair fp_max with the highest rall value was identified and moved from P to a set Q. Positive training cases where fp_max was present were identified. If a feature pair in P was present in k of these cases, its possup value was lowered by k. Its rall value was recalculated, and if the rall value reached 1, the feature pair was removed from P. These steps were repeated until set P was empty. Interaction scores for test cases were then determined from feature pairs in Q and feature pairs in S that had rall values less than 1.

3. Features of protein pairs

Features of protein pairs consisted of information about interacting orthologs, paralogs, gene co-expression and network topology.

Information about interacting orthologs was taken from I2D 14, version 1.95, on Apr. 3. For a given protein pair, there were 5 interaction scores based on orthology information. If human proteins (i, j) had interacting orthologs (i', j') in a model organism, sequence identities were determined between i and i', and between j and j'. The lower of these sequence identities was used as a score for proteins (i, j). Such scores were determined based on 5 model organisms: mouse, rat, fly, worm, and yeast. Sequence identities were obtained from Ensembl BioMart (Ensembl Genes 63). A score based on paralogy data was determined in a similar way: if proteins (i, j) had interacting paralogs (i', j') in the training data, the lower of the two sequence identities, I(i, i') and I(j, j'), was used as a score. Sequence identities were obtained from Ensembl BioMart (Ensembl Genes 63).

Gene co-expression information was based on 10 gene expression datasets from the Gene Expression Omnibus 36: GDS596, GDS1221, GDS1289, GDS1329, GDS1618, GDS1730, GDS2250, GDS2545, GDS2780, and GDS2842. Each of these datasets contained over 8,000 genes measured in at least 15 samples. Each dataset was processed by the MAS 5.0 algorithm, using the affy package (version ) in R (version 2.8) 37. Expression levels of each sample were mean-centered and levels from multiple probe sets for the same gene were averaged. Using each dataset, Pearson correlation coefficients were calculated for all available gene pairs; these correlations were used as interaction scores.

Topology information was based on a PPI network comprising positive training cases. For a given protein pair (a, b), three interaction scores were calculated based on the known interactors of proteins a and b. The first score was simply the number of interactors shared by the two proteins, |I_c|, where I_c is the set of shared interactors of proteins a and b. The second score, from Scott et al. 3, adjusted the number of shared interactors by the degrees of proteins a and b: where E_c is the set of edges from proteins a and b to their shared interactors, E_a is the set of edges of protein a and E_a \ E_c is the set E_a minus the set E_c. Proteins a and b received a high score if most of their interactors were the same. The third score, S_pshared, estimated the probability that proteins a and b would share at least their observed k common neighbors. This estimate depended on 4 variables: the degrees of a and b, the number of shared neighbors and the degrees of the shared neighbors. To derive this estimate we started with the simplifying assumption that all neighbors of a and b had equal degrees. We defined a such that deg(a) deg(b) and estimated the probability of exactly k shared neighbors as follows:
(8)
(9)
(10)
where P( ) is the probability of an interaction between a and a neighbor of b, n_b. We defined P( ) as:

(11)
If degrees are not equal, the probability of k shared neighbors depends on the degrees of these neighbors. Since we were interested in the probability of the observed shared neighbors, N_ba, we defined our required probability, P, as follows: the probability of a and b sharing at least k neighbors, with degrees similar to or lower than those in the set N_ba. Before calculating this probability we defined three sets containing neighbors of b: set N_b containing all neighbors of b, set containing neighbors not shared with protein a, and the previously mentioned set N_ba containing neighbors shared with protein a. We estimated P for exactly k shared neighbors as follows:
(12)
where is the maximum degree of neighbors in the set N_ba, and where,
(13)
where is the probability of an interaction between a and the j-th neighbor in set.

4. Calculating a probability of interaction from interaction scores

For a given protein pair, (i, j), we calculated a single probability of interaction based on interaction scores. This was done in three steps: first, each score was used to calculate a probability of interaction; second, these probabilities were combined into a single probability using a noisy-or model; and lastly, a final probability of interaction was calculated, taking into account the distribution of noisy-or probabilities among training cases, and the frequency of interactions among human protein pairs.

In the first step, for each score, s_(i,j),k, a probability of interaction, P_(i,j),k, was calculated as
(14)
In the second step, probabilities from all scores were integrated into a single probability using a noisy-or model 38,39:
P_(i,j),noisy-OR = 1 - [(1 - P_(i,j),1) × (1 - P_(i,j),2) × ... × (1 - P_(i,j),n)], (15)
where n is the number of scores. In the third step a final probability of interaction was calculated as
(16)
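The noisy-or combination of step two, and the prior adjustment described in the paragraph that follows, can be sketched in Python. This is an illustration under stated assumptions: the noisy-or is written in its standard form without a leak term, the prior adjustment uses the usual odds form of Bayes' rule (likelihood ratio = posterior odds / prior odds), and the third-step calculation of equation (16) is not reproduced. The function and parameter names are hypothetical.

```python
def noisy_or(probabilities):
    """Combine per-score interaction probabilities with a noisy-or model."""
    product = 1.0
    for p in probabilities:
        product *= (1.0 - p)      # probability that no score signals interaction
    return 1.0 - product


def adjust_prior(p100, train_ratio=100, target_ratio=600):
    """Re-express a probability obtained under a 1:100 positive-to-negative
    ratio (prior 1/101) under a 1:600 ratio (prior 1/601).

    Assumed calculation: LR = posterior odds / prior odds, then new
    posterior odds = LR x new prior odds.
    """
    if p100 >= 1.0:
        return 1.0
    likelihood_ratio = (p100 / (1.0 - p100)) * train_ratio  # prior odds = 1/100
    odds600 = likelihood_ratio / target_ratio               # prior odds = 1/600
    return odds600 / (1.0 + odds600)


combined = noisy_or([0.5, 0.5])   # two scores, each giving probability 0.5
p600 = adjust_prior(0.5)          # a P100 of 0.5 re-expressed as P600
```

In this sketch a P100 of 0.5 corresponds to a likelihood ratio of 100 and a P600 of 1/7, which shows how strongly the assumed 1:600 prior lowers the reported probability.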

Lastly, this probability was adjusted to account for the fact that the frequency of positive cases in training data, 1:100, is likely higher than the frequency of interactions among human protein pairs, which we assumed to be 1:600. We viewed this adjustment as recalculating the posterior probability using a prior of 1/601 rather than 1/101; we refer to the unadjusted probability as P100 and to the adjusted probability as P600. First, we calculated a likelihood ratio, LR, based on P100. Next, we calculated P600 based on LR and a prior probability of 1/601.

References

1. D'Haeseleer, P. & Church, G. M. Estimating and improving protein interaction error rates. Proc IEEE Comput Syst Bioinform Conf (2004).
2. Rhodes, D. R. et al. Probabilistic model of the human protein-protein interaction network. Nat Biotechnol 23 (2005).
3. Scott, M. S. & Barton, G. J. Probabilistic prediction and ranking of human protein-protein interactions. BMC Bioinformatics 8, 239 (2007).
4. Elefsinioti, A. et al. Large-scale de novo prediction of physical protein-protein association. Mol Cell Proteomics 10, M (2011).
5. Zhang, Q. C. et al. Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature 490 (2012).
6. Imming, P., Sinning, C. & Meyer, A. Drugs, their targets and the nature and number of drug targets. Nat Rev Drug Discov 5 (2006).
7. Andreeva, A. et al. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res 36, D (2008).
8. UniProt Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 38, D142–8 (2010).

9. Ward, J. J., Sodhi, J. S., McGuffin, L. J., Buxton, B. F. & Jones, D. T. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 337 (2004).
10. Su, A. I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 101 (2004).
11. Roth, R. B. et al. Gene expression analyses reveal molecular relationships among 20 regions of the human CNS. Neurogenetics 7 (2006).
12. Wang, E. T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456 (2008).
13. Krupp, M. et al. RNA-Seq Atlas--a reference database for gene expression profiling in normal tissue by next-generation sequencing. Bioinformatics 28 (2012).
14. Brown, K. R. & Jurisica, I. Unequal evolutionary conservation of human protein interactions in interologous networks. Genome Biol 8, R95 (2007).
15. Rual, J. F. et al. Towards a proteome-scale map of the human protein-protein interaction network. Nature 437 (2005).
16. Barrios-Rodiles, M. et al. High-throughput mapping of a dynamic signaling network in mammalian cells. Science 307 (2005).
17. Zhu, H. et al. Global analysis of protein activities using proteome chips. Science 293 (2001).
18. Behrends, C., Sowa, M. E., Gygi, S. P. & Harper, J. W. Network organization of the human autophagy system. Nature 466 (2010).
19. Bouwmeester, T. et al. A physical and functional map of the human TNF-alpha/NF-kappa B signal transduction pathway. Nat Cell Biol 6 (2004).
20. Glatter, T., Wepf, A., Aebersold, R. & Gstaiger, M. An integrated workflow for charting the human interaction proteome: insights into the PP2A system. Mol Syst Biol 5, 237 (2009).
21. Hutchins, J. R. et al. Systematic analysis of human protein complexes identifies chromosome segregation proteins. Science 328 (2010).
22. Jeronimo, C. et al. Systematic analysis of the protein interaction network for the human transcription machinery reveals the identity of the 7SK capping enzyme. Mol Cell 27 (2007).
23. Jorgensen, C. et al. Cell-specific information processing in segregating populations of Eph receptor ephrin-expressing cells. Science 326 (2009).
24. Sowa, M. E., Bennett, E. J., Gygi, S. P. & Harper, J. W. Defining the human deubiquitinating enzyme interaction landscape. Cell 138 (2009).

25. Xiao, K. et al. Functional specialization of beta-arrestin interactions revealed by proteomic analysis. Proc Natl Acad Sci U S A 104 (2007).
26. Bandyopadhyay, S. et al. A human MAP kinase interactome. Nat Methods 7 (2010).
27. Wang, J. et al. Toward an understanding of the protein interaction network of the human liver. Mol Syst Biol 7, 536 (2011).
28. Hunter, S. et al. InterPro: the integrative protein signature database. Nucleic Acids Res 37, D211–5 (2009).
29. Keshava Prasad, T. S. et al. Human Protein Reference Database update. Nucleic Acids Res 37, D (2009).
30. Bryson, K. et al. Protein structure prediction servers at University College London. Nucleic Acids Res 33, W36–8 (2005).
31. Jones, D. T. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292 (1999).
32. Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 16 (2000).
33. Gene Ontology Consortium. Creating the gene ontology resource: design and implementation. Genome Res 11 (2001).
34. Sprinzak, E. & Margalit, H. Correlated sequence-signatures as markers of protein-protein interaction. J Mol Biol 311 (2001).
35. Han, J., Pei, J. & Yin, Y. Mining frequent patterns without candidate generation (2000).
36. Barrett, T. et al. NCBI GEO: mining millions of expression profiles--database and tools. Nucleic Acids Res 33, D562–6 (2005).
37. R Development Core Team. R: A Language and Environment for Statistical Computing. (2008).
38. Kim, J. H. & Pearl, J. A computational model for causal and diagnostic reasoning in inference system. IJCAI (1983).
39. Parsons, S. & Bigham, J. Possibility theory and the generalised Noisy OR model. in Proc. 6th Int. Conf. Inf. Process. Manag. Uncertain (1996).


More information

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor Biological Networks:,, and via Relative Description Length By: Tamir Tuller & Benny Chor Presented by: Noga Grebla Content of the presentation Presenting the goals of the research Reviewing basic terms

More information

Regulation and signaling. Overview. Control of gene expression. Cells need to regulate the amounts of different proteins they express, depending on

Regulation and signaling. Overview. Control of gene expression. Cells need to regulate the amounts of different proteins they express, depending on Regulation and signaling Overview Cells need to regulate the amounts of different proteins they express, depending on cell development (skin vs liver cell) cell stage environmental conditions (food, temperature,

More information

Lecture 4: Yeast as a model organism for functional and evolutionary genomics. Part II

Lecture 4: Yeast as a model organism for functional and evolutionary genomics. Part II Lecture 4: Yeast as a model organism for functional and evolutionary genomics Part II A brief review What have we discussed: Yeast genome in a glance Gene expression can tell us about yeast functions Transcriptional

More information

Genome-wide multilevel spatial interactome model of rice

Genome-wide multilevel spatial interactome model of rice Sino-German Workshop on Multiscale Spatial Computational Systems Biology, Beijing, Oct 8-12, 2015 Genome-wide multilevel spatial interactome model of rice Ming CHEN ( 陈铭 ) mchen@zju.edu.cn College of Life

More information

Correspondence of D. melanogaster and C. elegans developmental stages revealed by alternative splicing characteristics of conserved exons

Correspondence of D. melanogaster and C. elegans developmental stages revealed by alternative splicing characteristics of conserved exons Gao and Li BMC Genomics (2017) 18:234 DOI 10.1186/s12864-017-3600-2 RESEARCH ARTICLE Open Access Correspondence of D. melanogaster and C. elegans developmental stages revealed by alternative splicing characteristics

More information

Supplementary information. A proposal for a novel impact factor as an alternative to the JCR impact factor

Supplementary information. A proposal for a novel impact factor as an alternative to the JCR impact factor Supplementary information A proposal for a novel impact factor as an alternative to the JCR impact factor Zu-Guo Yang a and Chun-Ting Zhang b, * a Library, Tianjin University, Tianjin 300072, China b Department

More information

Carri-Lyn Mead Thursday, January 13, 2005 Terry Fox Laboratory, Dr. Dixie Mager

Carri-Lyn Mead Thursday, January 13, 2005 Terry Fox Laboratory, Dr. Dixie Mager Investigating Trends in Transposable Element Insertion within Regulatory Regions Carri-Lyn Mead cmead@bcgsc.ca Thursday, January 13, 2005 Terry Fox Laboratory, Dr. Dixie Mager Outline Transposable Element

More information

Tandem Mass Spectrometry: Generating function, alignment and assembly

Tandem Mass Spectrometry: Generating function, alignment and assembly Tandem Mass Spectrometry: Generating function, alignment and assembly With slides from Sangtae Kim and from Jones & Pevzner 2004 Determining reliability of identifications Can we use Target/Decoy to estimate

More information

Computational Structural Bioinformatics

Computational Structural Bioinformatics Computational Structural Bioinformatics ECS129 Instructor: Patrice Koehl http://koehllab.genomecenter.ucdavis.edu/teaching/ecs129 koehl@cs.ucdavis.edu Learning curve Math / CS Biology/ Chemistry Pre-requisite

More information

Understanding Science Through the Lens of Computation. Richard M. Karp Nov. 3, 2007

Understanding Science Through the Lens of Computation. Richard M. Karp Nov. 3, 2007 Understanding Science Through the Lens of Computation Richard M. Karp Nov. 3, 2007 The Computational Lens Exposes the computational nature of natural processes and provides a language for their description.

More information

BIOINFORMATICS LAB AP BIOLOGY

BIOINFORMATICS LAB AP BIOLOGY BIOINFORMATICS LAB AP BIOLOGY Bioinformatics is the science of collecting and analyzing complex biological data. Bioinformatics combines computer science, statistics and biology to allow scientists to

More information

2. Yeast two-hybrid system

2. Yeast two-hybrid system 2. Yeast two-hybrid system I. Process workflow a. Mating of haploid two-hybrid strains on YPD plates b. Replica-plating of diploids on selective plates c. Two-hydrid experiment plating on selective plates

More information

An Efficient Algorithm for Protein-Protein Interaction Network Analysis to Discover Overlapping Functional Modules

An Efficient Algorithm for Protein-Protein Interaction Network Analysis to Discover Overlapping Functional Modules An Efficient Algorithm for Protein-Protein Interaction Network Analysis to Discover Overlapping Functional Modules Ying Liu 1 Department of Computer Science, Mathematics and Science, College of Professional

More information

Amino Acid Structures from Klug & Cummings. 10/7/2003 CAP/CGS 5991: Lecture 7 1

Amino Acid Structures from Klug & Cummings. 10/7/2003 CAP/CGS 5991: Lecture 7 1 Amino Acid Structures from Klug & Cummings 10/7/2003 CAP/CGS 5991: Lecture 7 1 Amino Acid Structures from Klug & Cummings 10/7/2003 CAP/CGS 5991: Lecture 7 2 Amino Acid Structures from Klug & Cummings

More information

Measuring TF-DNA interactions

Measuring TF-DNA interactions Measuring TF-DNA interactions How is Biological Complexity Achieved? Mediated by Transcription Factors (TFs) 2 Regulation of Gene Expression by Transcription Factors TF trans-acting factors TF TF TF TF

More information

Protein Structures. Sequences of amino acid residues 20 different amino acids. Quaternary. Primary. Tertiary. Secondary. 10/8/2002 Lecture 12 1

Protein Structures. Sequences of amino acid residues 20 different amino acids. Quaternary. Primary. Tertiary. Secondary. 10/8/2002 Lecture 12 1 Protein Structures Sequences of amino acid residues 20 different amino acids Primary Secondary Tertiary Quaternary 10/8/2002 Lecture 12 1 Angles φ and ψ in the polypeptide chain 10/8/2002 Lecture 12 2

More information

Constructing Signal Transduction Networks Using Multiple Signaling Feature Data

Constructing Signal Transduction Networks Using Multiple Signaling Feature Data Constructing Signal Transduction Networks Using Multiple Signaling Feature Data Thanh-Phuong Nguyen 1, Kenji Satou 2, Tu-Bao Ho 1 and Katsuhiko Takabayashi 3 1 Japan Advanced Institute of Science and Technology

More information

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche The molecular structure of a protein can be broken down hierarchically. The primary structure of a protein is simply its

More information

Genome-Scale Gene Function Prediction Using Multiple Sources of High-Throughput Data in Yeast Saccharomyces cerevisiae ABSTRACT

Genome-Scale Gene Function Prediction Using Multiple Sources of High-Throughput Data in Yeast Saccharomyces cerevisiae ABSTRACT OMICS A Journal of Integrative Biology Volume 8, Number 4, 2004 Mary Ann Liebert, Inc. Genome-Scale Gene Function Prediction Using Multiple Sources of High-Throughput Data in Yeast Saccharomyces cerevisiae

More information

Matrix-based pattern discovery algorithms

Matrix-based pattern discovery algorithms Regulatory Sequence Analysis Matrix-based pattern discovery algorithms Jacques.van.Helden@ulb.ac.be Université Libre de Bruxelles, Belgique Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe)

More information

Improved network-based identification of protein orthologs

Improved network-based identification of protein orthologs BIOINFORMATICS Vol. 24 ECCB 28, pages i2 i26 doi:.93/bioinformatics/btn277 Improved network-based identification of protein orthologs Nir Yosef,, Roded Sharan and William Stafford Noble 2,3 School of Computer

More information