Nature Methods: doi: /nmeth Supplementary Figure 1
- Julius Simon
- 5 years ago
Supplementary Figure 1 Estimating FDR of PPI predictions. (a-b) We used the approach of D'Haeseleer and Church 1 to estimate FDR. This approach calculates the FDR of a PPI dataset, D, by analyzing intersections among three PPI datasets, D, R, and D', where R is a reference set of trusted PPIs and D' is a set of PPIs from a method similar to that of D. It is assumed that the overlap of any two datasets contains largely true positive PPIs. The number of nonoverlapping true positives, IV, is calculated from the numbers of shared PPIs: IV = (II × III) / I. Then, the number of false positives, V, and the FDR are calculated. The FDR tends to be low if D has a high overlap with either D' or R. (c) To calculate the FDR of FpClass, we initially set D to our top 35,000 proteome-wide predictions, excluding any PPIs used in training (we subsequently calculated FDRs for larger sets of FpClass predictions; panels d-g). We defined R as a set of experimentally detected interactions and D' as the union of high-confidence predictions from previous studies by Rhodes et al. 2, Scott et al. 3, Elefsinioti et al. 4, and Zhang et al. 5. Using a similar approach, we calculated FDRs for high-confidence predictions from these previous studies. For example, to calculate the FDR for Rhodes et al. 2, we defined D as the high-confidence predictions from that study, and D' as the union of top FpClass predictions and high-confidence predictions from the three remaining previous studies. To ensure that estimated FDRs were not due to biases of a particular reference set, we repeated the FDR calculations using 6 reference sets. We calculated FDRs using each reference set, except when the intersection of datasets D, D', and R comprised fewer than 5 PPIs. In such cases the FDR is indicated as NA. (d-g) Using the approach of D'Haeseleer and Church 1, we estimated FDRs of predicted networks of various sizes from FpClass and four previous prediction methods.
The approach of D'Haeseleer and Church 1 requires a trusted reference set of PPIs. We tried four ways of defining this set: (d) using the six reference sets (panel c) individually and then calculating the median of the six resulting FDR estimates, (e) using the union of PPIs from methods that detect direct interactions (Y2H and LUMIER reference sets), (f) using the union of our six reference sets, and (g) using the union of the Y2H reference sets.
1 D'Haeseleer, P. & Church, G. M. Estimating and improving protein interaction error rates. Proc IEEE Comput Syst Bioinform Conf (2004).
2 Rhodes, D. R. et al. Probabilistic model of the human protein-protein interaction network. Nat Biotechnol 23, (2005).
3 Scott, M. S. & Barton, G. J. Probabilistic prediction and ranking of human protein-protein interactions. BMC Bioinformatics 8, 239 (2007).
4 Elefsinioti, A. et al. Large-scale de novo prediction of physical protein-protein association. Mol Cell Proteomics 10, M (2011).
5 Zhang, Q. C. et al. Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature 490, (2012).
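The overlap-based estimate described in panels a-b can be illustrated with a minimal Python sketch. Several details are assumptions made for illustration and are not stated in this supplement: the assignment of regions II and III to the two pairwise overlaps of D, the definition of the FDR as V divided by the size of D, and the function name `estimate_fdr`. The cutoff of 5 shared PPIs follows the caption for panel c.

```python
def estimate_fdr(d, d_prime, r, min_triple_overlap=5):
    """Estimate the FDR of PPI set `d` from its overlaps with `d_prime` and `r`.

    Each argument is a set of hashable PPI identifiers (for example, frozensets
    of two protein IDs). Returns None when the triple overlap is too small,
    mirroring the "NA" entries described in the caption.
    """
    region_i = len(d & d_prime & r)            # I: PPIs present in all three sets
    if region_i < min_triple_overlap:
        return None
    region_ii = len((d & r) - d_prime)         # II (assumed): shared with R only
    region_iii = len((d & d_prime) - r)        # III (assumed): shared with D' only
    unique_to_d = len(d - d_prime - r)         # IV + V: PPIs found only in D
    region_iv = region_ii * region_iii / region_i   # IV = (II x III) / I
    region_v = max(unique_to_d - region_iv, 0.0)    # V: estimated false positives
    return region_v / len(d)                   # assumed definition: FDR = V / |D|


# Toy example with integer IDs standing in for PPIs.
d = set(range(100))
d_prime = set(range(0, 10)) | set(range(30, 45))
r = set(range(0, 30))
fdr = estimate_fdr(d, d_prime, r)
```

In the toy example, I = 10, II = 20, III = 15, and 55 PPIs are unique to D, so IV = 30, V = 25, and the estimated FDR is 0.25.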
Supplementary Figure 2 Experimental validation of PPI predictions. (a) Predicted interactions tested by Co-IP assays. (b-c) Predicted interactions tested by GST pull-down assays. (d) Predicted interaction partners of p53 include some of its known partners and d0 proteins. The x-axis indicates the number of top predicted partners, ranked from 1 to [...]. The y-axis indicates the number of known partners and d0 proteins among the top predicted partners.
Supplementary Figure 3 Top Gene Ontology (GO) categories among d0 genes. (a-c) GO analysis includes genes without GO annotations. (d-f) GO analysis excludes genes without GO annotations. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR.
Supplementary Figure 4 Percentages of d0 proteins in drug-target classes and structural properties of d0 proteins. (a) Main drug target classes and (b) receptor drug target classes, as defined by Imming et al. 6. Dashed lines indicate the percentage of d0 proteins in the proteome. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR. (c) SCOP 7 structural classes. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR. (d) Protein lengths from UniProt 8 and (e) protein disorder, predicted with DISOPRED 9. P-values for protein length and disorder were calculated by two-sided Mann-Whitney U tests.
6 Imming, P., Sinning, C. & Meyer, A. Drugs, their targets and the nature and number of drug targets. Nat Rev Drug Discov 5, (2006).
7 Andreeva, A. et al. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res 36, D (2008).
8 The UniProt Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 38, D142-8 (2010).
9 Ward, J. J., Sodhi, J. S., McGuffin, L. J., Buxton, B. F. & Jones, D. T. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 337, (2004).
Supplementary Figure 5 Median and maximum expression of d0- and dk-encoding genes. P-values were calculated by two-sided Mann-Whitney U tests. (a-d) Median expression of d0 and dk genes in healthy human tissues. Gene expression data were taken from (a) Su et al. 10, (b) Roth et al. 11, (c) Wang et al. 12, and (d) Krupp et al. 13. (e-h) Maximum expression of d0 and dk genes in the same datasets.
10 Su, A. I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 101, (2004).
11 Roth, R. B. et al. Gene expression analyses reveal molecular relationships among 20 regions of the human CNS. Neurogenetics 7, (2006).
12 Wang, E. T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, (2008).
13 Krupp, M. et al. RNA-Seq Atlas--a reference database for gene expression profiling in normal tissue by next-generation sequencing. Bioinformatics 28, (2012).
Reference Set Name | PPI Detection Method | Number of PPIs | PPI Sources
LUMIER | LUMIER | 756 | Barrios-Rodiles et al., Miller et al.
MS | HT mass spectrometry | [...],740 | Behrends et al., Bouwmeester et al., Glatter et al., Hutchins et al., Jeronimo et al., Jorgensen et al., Sowa et al., Xiao et al.
Small-scale screens | Various | 4,547 | compiled in I2D 14
Y2H Bandyopadhyay10 | yeast 2-hybrid | 1,781 | Bandyopadhyay et al.
Y2H CCSB HI | yeast 2-hybrid | 12,227 | The Center for Cancer Systems Biology (CCSB) at the Dana-Farber Cancer Institute, harvard.edu/index.php?page=login&lg=/h_sapiens/index.php?page=newrelease
Y2H Wang11 | yeast 2-hybrid | 3,160 | Wang et al.
Supplementary Table 1 Reference sets used for evaluating PPI prediction methods. Six reference sets were used to evaluate FpClass and previous prediction methods by Rhodes et al. 2, Scott et al. 3, Elefsinioti et al. 4, and Zhang et al. 5. The number of PPIs in a reference set is the size of the union of PPIs from the reference set's sources, excluding any PPIs used in the training of prediction methods. MS data (protein complexes) were converted to binary interactions using a spoke model, where the bait is assumed to interact with all members of the complex.

Protein | Symbol | d0 | Score | Detected
Receptor-interacting serine/threonine-protein kinase 1 | RIPK1 | no | [...] | yes
Caspase-8 | CASP8 | no | [...] | no
Serine/threonine-protein kinase PAK 1 | PAK1 | no | [...] | yes
Bcl-2-like protein 1 | BCL2L1 | no | [...] | no
Induced myeloid leukemia cell differentiation protein Mcl-1 | MCL1 | no | [...] | no
Supplementary Table 2 Five predicted PYCARD (PYD And CARD Domain Containing) interactions were tested by Co-IP assays and 2 were confirmed. The first 2 columns show the protein name and gene symbol of predicted interaction partners of PYCARD. Column 3 indicates whether the predicted partner is a d0 protein, column 4 shows the score of the interaction, and the last column shows whether binding to PYCARD was detected by Co-IP assays.
I2D version | Human PPIs | Human proteins
I2D [...] | [...],713 | 9,799
I2D ver [...] (used in analysis) | 114,906 | 14,109
I2D ver [...] | [...],831 | 14,565
Supplementary Table 3 Numbers of experimentally detected human PPIs and proteins in I2D 14. PPIs predicted from interacting orthologs in model organisms are not included.
Domain | InterPro ID | Description | P-Value
Olfactory receptor | IPR | olfactory receptor | 4.84e-152
GPCR, rhodopsin-like, 7TM | IPR | hormone, neurotransmitter and light receptors | 3.55e-76
Krueppel-associated box | IPR | nucleic acid binding | 4.16e-20
Keratin, high sulphur B2 protein | IPR | synthesized during differentiation of hair matrix cells | 2.71e-16
Mammalian taste receptor | IPR | taste receptor | 1.06e-09
Zinc finger, C2H2-type/integrase, DNA-binding | IPR | nucleic acid binding | 1.48e-09
Major facilitator superfamily, general substrate transporter | IPR | secondary membrane transport | 9.20e-06
Zinc finger, C2H2-like | IPR | DNA-binding motif in eukaryotic transcription factors | 6.03e-05
Cadherin, N-terminal | IPR | Ca2+-dependent cell-cell adhesion | 2.20e-4
GAGE | IPR | exclusively in humans; unknown function; implicated in cancers | 3.63e-4
UDP-glucuronosyl/UDP-glucosyltransferase | IPR | transferase activity, transfer of hexosyl groups | 1.70e-3
Major facilitator superfamily MFS-1 | IPR | transmembrane transport | 3.80e-3
Peptidase M12B, ADAM-TS | IPR | metallopeptidase activity; zinc ion binding | 5.23e-3
ADAM-TS Spacer | IPR | metalloendopeptidase activity; implicated in some cancers and inflammatory diseases | 1.01e-2
Cytochrome P450 | IPR | oxidation-reduction | 1.84e-2
Sulfotransferase domain | IPR | sulfotransferase activity | 2.44e-2
Supplementary Table 4 InterPro annotations enriched among interactome orphans. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR.

PTM | Total proteins with PTM | D0 proteins with PTM | Deficiency P-value
Acetylation | [...] | [...] | [...]e-219
Dephosphorylation | [...] | [...] | [...]e-20
Disulfide Bridge | [...] | [...] | [...]e-22
Glycosylation | [...] | [...] | [...]e-54
Methylation | [...] | [...] | [...]e-15
Myristoylation | [...] | [...] | [...]e-08
Palmitoylation | [...] | [...] | [...]e-09
Phosphorylation | [...] | [...] | [...]e-308
Prenylation | [...] | [...] | [...]e-308
Proteolytic Cleavage | [...] | [...] | [...]e-61
S-Nitrosylation | [...] | [...] | [...]e-15
Sumoylation | [...] | [...] | [...]e-308
Ubiquitination | [...] | [...] | [...]e-308
Supplementary Table 5 D0 proteins are annotated with fewer post-translational modifications (PTMs) than other proteins.
Shown are the 13 most frequent PTMs of human proteins (column 1), the numbers of human proteins with PTMs (column 2), the corresponding numbers of d0 proteins (column 3), and P-values characterizing the deficiency of PTMs among d0 proteins. P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR.
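The captions state that P-values were "adjusted for multiple testing using FDR" without naming the procedure. The sketch below assumes the Benjamini-Hochberg step-up procedure, which is a common choice but is not confirmed by this supplement; the function name `benjamini_hochberg` and the example P-values are illustrative only.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order.

    Assumption: this is one common FDR adjustment; the supplement does not
    say which procedure was actually used.
    """
    m = len(pvalues)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative raw P-values (not taken from the tables above).
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

For these four illustrative values the adjusted P-values are 0.02, 0.04, 0.04, and 0.02, in input order.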
Feature of d0 proteins d0 d0 Rual05 d0 CCSB HI
Protein Age: < 20M yrs 2.56e e e-01
Protein Age: 20M - 90M yrs 6.91e e e-01
Protein Age: 90M - 320M yrs 1.97e e e-01
Protein Age: 320M - 800M yrs 2.71e e e-04
GO CC: extracellular region 5.62e e e-19
GO MF: signal transducer 8.69e e e-12
GO MF: receptor 2.35e e e-29
GO MF: nucleic acid binding
GO MF: transporter
GO BP: lipid metabolism
GO BP: transmembrane transport
GO BP: carbohydrate metabolism
SCOP structural class: membrane
High tissue specificity
Short protein length
Low protein disorder
1.06e e e e e e e e e e e e e e e e e e e e e e e e e e e-27
Supplementary Table 6 Interaction detection methods are partly responsible for biases of d0 proteins. To investigate the link between detection methods and d0 biases, we analyzed two high-throughput (HT) human Y2H screens: Rual et al. 15 and the Center for Cancer Systems Biology Human Interactome (CCSB HI). For each screen we defined a set of degree 0 proteins comprising proteins that were tested in the screen but for which no interactions were detected. We then tested whether features enriched in d0 proteins (column 2) were also enriched in degree 0 proteins from Rual et al. 15 (column 3) and from CCSB HI (column 4). For most features (e.g., Protein Age: < 20M yrs), P-values were calculated using hypergeometric probability. For three features (high tissue specificity, short protein length, and low protein disorder), P-values were calculated using the Mann-Whitney U test and indicate whether degree 0 proteins had significantly lower (higher) values than proteins with degrees > 0 in the same screen(s) (e.g., column 3 compares proteins with degrees = 0 to proteins with degrees > 0 in Rual et al. 15). P-values were adjusted for multiple testing using FDR.
Feature of d0 proteins d0 d0 Direct d0 Indirect
Protein Age: < 20M yrs 2.56e e e-06
Protein Age: 20M - 90M yrs 6.91e e e-16
Protein Age: 90M - 320M yrs 1.97e e e-81
Protein Age: 320M - 800M yrs 2.71e e e-17
GO CC: extracellular region 5.62e e e-31
GO MF: signal transducer 8.69e e e-33
GO MF: receptor 2.35e e e-42
GO MF: nucleic acid binding 1.06e e E+00
GO MF: transporter 2.27e e e-07
GO BP: lipid metabolism 7.49e e e-05
GO BP: transmembrane transport 5.58e e e-05
GO BP: carbohydrate metabolism 2.52e e e-03
SCOP structural class: membrane 3.62e e e-54
High tissue specificity 1.75e e e-31
Short protein length 1.78e e e-18
Low protein disorder 1.29e e e-70
Supplementary Table 7 Biases of d0 proteins remain unchanged if the known human interactome is restricted to PPIs from methods that detect direct binding (e.g., Y2H) or to PPIs from methods that detect protein complexes (e.g., Co-IP). Our study defined d0 proteins as proteins absent from the known interactome, represented by the I2D database. PPI databases such as I2D include interactions from methods that detect direct binding or protein complexes. Complexes are represented as binary interactions using a spoke model, which assumes that the bait interacts with all complex members. Thus, our d0 proteins had neither direct nor indirect (i.e., spoke model) evidence for interaction. Biases of d0 proteins, along with corresponding P-values, are shown in columns 1 and 2. We investigated whether these biases would remain unchanged if the known interactome comprised a subset of PPIs: ones with evidence for direct binding, or ones based on spoke models of detected complexes. Column 3 shows P-values of proteins, D0 Direct, absent from the set of directly binding PPIs. Column 4 shows P-values of proteins, D0 Indirect, absent from the set of PPIs based on spoke models. Most biases remain
significant when the definition of the known human interactome is altered. P-values were calculated by hypergeometric and Mann-Whitney U tests, and adjusted for multiple testing using FDR.

Model organism | Human genes without 1:1 orthologs | Human d0 genes without 1:1 orthologs | Deficiency P-value
Yeast (S. cerevisiae) | [...] (78%) | 6446 (89%) | 1.26e-174
Worm (C. elegans) | [...] (57%) | 5363 (74%) | 4.20e-305
Fly (D. melanogaster) | [...] (53%) | 5228 (72%) | 1.00e-308
Mouse (M. musculus) | 3007 (15%) | 2241 (31%) | 1.00e-308
Rat (R. norvegicus) | 3739 (19%) | 2452 (34%) | 1.00e-308
Supplementary Table 8 D0 proteins are less likely to have orthologs in model organisms than other human proteins. The above table shows five model organisms (column 1), the total number of human proteins without 1:1 orthologs in these organisms (column 2), corresponding numbers of d0 proteins (column 3), and P-values characterizing the deficiency of d0 proteins with model organism orthologs (column 4). P-values were calculated by hypergeometric tests and adjusted for multiple testing using FDR.

Feature of d0 proteins d0 d1-d4 d5-d15
Protein Age: < 20M yrs 2.56e e e-01
Protein Age: 20M - 90M yrs 6.91e e e-01
Protein Age: 90M - 320M yrs 1.97e e e-01
Protein Age: 320M - 800M yrs 2.71e e e-01
GO CC: extracellular region 5.62e e e-01
GO MF: signal transducer 8.69e e e+00
GO MF: receptor 2.35e e e+00
GO MF: nucleic acid binding 1.06e e e+00
GO MF: transporter 2.27e e e+00
GO BP: lipid metabolism 7.49e e e+00
GO BP: transmembrane transport 5.58e e e+00
GO BP: carbohydrate metabolism 2.52e e e+00
SCOP structural class: membrane 3.62e e e-01
High tissue specificity 8.75e e e-01
Short protein length 1.78e e e-01
Low protein disorder 6.47e e e+00
Supplementary Table 9 Human proteins with few known interactions (degrees 1 through 4) have most of the same biases as d0 proteins, while proteins with higher degrees have few such biases.
The above table shows features enriched in d0 proteins (column 1), and enrichment P-values of these features in d0 proteins (column 2), proteins with degrees 1 through 4 (column 3), proteins with degrees 5 through 15 (column 4), and proteins with degrees 16 through 1651 (column 5). Most features (18/22) enriched in d0 proteins are also enriched in proteins with degrees 1 through 4; only 1 feature is enriched in proteins with higher degrees. P-values were calculated by hypergeometric and Mann-Whitney U tests, and adjusted for multiple testing using FDR.
Feature of d0 proteins d0 d0 Fp60
Protein Age: < 20M yrs 2.56e e-01
Protein Age: 20M - 90M yrs 6.91e e-01
Protein Age: 90M - 320M yrs 1.97e e-10
Protein Age: 320M - 800M yrs 2.71e e-07
GO CC: extracellular region 5.62e e-04
GO MF: signal transducer 8.69e e+00
GO MF: receptor 2.35e e-01
GO MF: nucleic acid binding 1.06e e+00
GO MF: transporter 2.27e e-03
GO BP: lipid metabolism 7.49e e-01
GO BP: transmembrane transport 5.58e e-01
GO BP: carbohydrate metabolism 2.52e e-01
SCOP structural class: membrane 3.62e e-04
High tissue specificity 8.75e e-01
Short protein length 1.78e e-01
Low protein disorder 6.47e e-12
Supplementary Table 10 D0 proteins with predicted high-confidence interactions have many of the same biases as other d0 proteins. The above table shows features enriched in d0 proteins (column 1), enrichment P-values among all d0 proteins (column 2), and enrichment P-values among d0 proteins in the Fp60 network (i.e., d0 proteins with high-confidence predicted interactions).

Supplementary Note 1
1. Features of individual proteins: description and sources
Features of individual proteins comprised protein domains, post-translational modifications (PTMs), structural-chemical protein features, and Gene Ontology annotations (cellular component, molecular function, and biological process). Domains were obtained from InterPro 28, version 21.0, and UniProt 8, release [...]. Post-translational modifications were obtained from UniProt 8, release 15.0, and the Human Protein Reference Database (HPRD) 29, release 8.0. The domains and PTMs of a protein were represented by a binary vector; each entry in the vector indicated whether a given domain or PTM was present in the protein. Structural-chemical features were determined from protein sequence by two programs: PSIPRED 30,31, version 26, and the pepstats application from the European Molecular Biology Open Software Suite 32.
PSIPRED 30,31 was used to predict the fraction of a protein's residues in disordered regions, alpha helices, beta sheets and coils. Pepstats was used to calculate 11 chemical features: charge, isoelectric point, and the molar percent of each physico-chemical class of amino acid. Each feature calculated by PSIPRED 30,31 and pepstats was discretized into 7 intervals. An interval represented a range of percentiles; for example, the first interval contained values between the 0th and 2.5th percentiles. The 7 intervals were defined as follows: [0%,2.5%], [0%,10%], [0%,40%], [40%,60%], [60%,100%], [90%,100%], [97.5%,100%]. The intervals were overlapping, with the goal of capturing different levels of low and high values as well as intermediate values. Human Gene Ontology annotations were downloaded from the Gene Ontology 33 website on Apr. 21, [...]. Proteins were annotated with the cellular component, molecular function, and biological process terms specified in the downloaded file, as well as ancestors of these terms.
2. Calculating interaction scores from features of individual proteins
To predict PPIs from features of individual proteins, we identified pairs of features enriched among known interacting protein pairs (such that each protein in an interacting pair has one of the features), and used these feature pairs as rules for predicting interactions. This approach was originally proposed by Sprinzak et al. 34 and has been used to predict PPIs based on pairs of domains 2,3,34 and pairs of post-translational modifications 3 enriched among known interacting protein pairs. The main aspects of the approach have remained similar across different studies: a single feature type is chosen (e.g., protein domains), a pair of features of this type (one feature on each protein) provides evidence of interaction, and the strength of the evidence is proportional to the enrichment of the feature pair among known interactions. To make predictions more comprehensive, less biased and ultimately more accurate, we extended this approach by considering three possibilities: (1) an interaction could require the presence of several features in a protein, (2) the required features could be of the same or of different types, and (3) the presence of particular features in a pair of proteins could provide evidence for or against interaction. To take these possibilities into account, we made 3 changes to the original approach:
1. we identified sets of features, of the same or different types, which co-occur in proteins;
2. we identified pairs of feature sets that were either enriched or deficient among known PPIs;
3. we filtered resulting feature set pairs to reduce redundancy and improve prediction accuracy.
2.1. Identifying sets of co-occurring features
Interactions in which a protein participates may require the presence of several features (e.g., multiple domains); we assume that a group of such features is likely to co-occur in proteins more frequently than expected by chance.
To identify such feature sets we implemented a data-mining algorithm based on frequent pattern growth 35, a method that finds all sets of features that occur at least k times in a dataset, where k is a user-specified threshold. Input to a frequent pattern growth algorithm consists of records, where each record is a set of features. Our input comprised records for 19,698 proteins; each record contained features of a single protein, such as domains and PTMs. Output from frequent pattern growth consists of feature sets that are subsets of at least k records. For example, with a setting of k = 3, we identified the feature set {myb DNA binding domain, Homeodomain-related}, which meant that these two domains occurred together in at least 3 proteins. Feature sets that have at least k occurrences are referred to as frequent feature sets, and their number of occurrences is referred to as support.
To reduce the search time and output of frequent pattern growth, we made several modifications to the algorithm:
(1) A feature set was discarded if its support was similar to that of its subsets, i.e., at least 80% of a subset's support. Such feature sets were not recorded and their supersets were not considered. This was done because a set with similar support to a subset would provide similar information about interaction: the two sets would be present on most of the same proteins, and likely in many of the same interacting protein pairs.
(2) Whenever a feature f was considered as an extension for feature set FS, the support of FS ∪ f had to be substantially higher than expected by chance. The expected support of FS ∪ f was calculated as:
$$ \mathrm{exp}(FS \cup f) = \frac{sup(FS) \cdot sup(f)}{N} $$
where N was the number of records in the dataset. To retain feature set FS ∪ f and expand it further, the following criteria had to be met: (1)
(2)
(3) When feature f was considered as an extension for set FS, the probability of sup(FS ∪ f) being equal to or greater than its observed value had to be less than a minimum threshold. This probability was calculated with the hypergeometric distribution, H(N, M, n, m), using the following parameters: N = total records (i.e., number of proteins), M = sup(f), n = sup(FS), and m = sup(FS ∪ f). If the cumulative probability was > 0.05, feature set FS ∪ f was not recorded and its supersets were not considered.
After identifying feature sets we annotated all proteins with both their original features and with feature sets. No distinction was made between the two types of annotations; original features were considered as feature sets of length 1. We refer to the feature sets of a protein simply as features.
2.2. Identifying pairs of enriched and deficient feature sets
For all annotated proteins we determined the support of features and feature pairs among training cases (interacting and non-interacting protein pairs). For each feature, i, we determined support values possup_i and negsup_i among positive and negative training cases, respectively. Support was defined as the number of protein pairs where at least one of the proteins was annotated with the feature. Similarly, for each pair of features, (i, j), we calculated support values possup_ij and negsup_ij among positive and negative cases, respectively. In this case, support was defined as the number of protein pairs where one of the proteins had feature i and the other had feature j. For each feature pair, (i, j), we used support values to calculate several measures quantifying enrichment or deficiency of the pair among positive training cases. Two measures, rpos and ppos, represented enrichment or deficiency relative to the expected occurrences of (i, j) among positive cases. We defined the expected occurrences of (i, j) among positive cases as:
$$ \mathrm{exp}_{pos}(i,j) = \frac{possup_i \cdot possup_j}{npos} \qquad (3) $$
where npos is the number of positive training cases.
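The hypergeometric check H(N, M, n, m) used when extending a feature set can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `hypergeom_upper_tail` and `keep_extension` are hypothetical, and the 0.05 cutoff is the one quoted in the text.

```python
from math import comb


def hypergeom_upper_tail(N, M, n, m):
    """P(X >= m) when drawing n records from N, of which M carry the feature."""
    total = comb(N, n)
    upper = min(M, n)
    # math.comb returns 0 when the lower index exceeds the upper index,
    # so impossible outcomes contribute nothing to the sum.
    favourable = sum(comb(M, k) * comb(N - M, n - k) for k in range(m, upper + 1))
    return favourable / total


def keep_extension(n_records, sup_f, sup_fs, sup_fs_with_f, alpha=0.05):
    """Keep FS extended by f only if the observed co-occurrence is unlikely by chance."""
    p = hypergeom_upper_tail(n_records, sup_f, sup_fs, sup_fs_with_f)
    return p <= alpha


# Toy numbers: 10 records, feature f on 4 of them, FS on 3, both on 2.
p_toy = hypergeom_upper_tail(10, 4, 3, 2)
```

With the toy numbers the tail probability is 40/120 = 1/3, so the extension would not be retained at the 0.05 cutoff.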
rpos was the ratio between the observed and the expected support:
$$ rpos = \frac{possup_{ij}}{\mathrm{exp}_{pos}(i,j)} \qquad (4) $$
where exp_pos(i, j) denotes the expected occurrences of (i, j) among positive cases. If the value of rpos was greater than 1, ppos was the probability of possup_ij being greater than or equal to its observed value, given the values of possup_i and possup_j. If rpos was less than 1, ppos was the probability of possup_ij being less than or equal to its observed value. ppos was calculated by the cumulative hypergeometric distribution with the settings N = npos, M = possup_i, n = possup_j, m = possup_ij. Two other measures, rall and pall, represented enrichment or deficiency of the feature pair (i, j) among positive cases, relative to the expected occurrences of (i, j) among negative cases. The number of expected occurrences among negative cases was defined as:
[...] (5)
where nneg is the number of negative training cases. rall was defined as:
[...] (6)
rall > 1 indicated enrichment of (i, j) among positive cases, while rall < 1 indicated deficiency. pall was defined as the cumulative hypergeometric probability with parameters N = nall, M = npos, n = possup_ij + negsup_ij, m = possup_ij, where nall is the total number of training cases. If rall was greater than 1, the right tail of the distribution was used; otherwise, the left tail was used. Feature pairs were considered to be enriched or deficient among positive cases if their values of ppos and pall were less than [...].
2.3. Calculating interaction scores from feature pairs
A set, S, of enriched and deficient feature pairs was used to determine interaction scores. To calculate an interaction score for a protein pair (P_a, P_b), feature pairs (i, j) from S were selected such that i was a feature of one protein, and j was a feature of the other. Among the selected feature pairs, the pairs with the highest and lowest rall values were identified. These pairs, fp_max and fp_min, provided the strongest evidence for and against interaction, respectively. Their rall values, rall_max and rall_min, were used to set the interaction score as follows:
[...] (7)
2.4. Filtering feature sets
Using feature sets of length 1 resulted in a large number of redundant feature pairs: multiple feature pairs present in largely the same protein pairs. The presence of such feature pairs lowered prediction accuracy. This happened when feature pairs predicted the same true positive cases, but different false positive cases. For example, feature pairs (i, j) and (k, l) could each be fairly accurate, predicting ntp true positives and nfp false positives with nfp = 1/5 ntp. However, if they predict the same true positives but different false positives, then using them together results in ntp true positives and 2 × nfp false positives. Combining more such feature pairs would give a linear increase in the number of false positives.
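The arithmetic behind this redundancy argument can be checked with a short Python sketch. The function name `combined_precision` and the counts (100 true positives, 20 false positives per feature pair) are illustrative assumptions; only the ratio nfp = 1/5 ntp comes from the example in the text.

```python
def combined_precision(n_tp, n_fp, n_pairs):
    """Precision when n_pairs redundant feature pairs are used together.

    Assumes the worst case described in the text: every feature pair predicts
    the same n_tp true positives but its own, non-overlapping n_fp false positives.
    """
    return n_tp / (n_tp + n_pairs * n_fp)


n_tp, n_fp = 100, 20          # illustrative counts with n_fp = n_tp / 5
for n_pairs in (1, 2, 5):
    false_positives = n_pairs * n_fp
    precision = combined_precision(n_tp, n_fp, n_pairs)
    print(n_pairs, false_positives, round(precision, 3))
```

Under these assumptions, false positives grow linearly with the number of redundant feature pairs while true positives stay fixed, so precision falls from about 0.83 with one pair to 0.5 with five.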
To reduce this problem we ensured that only one feature pair could gain support from a positive training case. This was implemented with a greedy set cover algorithm. All feature pairs from S with rall > 1 were placed in a set P. The feature pair with the highest rall value, fp_max, was identified and moved from P to a set Q. Positive training cases where fp_max was present were identified. If a feature pair in P was present in k of these cases, its possup value was lowered by k. Its rall value was recalculated, and if the rall value reached 1, the feature pair was removed from P. These steps were repeated until set P was empty. Interaction scores for test cases were then determined from feature pairs in Q and feature pairs in S that had rall values less than 1.
3. Features of protein pairs
Features of protein pairs consisted of information about interacting orthologs, paralogs, gene co-expression and network topology. Information about interacting orthologs was taken from I2D 14, version 1.95, on Apr. 3, [...]. For a given protein pair, there were 5 interaction scores based on orthology information. If human proteins (i, j) had interacting orthologs (i', j') in a model organism, sequence identities were determined between i and i', and between j and j'. The lower of these sequence identities was used as a score for proteins (i, j). Such scores were determined based on 5 model organisms: mouse, rat, fly, worm, and yeast. Sequence identities were obtained from Ensembl BioMart (Ensembl Genes 63). A score based on paralogy data was determined in a similar way: if proteins (i, j) had interacting paralogs (i', j') in the training data, the lower of the two sequence identities, I(i, i') and I(j, j'), was used as a score. Sequence identities were obtained from Ensembl BioMart (Ensembl Genes 63). Gene co-expression information was based on 10 gene expression datasets from the Gene Expression Omnibus 36: GDS596, GDS1221, GDS1289, GDS1329, GDS1618, GDS1730, GDS2250, GDS2545, GDS2780, and GDS2842. Each of these datasets contained over 8,000 genes measured in at least 15 samples. Each dataset was processed by the MAS 5.0 algorithm, using the affy package (version [...]) in R (version 2.8) 37. Expression levels of each sample were mean-centered and levels from multiple probe sets for the same gene were averaged. Using each dataset, Pearson correlation coefficients were calculated for all available gene pairs; these correlations were used as interaction scores. Topology information was based on a PPI network comprising positive training cases. For a given protein pair (a, b), three interaction scores were calculated based on the known interactors of proteins a and b.
The first score was simply the number of interactors shared by the two proteins, |I_c|, where I_c is the set of shared interactors of proteins a and b. The second score, from Scott et al. 3, adjusted the number of shared interactors by the degrees of proteins a and b:
[...]
where E_c is the set of edges from proteins a and b to their shared interactors, E_a is the set of edges of protein a, and E_a \ E_c is the set E_a minus the set E_c. Proteins a and b received a high score if most of their interactors were the same. The third score, S_pshared, estimated the probability that proteins a and b would share at least their observed k common neighbors. This estimate depended on 4 variables: the degrees of a and b, the number of shared neighbors and the degrees of the shared neighbors. To derive this estimate we started with the simplifying assumption that all neighbors of a and b had equal degrees. We defined a such that deg(a) [...] deg(b) and estimated the probability of exactly k shared neighbors as follows:
[...] (8)
[...] (9)
[...] (10)
where P( ) is the probability of an interaction between a and a neighbor of b, n_b. We defined P( ) as:
[...] (11)
If degrees are not equal, the probability of k shared neighbors depends on the degrees of these neighbors. Since we were interested in the probability of the observed shared neighbors, N_ba, we defined our required probability, P, as follows: the probability of a and b sharing at least k neighbors, with degrees similar to or lower than those in the set N_ba. Before calculating this probability we defined three sets containing neighbors of b: set N_b containing all neighbors of b, set [...] containing neighbors not shared with protein a, and the previously mentioned set N_ba containing neighbors shared with protein a. We estimated P for exactly k shared neighbors as follows:
[...] (12)
where [...] is the maximum degree of neighbors in the set N_ba, and
[...] (13)
where [...] is the probability of an interaction between a and the j-th neighbor in set [...].
4. Calculating a probability of interaction from interaction scores
For a given protein pair, (i, j), we calculated a single probability of interaction based on interaction scores. This was done in three steps: first, each score was used to calculate a probability of interaction; second, these probabilities were combined into a single probability using a noisy-or model; and lastly, a final probability of interaction was calculated, taking into account the distribution of noisy-or probabilities among training cases and the frequency of interactions among human protein pairs. In the first step, for each score, s_(i,j),k, a probability of interaction was calculated as
[...] (14)
In the second step, probabilities from all scores were integrated into a single probability using a noisy-or model 38,39:
$$ P_{(i,j),\mathrm{noisy\text{-}OR}} = 1 - \prod_{k=1}^{n} \left(1 - P_{(i,j),k}\right) \qquad (15) $$
where n is the number of scores and P_(i,j),k is the probability of interaction calculated from score k. In the third step a final probability of interaction was calculated as
[...] (16)
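The noisy-or combination in the second step can be sketched in Python as follows. This assumes the standard noisy-or form without a leak term, 1 minus the product of (1 - p_k) over the per-score probabilities; the function name `noisy_or` and the example probabilities are illustrative and not taken from the supplement.

```python
from math import prod


def noisy_or(probabilities):
    """Combine per-score interaction probabilities with a noisy-or model.

    Assumption: standard noisy-or without a leak term, i.e. the pair is
    predicted to interact unless every individual score "fails" independently.
    """
    return 1.0 - prod(1.0 - p for p in probabilities)


# Two scores that each give a 50% probability combine to 75%.
combined = noisy_or([0.5, 0.5])
```

A consequence of this form is that a single strong score dominates: if any per-score probability is close to 1, the combined probability is close to 1 regardless of the other scores.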
Lastly, this probability was adjusted to account for the fact that the frequency of positive cases in the training data, 1:100, is likely higher than the frequency of interactions among human protein pairs, which we assumed to be 1:600. We viewed this adjustment as recalculating the posterior probability using a prior of 1/601 rather than 1/101; we refer to the unadjusted probability as P100 and to the adjusted probability as P600. First, we calculated a likelihood ratio, LR, based on P100. Next, we calculated P600 based on LR and a prior probability of 1/601.

References

1. D'Haeseleer, P. & Church, G. M. Estimating and improving protein interaction error rates. Proc IEEE Comput Syst Bioinform Conf (2004).
2. Rhodes, D. R. et al. Probabilistic model of the human protein-protein interaction network. Nat Biotechnol 23, (2005).
3. Scott, M. S. & Barton, G. J. Probabilistic prediction and ranking of human protein-protein interactions. BMC Bioinformatics 8, 239 (2007).
4. Elefsinioti, A. et al. Large-scale de novo prediction of physical protein-protein association. Mol Cell Proteomics 10, M (2011).
5. Zhang, Q. C. et al. Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature 490, (2012).
6. Imming, P., Sinning, C. & Meyer, A. Drugs, their targets and the nature and number of drug targets. Nat Rev Drug Discov 5, (2006).
7. Andreeva, A. et al. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res 36, D (2008).
8. The UniProt Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 38, D142-8 (2010).
9. Ward, J. J., Sodhi, J. S., McGuffin, L. J., Buxton, B. F. & Jones, D. T. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 337, (2004).
10. Su, A. I. et al. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A 101, (2004).
11. Roth, R. B. et al. Gene expression analyses reveal molecular relationships among 20 regions of the human CNS. Neurogenetics 7, (2006).
12. Wang, E. T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, (2008).
13. Krupp, M. et al. RNA-Seq Atlas: a reference database for gene expression profiling in normal tissue by next-generation sequencing. Bioinformatics 28, (2012).
14. Brown, K. R. & Jurisica, I. Unequal evolutionary conservation of human protein interactions in interologous networks. Genome Biol 8, R95 (2007).
15. Rual, J. F. et al. Towards a proteome-scale map of the human protein-protein interaction network. Nature 437, (2005).
16. Barrios-Rodiles, M. et al. High-throughput mapping of a dynamic signaling network in mammalian cells. Science 307, (2005).
17. Zhu, H. et al. Global analysis of protein activities using proteome chips. Science 293, (2001).
18. Behrends, C., Sowa, M. E., Gygi, S. P. & Harper, J. W. Network organization of the human autophagy system. Nature 466, (2010).
19. Bouwmeester, T. et al. A physical and functional map of the human TNF-alpha/NF-kappa B signal transduction pathway. Nat Cell Biol 6, (2004).
20. Glatter, T., Wepf, A., Aebersold, R. & Gstaiger, M. An integrated workflow for charting the human interaction proteome: insights into the PP2A system. Mol Syst Biol 5, 237 (2009).
21. Hutchins, J. R. et al. Systematic analysis of human protein complexes identifies chromosome segregation proteins. Science 328, (2010).
22. Jeronimo, C. et al. Systematic analysis of the protein interaction network for the human transcription machinery reveals the identity of the 7SK capping enzyme. Mol Cell 27, (2007).
23. Jorgensen, C. et al. Cell-specific information processing in segregating populations of Eph receptor ephrin-expressing cells. Science 326, (2009).
24. Sowa, M. E., Bennett, E. J., Gygi, S. P. & Harper, J. W. Defining the human deubiquitinating enzyme interaction landscape. Cell 138, (2009).
25. Xiao, K. et al. Functional specialization of beta-arrestin interactions revealed by proteomic analysis. Proc Natl Acad Sci U S A 104, (2007).
26. Bandyopadhyay, S. et al. A human MAP kinase interactome. Nat Methods 7, (2010).
27. Wang, J. et al. Toward an understanding of the protein interaction network of the human liver. Mol Syst Biol 7, 536 (2011).
28. Hunter, S. et al. InterPro: the integrative protein signature database. Nucleic Acids Res 37, D211-5 (2009).
29. Keshava Prasad, T. S. et al. Human Protein Reference Database update. Nucleic Acids Res 37, D (2009).
30. Bryson, K. et al. Protein structure prediction servers at University College London. Nucleic Acids Res 33, W36-8 (2005).
31. Jones, D. T. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292, (1999).
32. Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 16, (2000).
33. The Gene Ontology Consortium. Creating the gene ontology resource: design and implementation. Genome Res 11, (2001).
34. Sprinzak, E. & Margalit, H. Correlated sequence-signatures as markers of protein-protein interaction. J Mol Biol 311, (2001).
35. Han, J., Pei, J. & Yin, Y. Mining frequent patterns without candidate generation (2000).
36. Barrett, T. et al. NCBI GEO: mining millions of expression profiles, database and tools. Nucleic Acids Res 33, D562-6 (2005).
37. R Development Core Team. R: A Language and Environment for Statistical Computing. (2008).
38. Kim, J. H. & Pearl, J. A computational model for causal and diagnostic reasoning in inference system. IJCAI (1983).
39. Parsons, S. & Bigham, J. Possibility theory and the generalised Noisy OR model. in Proc. 6th Int. Conf. Inf. Process. Manag. Uncertain (1996).
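The adjustment from P100 to P600 described at the end of section 4 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' exact formulas, which are not reproduced in this text: it assumes the standard odds form of Bayes' rule, in which the likelihood ratio is the posterior odds divided by the prior odds. The function names are hypothetical.

```python
def likelihood_ratio(posterior, prior):
    """Likelihood ratio implied by a posterior probability and its prior.

    Assumes the odds form of Bayes' rule:
    posterior odds = LR * prior odds.
    """
    posterior_odds = posterior / (1.0 - posterior)
    prior_odds = prior / (1.0 - prior)
    return posterior_odds / prior_odds


def posterior_from_lr(lr, prior):
    """Posterior probability obtained from a likelihood ratio and a prior."""
    odds = lr * prior / (1.0 - prior)
    return odds / (1.0 + odds)


# Hypothetical unadjusted probability computed with a 1:100 training ratio.
p100 = 0.5
lr = likelihood_ratio(p100, prior=1 / 101)   # prior odds of 1:100
p600 = posterior_from_lr(lr, prior=1 / 601)  # prior odds of 1:600

print(round(lr, 6))    # 100.0
print(round(p600, 4))  # 0.1429
```

Under this reading, a pair scored at P100 = 0.5 has a likelihood ratio of 100 and an adjusted probability of 1/7, so the lower assumed frequency of true interactions reduces every prediction's probability without changing the ranking of pairs.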
More informationGenome-Scale Gene Function Prediction Using Multiple Sources of High-Throughput Data in Yeast Saccharomyces cerevisiae ABSTRACT
OMICS A Journal of Integrative Biology Volume 8, Number 4, 2004 Mary Ann Liebert, Inc. Genome-Scale Gene Function Prediction Using Multiple Sources of High-Throughput Data in Yeast Saccharomyces cerevisiae
More informationMatrix-based pattern discovery algorithms
Regulatory Sequence Analysis Matrix-based pattern discovery algorithms Jacques.van.Helden@ulb.ac.be Université Libre de Bruxelles, Belgique Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe)
More informationImproved network-based identification of protein orthologs
BIOINFORMATICS Vol. 24 ECCB 28, pages i2 i26 doi:.93/bioinformatics/btn277 Improved network-based identification of protein orthologs Nir Yosef,, Roded Sharan and William Stafford Noble 2,3 School of Computer
More information