Using graphs to relate expression data and protein-protein interaction data

Similar documents
Hotspots and Causal Inference For Yeast Data

From protein networks to biological systems

CELL CYCLE RESPONSE STRESS AVGPCC. YER179W DMC1 meiosis-specific protein unclear

Causal Model Selection Hypothesis Tests in Systems Genetics: a tutorial

A Re-annotation of the Saccharomyces cerevisiae Genome

GENETICS. Supporting Information

Alterations in DNA Replication and Histone Levels Promote Histone Gene Amplification in Saccharomyces cerevisiae

The Repetitive Sequence Database and Mining Putative Regulatory Elements in Gene Promoter Regions ABSTRACT

Signal recognition YKL122c An01g02800 strong similarity to signal recognition particle 68K protein SRP68 - Canis lupus

Phylogenetic classification of transporters and other membrane proteins from Saccharomyces cerevisiae

Modeling and Visualizing Uncertainty in Gene Expression Clusters using Dirichlet Process Mixtures

DISCOVERING PROTEIN COMPLEXES IN DENSE RELIABLE NEIGHBORHOODS OF PROTEIN INTERACTION NETWORKS

Ascospore Formation in the Yeast Saccharomyces cerevisiae

Linking the Signaling Cascades and Dynamic Regulatory Networks Controlling Stress Responses

Systems biology and biological networks

Finding molecular complexes through multiple layer clustering of protein interaction networks. Bill Andreopoulos* and Aijun An

Comparative Methods for the Analysis of Gene-Expression Evolution: An Example Using Yeast Functional Genomic Data

Introduction to Microarray Data Analysis and Gene Networks lecture 8. Alvis Brazma European Bioinformatics Institute

Causal Graphical Models in Systems Genetics

Adaptation of Saccharomyces cerevisiae to high hydrostatic pressure causing growth inhibition

Detecting temporal protein complexes from dynamic protein-protein interaction networks

Analysis and visualization of protein-protein interactions. Olga Vitek Assistant Professor Statistics and Computer Science

Robust Sparse Estimation of Multiresponse Regression and Inverse Covariance Matrix via the L2 distance

Table S1, Yvert et al.

Genome-wide expression screens indicate a global role for protein kinase CK2 in chromatin remodeling

networks? networks are spaces generated by interaction

Defining the Budding Yeast Chromatin Associated Interactome

Supplemental Table S2. List of 166 Transcription Factors Deleted in Mutants Assayed in this Study ORF Gene Description

Research Article Predicting Protein Complexes in Weighted Dynamic PPI Networks Based on ICSC

Yeast require an Intact Tryptophan Biosynthesis Pathway and Exogenous Tryptophan for Resistance to Sodium Dodecyl Sulfate

Screening the yeast deletant mutant collection for hypersensitivity and hyper-resistance to sorbate, a weak organic acid food preservative

Genome-wide protein interaction screens reveal functional networks involving Sm-like proteins

Using Networks to Integrate Omic and Semantic Data: Towards Understanding Protein Function on a Genome Scale

Combining Microarrays and Biological Knowledge for Estimating Gene Networks via Bayesian Networks

Case story: Analysis of the Cell Cycle

Intersection of RNA Processing and Fatty Acid Synthesis and Attachment in Yeast Mitochondria

Compositional Correlation for Detecting Real Associations. Among Time Series

Clustering gene expression data using graph separators

Supplementary Information

(Fold Change) acetate biosynthesis ALD6 acetaldehyde dehydrogenase 3.0 acetyl-coa biosynthesis ACS2 acetyl-coa synthetase 2.5

Honors Thesis: Supplementary Material. LCC 4700: Undergraduate Thesis Writing

Hierarchical modelling of automated imaging data

Large-scale phenotypic analysis reveals identical contributions to cell functions of known and unknown yeast genes

The yeast interactome (unit: g303204)

Constraint-Based Workshops

Spatio-temporal dynamics of yeast mitochondrial biogenesis: transcriptional and post-transcriptional. mrna oscillatory modules.

How many protein-coding genes are there in the Saccharomyces cerevisiae genome?

Bacterial genome chimaerism and the origin of mitochondria

Bayesian inference for stochastic kinetic models. intracellular reaction networks

Genetic basis of mitochondrial function and morphology in Saccharomyces. Kai Stefan Dimmer, Stefan Fritz, Florian Fuchs, Marlies Messerschmitt, Nadja

Received 17 January 2005/Accepted 22 February 2005

Cytoscape An open-source software platform for the exploration of molecular interaction networks

Localizing proteins in the cell from their phylogenetic profiles

ppendix E - Growth Rates of Individual Genes on Various Non-Fementable Carbon So ORF Gene Lactate Glycerol EtOH Description

Mg 2 Deprivation Elicits Rapid Ca 2 Uptake and Activates Ca 2 /Calcineurin Signaling in Saccharomyces cerevisiae

Growth of Yeast, Saccharomyces cerevisiae, under Hypergravity Conditions

UvA-DARE (Digital Academic Repository)

ACCEPTED. Title: Increased respiration in the sch9 mutant is required for increasing chronological life span but not replicative life span

Identifying Correlations between Chromosomal Proximity of Genes and Distance of Their Products in Protein-Protein Interaction Networks of Yeast

DNA microarrays simultaneously monitor the expression

Origin and Consequences of the Relationship between Protein Mean and Variance

MRP4 MRPS5. function. Their functional link can be checked by double or triple deletion experiments. YDR036C

Protein complexes identification based on go attributed network embedding

MINIREVIEW. Eric M. Rubenstein and Martin C. Schmidt*

FUNPRED-1: PROTEIN FUNCTION PREDICTION FROM A PROTEIN INTERACTION NETWORK USING NEIGHBORHOOD ANALYSIS

BIOINFORMATICS. Predicting protein function from protein/protein interaction data: a probabilistic approach. Stanley Letovsky and Simon Kasif

The Yeast Nuclear Pore Complex: Composition, Architecture, and Transport Mechanism

A network of protein protein interactions in yeast

Guilt by Association of Common Interaction Partners. Limsoon Wong Joint work with Hon Nian Chua & Wing-Kin Sung

Analysis of Yeast s ORF Upstream Regions by Parallel Processing, Microarrays, and Computational Methods

Exploiting Indirect Neighbours and Topological Weight to Predict Protein Function from Protein-Protein Interactions

Dissection of transcriptional regulation networks and prediction of gene functions in Saccharomyces cerevisiae Boorsma, A.

Classification des séquences régulatrices sur base de motifs

Matrix-based pattern matching

Annotating gene function by combining expression data with a modular gene network

Received 29 May 2006/Accepted 8 June 2006

EXTRACTING GLOBAL STRUCTURE FROM GENE EXPRESSION PROFILES

Percentage of genomic features annotated in the S. cerevisiae reference genome covered by different S. cerevisiae assemblies.

Protein import into plant mitochondria: signals, machinery, processing, and regulation

Combining Microarrays and Biological Knowledge for Estimating Gene Networks via Bayesian Networks

BMC 2002, Bioinformatics

Corrections. MEDICAL SCIENCES. For the article Inhibitors of soluble epoxide

Updated catalogue of homologues to human disease-related proteins in the yeast genome

Computational Genomics. Reconstructing dynamic regulatory networks in multiple species

Mitochondrial inheritance and fermentative : oxidative balance in hybrids between Saccharomyces cerevisiae and Saccharomyces uvarum

Clustering of Pathogenic Genes in Human Co-regulatory Network. Michael Colavita Mentor: Soheil Feizi Fifth Annual MIT PRIMES Conference May 17, 2015

reconstructing biological networks from data: part 1 - cmonkey

Adhesion to the yeast cell surface as a mechanism for trapping pathogenic bacteria by Saccharomyces probiotics

6 Septation and Cytokinesis in Fungi

Predicting essential genes in fungal genomes

Introduction to Bioinformatics

Cy, Nu. Pre9 Rpn10 Rpn2 Rpn8 141: Pre10 Pre2 Pre3 Pre6 Pup3 Scl1

Bayesian Hierarchical Self Modeling Warping Regression

Mutational Robustness and Asymmetric Functional Specialization of Duplicate Genes

The role of PCNA as a scaffold protein in cellular signaling is functionally conserved between yeast and humans

Knowledge-based analysis of microarray gene expression data by using support vector machines

Bioinformatics. Transcriptome

Integrating naive Bayes models and external knowledge to examine copper and iron homeostasis in S. cerevisiae

Science histogram was a mistake. This mistake was corrected in this manuscript.

Genome-Scale Gene Function Prediction Using Multiple Sources of High-Throughput Data in Yeast Saccharomyces cerevisiae ABSTRACT

Transcription:

Using graphs to relate expression data and protein-protein interaction data R. Gentleman and D. Scholtens October 31, 2017 Introduction In Ge et al. (2001) the authors consider an interesting question. They assemble gene expression data from a yeast cell-cycle experiment (Cho et al., 1998), literature proteinprotein interaction (PPI) data and yeast two-hybrid data. We have curated the data slightly to make it simpler to carry out the analyses reported in Ge et al. (2001). In particular we reduced the data to the 2885 genes that were common to all experiments. We have also represented most of the data in terms of graphs (nodes and edges) since that is the form we will use for most of our analyses. The results reported in this document are based on version 0.25.0 of the yeastexpdata data package. 0.1 Graphs The cell-cycle data is stored in ccyclered and from this it is easy to create a clustergraph. A cluster graph is simply a graph that is created from clustered data. In this graph all genes in the same cluster have edges between them and there are no between cluster edges. clusts = split(ccyclered[["y.name"]], + ccyclered[["cluster"]]) cg1 = new("clustergraph", + clusters = lapply(clusts, as.character)) ccclust = conncomp(cg1) Next we are interested in looking for associations between the clusters in this graph (genes that show similar patterns of expression) and the literature curated set of proteinprotein interactions. 1

We note that these are not really literature purported protein complexes (readers might want to use the output of a procedure like that in apcomplex applied to TAP data to explore protein complex data). We also note that the data used to create these graphs is already somewhat out of date and that there is new data available from MIPS. We have chosen to continue with these data since they align closely with the report in Ge et al. (2001). data(litg) cclit = connectedcomp(litg) cclens = sapply(cclit, length) table(cclens) cclens 1 2 3 4 5 6 7 8 12 13 36 88 2587 29 10 7 1 1 2 1 1 1 1 1 ccl2 = cclit[cclens1] ccl2 = cclens[cclens1] We see that there are most of the proteins do not have edges to others and that there are a few, rather large sets of connected proteins. We can plot a few of those. sg1 = subgraph(ccl2[[5]], litg) sg2 = subgraph(ccl2[[1]], litg) It is worth noting that the structure in Figures 1 and 2 is suggestive of collections of proteins that form cohesive subgroups that work to achieve particular objectives. An open problem is the development of algorithms (and ultimately software and statistical models) capable of reliably identifying the constituent components. Among the important aspects of such a decomposition is the notion that some proteins will be involved in multiple complexes. 1 Testing Associations The hypothesis presented in? was to investigate a potential correlation between expression clusters and interaction clusters. The devised a strategy called the transcriptomeinteractome correlation mapping strategy to assess the level of dependence. Here we reformulate their ideas (a more extensive discussion with some extensions is provided in Balasubraminian et al. (2004)). 2

YDR382W YER009W YFL039C YLR229C YLR340W YDL127W YER111C YGR109C YGR152C YJL187C YKL042W YKL101W YLR212C YLR313C YMR199W YNL289W YPL256C YPR120C YOR160W YDR388W YJL157C YOR036W YPL031C YCL027W YLL024C YOR027W YOR185C YPL240C YMR294W YNL004W YNL243W YOL039W YOR098C YAL040C YBR200W YGR108W YPL242C YPR119W YCL040W YAL005C YLR216C YBR133C YER114C YDL179W YHR005C YJL194W YLR079W YLR319C YMR109W YDR432W YDR356W YEL003W YHR061C YHR172W YLL021W YNL126W YPL016W YDR085C YHR102W YOL016C YBR160W YMR308C YHL007C YKL068W YAL029C YBR109C YGL016W YLR293C YML065W YLR200W YML094W YMR092C YMR186W YDR309C YHR129C YAL041W YBL016W YBL079W YOR127W YPL174C YDR103W YDR323C YDR184C YHR069C YOL021C YOR181W YBR155W YPL140C Figure 1: One of the larger PPI graphs 3

YOL139C YDR180W YGR162W YDR386W YDL043C YMR117C YDR485C YML104C YMR106C YDL030W YDL013W YMR284W YDL042C YNL031C YDR227W YOR191W YNL030W YBR009C YLR134W YBR010W YIL144W YER179W YDL008W YNL172W YDR146C YBR073W YLR127C YDR118W YBL084C YER095W YDR004W YML032C YDR076W YAR007C YJL173C YNL312W Figure 2: Another of the larger PPI connected components. 4

The basic idea is to represent the different sets of interactions (relationships) in terms of graphs. Each graph is defined on the same set of nodes (the genes/proteins that are common to all experiments) and the specific set of relationships are represented as edges in the appropriate graph. Above we created one graph where the edges represent known (literature based) interactions, litg and the second is based on clustered gene expression data, cg1. We can now determine how many edges are in common between the two graphs by simply computing their intersection. commong = intersection(litg, cg1) commong A graphnel graph with undirected edges Number of Nodes = 2885 Number of Edges = 42 And we can see that there are 42 edges in common. Since the literature graph has 315 and the gene expression graph has 156205 we might wonder whether 42 is remarkably large or remarkably small. One way to test this assumption is to generate an appropriate null distribution and to compare the observed value (42) to the values from this distribution. While there are some reasons to consider a random edge model (as proposed by Erdos and Renyi) there are more compelling reasons to condition on the structure in the graph and to use a node label permutation distribution. This is demonstrated below. Note that this does take quite some time to run though. eperm = function (g1, g2, B=500) { + ans = rep(na, B) + n1 = nodes(g1) + + for(i in 1:B) { + nodes(g1) = sample(n1) + ans[i] = numedges(intersection(g1, g2)) + } + return(ans) + } set.seed(123) ##takes a long time #npdist = eperm(litg, cg1) data(npdist) max(npdist) [1] 23 5

2 Data Analysis Now that we have satisfied our testing curiosity we might want to start to carry out a little exploratory data analysis. There are clearly some questions that are of interest. They include the following: ˆ Which of the expression clusters have intersections and with which of the literature clusters? ˆ Are there expression clusters that have a number of literature cluster edges going between them (and hence suggesting that the expression clustering was too fine, or that the genes involved in the literature cluster are not cell-cycle regulated). ˆ Are there known cell-cycle regulated protein complexes and do the genes involved tend to cluster together? ˆ Is the expression behavior of genes that are involved in multiple literature clusters (or at least that we suspect of being so involved) different from that of genes that are involved in only one cluster? Many of these questions require access to more information. For example, we need to know more about the pattern of expression related to each of the gene expression clusters so that we can try to interpret them better. We need to have more information about the likely protein complexes from the literature data so that we can better identify reasonably complete protein complexes (and given them, then identify those genes that are involved in more than one complex). But, the most important fact to notice is that all of the substantial calculations and computations (given the meta-data) are very simple to put in graph theoretic terms. References R. Balasubraminian, T. LaFramboise, D. Scholtens, and R. Gentleman. Integrating disparate sources of protein protein interaction data. Technical report, Bioconductor, 2004. R.C. Cho et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell, 2:65 73, 1998. H. Ge, Z. Liu, G. M. Church, and M. Vidal. Correlation between transcriptome and interactome mapping data from saccharomyces cerevisiae. Nature Genetics, 29:482 486, 2001. 6