Using graphs to relate expression data and protein-protein interaction data

Using graphs to relate expression data and protein-protein interaction data R. Gentleman and D. Scholtens October 31, 2017 Introduction In Ge et al. (2001) the authors consider an interesting question. They assemble gene expression data from a yeast cell-cycle experiment (Cho et al., 1998), literature proteinprotein interaction (PPI) data and yeast two-hybrid data. We have curated the data slightly to make it simpler to carry out the analyses reported in Ge et al. (2001). In particular we reduced the data to the 2885 genes that were common to all experiments. We have also represented most of the data in terms of graphs (nodes and edges) since that is the form we will use for most of our analyses. The results reported in this document are based on version 0.25.0 of the yeastexpdata data package. 0.1 Graphs The cell-cycle data is stored in ccyclered and from this it is easy to create a clustergraph. A cluster graph is simply a graph that is created from clustered data. In this graph all genes in the same cluster have edges between them and there are no between cluster edges. clusts = split(ccyclered[["y.name"]], + ccyclered[["cluster"]]) cg1 = new("clustergraph", + clusters = lapply(clusts, as.character)) ccclust = conncomp(cg1) Next we are interested in looking for associations between the clusters in this graph (genes that show similar patterns of expression) and the literature curated set of proteinprotein interactions. 1

We note that these are not really literature purported protein complexes (readers might want to use the output of a procedure like that in apcomplex applied to TAP data to explore protein complex data). We also note that the data used to create these graphs is already somewhat out of date and that there is new data available from MIPS. We have chosen to continue with these data since they align closely with the report in Ge et al. (2001). data(litg) cclit = connectedcomp(litg) cclens = sapply(cclit, length) table(cclens) cclens 1 2 3 4 5 6 7 8 12 13 36 88 2587 29 10 7 1 1 2 1 1 1 1 1 ccl2 = cclit[cclens1] ccl2 = cclens[cclens1] We see that there are most of the proteins do not have edges to others and that there are a few, rather large sets of connected proteins. We can plot a few of those. sg1 = subgraph(ccl2[[5]], litg) sg2 = subgraph(ccl2[[1]], litg) It is worth noting that the structure in Figures 1 and 2 is suggestive of collections of proteins that form cohesive subgroups that work to achieve particular objectives. An open problem is the development of algorithms (and ultimately software and statistical models) capable of reliably identifying the constituent components. Among the important aspects of such a decomposition is the notion that some proteins will be involved in multiple complexes. 1 Testing Associations The hypothesis presented in? was to investigate a potential correlation between expression clusters and interaction clusters. The devised a strategy called the transcriptomeinteractome correlation mapping strategy to assess the level of dependence. Here we reformulate their ideas (a more extensive discussion with some extensions is provided in Balasubraminian et al. (2004)). 2

YDR382W YER009W YFL039C YLR229C YLR340W YDL127W YER111C YGR109C YGR152C YJL187C YKL042W YKL101W YLR212C YLR313C YMR199W YNL289W YPL256C YPR120C YOR160W YDR388W YJL157C YOR036W YPL031C YCL027W YLL024C YOR027W YOR185C YPL240C YMR294W YNL004W YNL243W YOL039W YOR098C YAL040C YBR200W YGR108W YPL242C YPR119W YCL040W YAL005C YLR216C YBR133C YER114C YDL179W YHR005C YJL194W YLR079W YLR319C YMR109W YDR432W YDR356W YEL003W YHR061C YHR172W YLL021W YNL126W YPL016W YDR085C YHR102W YOL016C YBR160W YMR308C YHL007C YKL068W YAL029C YBR109C YGL016W YLR293C YML065W YLR200W YML094W YMR092C YMR186W YDR309C YHR129C YAL041W YBL016W YBL079W YOR127W YPL174C YDR103W YDR323C YDR184C YHR069C YOL021C YOR181W YBR155W YPL140C Figure 1: One of the larger PPI graphs 3

YOL139C YDR180W YGR162W YDR386W YDL043C YMR117C YDR485C YML104C YMR106C YDL030W YDL013W YMR284W YDL042C YNL031C YDR227W YOR191W YNL030W YBR009C YLR134W YBR010W YIL144W YER179W YDL008W YNL172W YDR146C YBR073W YLR127C YDR118W YBL084C YER095W YDR004W YML032C YDR076W YAR007C YJL173C YNL312W Figure 2: Another of the larger PPI connected components. 4

The basic idea is to represent the different sets of interactions (relationships) in terms of graphs. Each graph is defined on the same set of nodes (the genes/proteins that are common to all experiments) and the specific set of relationships are represented as edges in the appropriate graph. Above we created one graph where the edges represent known (literature based) interactions, litg and the second is based on clustered gene expression data, cg1. We can now determine how many edges are in common between the two graphs by simply computing their intersection. commong = intersection(litg, cg1) commong A graphnel graph with undirected edges Number of Nodes = 2885 Number of Edges = 42 And we can see that there are 42 edges in common. Since the literature graph has 315 and the gene expression graph has 156205 we might wonder whether 42 is remarkably large or remarkably small. One way to test this assumption is to generate an appropriate null distribution and to compare the observed value (42) to the values from this distribution. While there are some reasons to consider a random edge model (as proposed by Erdos and Renyi) there are more compelling reasons to condition on the structure in the graph and to use a node label permutation distribution. This is demonstrated below. Note that this does take quite some time to run though. eperm = function (g1, g2, B=500) { + ans = rep(na, B) + n1 = nodes(g1) + + for(i in 1:B) { + nodes(g1) = sample(n1) + ans[i] = numedges(intersection(g1, g2)) + } + return(ans) + } set.seed(123) ##takes a long time #npdist = eperm(litg, cg1) data(npdist) max(npdist) [1] 23 5

2 Data Analysis Now that we have satisfied our testing curiosity we might want to start to carry out a little exploratory data analysis. There are clearly some questions that are of interest. They include the following: ˆ Which of the expression clusters have intersections and with which of the literature clusters? ˆ Are there expression clusters that have a number of literature cluster edges going between them (and hence suggesting that the expression clustering was too fine, or that the genes involved in the literature cluster are not cell-cycle regulated). ˆ Are there known cell-cycle regulated protein complexes and do the genes involved tend to cluster together? ˆ Is the expression behavior of genes that are involved in multiple literature clusters (or at least that we suspect of being so involved) different from that of genes that are involved in only one cluster? Many of these questions require access to more information. For example, we need to know more about the pattern of expression related to each of the gene expression clusters so that we can try to interpret them better. We need to have more information about the likely protein complexes from the literature data so that we can better identify reasonably complete protein complexes (and given them, then identify those genes that are involved in more than one complex). But, the most important fact to notice is that all of the substantial calculations and computations (given the meta-data) are very simple to put in graph theoretic terms. References R. Balasubraminian, T. LaFramboise, D. Scholtens, and R. Gentleman. Integrating disparate sources of protein protein interaction data. Technical report, Bioconductor, 2004. R.C. Cho et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell, 2:65 73, 1998. H. Ge, Z. Liu, G. M. Church, and M. Vidal. Correlation between transcriptome and interactome mapping data from saccharomyces cerevisiae. Nature Genetics, 29:482 486, 2001. 6