Evolutionary Analysis of Functional Modules in Dynamic PPI Networks


Nan Du, Computer Science and Engineering Department, nandu@buffalo.edu
Jing Gao, Computer Science and Engineering Department, jing@buffalo.edu
Stanley A. Schwartz, Department of Medicine, sasimmun@buffalo.edu
Yuan Zhang, College of Electronic Information and Control Engineering, Beijing University of Technology, Beijing, 100124, China, zhangyuan@emails.bjut.edu.cn
Kang Li, Computer Science and Engineering Department, kli22@buffalo.edu
Supriya D Mahajan, Department of Medicine, smahajan@buffalo.edu
Aidong Zhang, Computer Science and Engineering Department, azhang@buffalo.edu
Bindukumar B Nair, Department of Medicine, bnair@buffalo.edu

ABSTRACT

Functional module detection in Protein-Protein Interaction (PPI) networks is essential to understanding the organization, evolution and interaction of cellular systems. In recent years, most research has focused on detecting functional modules from static PPI networks. However, the structure of a PPI network sometimes changes in response to stimuli, which changes both the composition and the functionality of these modules. These changes occur gradually and can be thought of as an evolution of the functional modules. In our opinion, the evolutionary analysis of functional modules is key to gaining insight into their underlying behaviors, particularly when targeting complex living systems. In this paper, we propose a novel computational framework which integrates a PPI network with multiple dynamic gene coexpression networks to categorize and track the evolutionary pattern of functional modules over consecutive timestamps. We first propose a method to construct dynamic PPI networks, and then design a new functional influence based algorithm to detect the functional modules from these dynamic PPI networks. Based on the results of this approach, we provide a simple but effective method to characterize and track the evolutionary patterns of dynamic modules, which involves detecting evolutionary events between modules found at consecutive timestamps. Extensive experiments on the fermentation process dataset of S. cerevisiae show that the proposed framework not only outperforms previous functional module detection methods, but also efficiently tracks the evolutionary patterns of functional modules.

Categories and Subject Descriptors
J.3 [Life And Medical Sciences]: Biology and Genetics

General Terms
Algorithms

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ACM-BCB '12, October 7-10, 2012, Orlando, FL, USA. Copyright 2012 ACM 978-1-4503-1670-5/12/10...$15.00.
ACM-BCB 2012 250

1. INTRODUCTION

Protein-Protein Interaction (PPI) networks help us systematically analyze the structure of a large living system and also allow us to understand principles like essentiality, protein interactions, functional modules and cellular pathways. The identification of functional modules in PPI networks is of great interest, as it often reveals unknown functional ties between proteins and thus helps in predicting the functions of unknown genes. However, traditional functional module detection approaches treat the PPI network as a static graph, derived either from data fixed at a certain timestamp or from data aggregated over a period. These approaches ignore the temporal evolution of the functional modules, which can offer biologists valuable insights. Without capturing the inherent dynamic characteristics of PPI networks, one may miss the opportunity to capture the evolutionary pattern of functional modules.

Protein-protein interactions are often subjected to external stimuli, which changes the structure of the network during development. These dynamically varying interactions, sometimes referred to as transient interactions, are caused by stimuli that may be either reactive (caused by exogenous factors, such as a response to an environmental stimulus) or programmed (due to endogenous signals, such as cell-cycle dynamics or developmental processes) [23]. The functional modules detected at each timestamp may also evolve regularly as the protein interactions change over time. Specifically, detecting functional module evolution, that is, how a module's functions change over time, provides insight into the underlying behavior of the molecular system. For example, network dynamics can describe how cells respond to environmental cues or how an interaction network changes during development. The temporal evolution of functional modules will also be very useful for monitoring the development and outcome of chronic and genetic diseases. We therefore believe that it is promising to track the evolution of functional modules and proteins in dynamic PPI networks.

In this paper, we propose a framework to categorize and track the evolutionary pattern of functional modules over consecutive timestamps. We begin by constructing a series of dynamic PPI networks based on both the PPI network and the dynamic gene coexpression networks at various timestamps. We then solve the functional module detection problem with a novel functional influence based algorithm which quantifies the influence from one biological component to another.
In addition, the proposed functional module detection method maintains a certain level of module equivalence between consecutive timestamps, the detailed definition of which is discussed in Section 2.2. Finally, we capture complex evolutionary patterns of functional modules over time by analyzing the key evolutionary events among modules at consecutive timestamps.

In summary, our paper makes three main contributions: (i) we propose a novel method to construct dynamic PPI networks by integrating the static PPI network with dynamic gene coexpression networks; (ii) we propose a new functional influence based functional module detection algorithm in which the detected functional modules are allowed to overlap and do not change dramatically over a short time; (iii) we provide a model for tracking the evolutionary process of functional modules over time. To the best of our knowledge, this is the first work to analyze the evolutionary patterns of functional modules over consecutive timestamps.

The rest of the paper is organized as follows. The proposed approach is presented in Section 2. Extensive experimental results are shown in Section 3. Finally, we conclude our work in Section 4.

2. METHOD

We begin by introducing the method for constructing the dynamic PPI networks in Section 2.1. In Section 2.2, we present the functional influence based algorithm used for detecting the functional modules. Finally, the model we use for tracking the evolution of the functional modules is presented in Section 2.3.

2.1 Dynamic PPI Network Construction

Several researchers have worked on integrating static data with dynamic data to discover the temporal evolution of protein interaction networks. Han et al. integrated PPI networks with gene expression data and suggested that some modules are active at specific times and locations [8]. Qi et al. further noted that the integration of a variety of datasets, including binary interactions, protein complexes and expression profiles, enables the identification of subnetworks that are active under certain conditions [17]. In order to discover the temporal evolution of functional modules, we integrate the static PPI network with a series of dynamic gene coexpression networks.

Given a PPI network $P = (V, E)$, where $V$ is a set of proteins and $E$ is a set of interactions between these proteins, let $M^1, M^2, \ldots, M^T$ be a set of $|V| \times n$ gene expression matrices, where $T$ is the number of timestamps and $n$ is the number of samples (replicates) in the experiments. Our goal is to construct $T$ dynamic PPI networks $D^1, D^2, \ldots, D^T$, each of which is a $|V| \times |V|$ matrix. Note that each gene expression matrix $M^i$ ($1 \le i \le T$) and each dynamic PPI network $D^i$ ($1 \le i \le T$) corresponds to a specific timestamp $i$.

Before constructing the dynamic PPI networks, we first need to construct a series of gene coexpression networks $G^1, G^2, \ldots, G^T$. Gene coexpression networks have been used to demonstrate that functionally related genes are frequently coexpressed across multiple datasets and across different organisms [10], and to estimate the underlying regulatory relationships between genes under various experimental conditions [1]. By constructing a specific gene coexpression network at each timestamp, e.g., at the early, intermediate and terminal stages of a certain disease, it is possible to identify disease-mediated changes in the network connectivity patterns. For each gene pair, the absolute Pearson correlation coefficient of their expression profiles across samples is calculated, and the output is a $|V| \times |V|$ correlation matrix which represents the expression similarity between each gene pair. Based on these correlation matrices, we construct the gene coexpression network, where each node is a gene and each edge indicates that the correlation between two genes is greater than a cutoff threshold. This cutoff threshold is used to remove all but the most likely biologically significant relationships, and we choose it based on the average correlation similarity of each correlation matrix.

Combining the static PPI network with time course gene expression data leads to a better understanding of protein or gene function and reveals global changes in network topology that hint at higher level cellular organizational principles and functions [16]. Furthermore, by integrating the static PPI network with time course gene expression data, we can regulate the changes in protein relationships and also track the evolutionary process of the functional modules.

After we obtain the gene coexpression networks $G^1, G^2, \ldots, G^T$, we integrate them with the PPI network $P$ by the following rule: if an interaction exists in both the PPI network $P$ and the $i$-th dynamic gene coexpression network $G^i$, this interaction is added to the $i$-th dynamic PPI network $D^i$. Otherwise, we consider that there is no interaction between this protein pair at this timestamp. An example of constructing dynamic PPI networks is presented in Figure 1.
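The construction rule of Section 2.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`pearson`, `coexpression_edges`, `dynamic_ppi`), the edge-set representation (instead of $|V| \times |V|$ matrices) and the toy data are all hypothetical, and using the mean absolute correlation as the default cutoff is only one possible reading of "based on the average correlation similarity".

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def coexpression_edges(expr, cutoff=None):
    """Build one coexpression network G^i from {gene: [expression per sample]}.

    An edge is kept when |Pearson r| is greater than the cutoff. If no cutoff
    is given, the mean |r| over all gene pairs is used (an assumption).
    """
    corr = {frozenset(pair): abs(pearson(expr[pair[0]], expr[pair[1]]))
            for pair in combinations(sorted(expr), 2)}
    if not corr:
        return set()
    if cutoff is None:
        cutoff = sum(corr.values()) / len(corr)
    return {pair for pair, r in corr.items() if r > cutoff}


def dynamic_ppi(ppi_edges, expr_by_time, cutoffs=None):
    """Return [D^1, ..., D^T]: PPI edges that are also coexpressed at each timestamp."""
    ppi = {frozenset(edge) for edge in ppi_edges}
    cutoffs = cutoffs or [None] * len(expr_by_time)
    return [ppi & coexpression_edges(expr, cutoff)
            for expr, cutoff in zip(expr_by_time, cutoffs)]


# Toy data (made up): 3 proteins, 2 timestamps, 3 replicates per timestamp.
ppi = [("A", "B"), ("B", "C")]
t1 = {"A": [1.0, 2.0, 3.0], "B": [2.0, 4.0, 6.1], "C": [3.0, 1.0, 2.0]}
t2 = {"A": [1.0, 2.0, 3.0], "B": [3.0, 1.0, 2.0], "C": [6.0, 2.1, 4.0]}
networks = dynamic_ppi(ppi, [t1, t2], cutoffs=[0.9, 0.9])
```

In this toy example the A-B interaction survives only at the first timestamp and the B-C interaction only at the second, because each pair is strongly coexpressed at just one of the two timestamps.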

Figure 1: An example of constructing dynamic PPI networks at five timestamps.

2.2 Functional Influence based Functional Module Detection

In recent years, many methods have been developed to detect functional modules in a PPI network, such as Markov Clustering (MCL) [5], which is a fast stochastic flow based clustering algorithm for graphs, hierarchical clustering [7] and spectral clustering [24]. Furthermore, two of our previous algorithms based on functional influence efficiently analyzed large, complex PPI networks [3, 20]. The functional influence algorithm was first proposed by Nabieva et al. [13]. Its basic idea is that influence is propagated from the source proteins to the surrounding neighborhoods, and this process is repeated for each protein until each protein in the graph has an influence score. This influence score represents the amount of functional influence received by the protein for a given function. However, since these approaches are not designed for dynamic graph clustering, they do not consider the temporal characteristics of dynamic PPI networks, where the interactions between proteins continuously evolve. Therefore, we design a novel functional influence based method which can effectively identify protein functional modules that reflect the temporal evolution over consecutive timestamps. Our method also allows modules to overlap and can automatically estimate the optimal number of modules at each timestamp.

The Principle of Module Equivalence. Since living systems are subjected to external stimuli, the interactions between proteins also evolve with time, which raises a new challenge for traditional clustering algorithms. In our case the clusters evolve continuously, which differs from the case that traditional clustering algorithms usually handle, so some new considerations are needed.
On one hand, we expect to detect functional modules that depend on the current PPI network; on the other hand, we also expect that the detected functional modules do not deviate too dramatically from the PPI network of the previous timestamp. Similar principles have also been used in [2]. In other words, since a living system is more likely to change gradually than dramatically, we expect a certain level of module equivalence between functional modules detected at consecutive timestamps. Moreover, in many cases, a dramatic change of functional modules over a short time could be due to noise, which may come from sample contamination, experimental design or the clustering method. Fulfilling the module equivalence can also help in generating more robust results that are not sensitive to noise; this is validated in the experiments.

Figure 2: An example illustrating module equivalence. (a) The clustering results evolve gradually; (b) the clustering results change dramatically.

Consider the simple example shown in Figure 2. There are two clustering results, (a) and (b), of 7 proteins over 3 timestamps, where each node is a protein and the nodes enclosed together denote a cluster. The proteins partitioned into the same cluster are stable in result (a), where each cluster changes gradually over time. In contrast, the proteins partitioned together in result (b) change dramatically. Therefore, according to the principle of module equivalence, (a) should be preferred. It is also easier and more reasonable to track the evolutionary patterns of functional modules in (a) than in (b).

To achieve a certain level of module equivalence between functional modules at consecutive timestamps, we propose a method to construct a series of weighted dynamic PPI networks, which takes the PPI network from the previous timestamp into account and guarantees that the modules change smoothly over consecutive timestamps. Given the $T$ unweighted dynamic PPI networks $D^1, D^2, \ldots, D^T$ introduced in Section 2.1, we aim at constructing $T$ weighted dynamic PPI networks $WD^1, WD^2, \ldots, WD^T$, where each dynamic PPI network can be represented as $WD^i = (V^i, E^i)$. The weight between proteins $u$ and $v$ in $WD^i$ is defined as:

$$ WD^i_{uv} = \begin{cases} \alpha, & \text{if } D^{i-1}_{uv} = 1 \text{ xor } D^{i}_{uv} = 1, \\ \beta, & \text{if } D^{i-1}_{uv} = 1 \text{ and } D^{i}_{uv} = 1, \\ 0, & \text{otherwise,} \end{cases} \tag{1} $$

where $\alpha$ and $\beta$ are pre-set weights, and $0 \le \alpha < \beta \le 1$. The assumption is that the weight of an interaction between proteins $u$ and $v$ at the $i$-th timestamp is based on both unweighted dynamic PPI networks $D^{i-1}$ and $D^i$. If a particular interaction exists at both of these consecutive timestamps, we have high confidence that this interaction is reliable and stable, and thus it is assigned a high weight $\beta$. If the interaction exists at only one of the two consecutive timestamps, we are less confident that it does not come from noise, and thus it is assigned a relatively low weight $\alpha$. This can also be viewed as using the previous PPI network as evidence to weigh the current network. In addition, when $i = 1$ there is no previous timestamp, thus $WD^1_{uv} = \alpha$ if there is an interaction between proteins $u$ and $v$ in $D^1$. In our experiments, we set $\alpha = 0.1$ and $\beta = 0.2$.

Functional Flow Model. Based on the weighted dynamic PPI networks $WD^i$ ($1 \le i \le T$), we design a modified influence based functional module detection algorithm. We first select, based on the weighted degrees of the proteins, a source protein set $S$ whose members are the starting points for propagating the influence. Previous research [9] has observed that the connectivity of nodes in biological networks plays a crucial role in cellular functions. The weighted degree of protein $u$, denoted $d(u)$, is the sum of the weights between $u$ and its neighbors, as shown in Eq. 2, where $N(u)$ is the set of neighbors of protein $u$ and $w_{uv}$ is the weight of the edge between proteins $u$ and $v$:

$$ d(u) = \sum_{v \in N(u)} w_{uv}. \tag{2} $$

Secondly, we assign an initial influence weight to each source protein $s$ ($s \in S$) and propagate the weight to its neighbors $x$. The initial flow $f(s \to x)$ from $s$ to $x$ is computed as:

$$ f(s \to x) = \frac{w_{sx}}{\sum_{z \in N(s)} w_{sz}} F(s), \tag{3} $$

where $F(s)$ is the initial influence score of the source protein, which we set to the constant value 1, and $w_{sx}$ is the weight of the edge between $s$ and $x$, normalized by the total weight of the edges incident to $s$. The influence score of $x$ is then updated by summing all incoming flows from its neighbors, as shown in Eq. 4 (the subscript $s$ indicates the source protein whose influence is being propagated):

$$ F(x) = \sum_{u \in N(x)} f_s(u \to x). \tag{4} $$

After updating its influence weight, $x$ propagates the influence weight to its neighbors $y$:

$$ f(x \to y) = \frac{w_{xy}}{\sum_{z \in N(x)} w_{xz}} F(x). \tag{5} $$

The flow $f(x \to y)$ is removed if it is less than a threshold $\theta_{flow}$. Eq. 4 and Eq. 5 are repeated until there is no more flow in the network. At the end of the flow simulation, we obtain a flow pattern, which is an $|S| \times |V|$ matrix where each row vector holds the cumulative quantities of functional influence that a particular source protein $s$ exerts on all the proteins. The functional influence profile is a vector where each item reflects the functional influence received from a source protein in the network. In the flow pattern, all the proteins whose functional influence score is higher than the threshold $\theta_{flow}$ are grouped into a functional module.

Merging Preliminary Modules. Note that the preliminary modules extracted from the flow pattern typically overlap, since a protein may receive a high functional influence from multiple source proteins. However, the quality of these preliminary modules mainly depends on the source protein selection. By merging similar preliminary modules which share a large fraction of their members, we obtain the final modules, which have higher accuracy.
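The weighting rule of Eq. 1 and the flow propagation of Eqs. 2 to 5 that produce these preliminary modules can be sketched as follows. This is a rough sketch under several assumptions rather than the authors' implementation: all function names are hypothetical, networks are stored as sets of `frozenset` edges, a protein forwards only the influence it received in the previous round, and the loop is capped by a `max_rounds` parameter because the text does not say how flow that circulates around cycles is damped. Picking the `k` proteins of highest weighted degree as sources is likewise only one possible reading of the selection step.

```python
def weighted_network(prev_edges, curr_edges, alpha=0.1, beta=0.2):
    """Eq. 1: beta if an edge is in both D^{i-1} and D^i, alpha if in exactly one."""
    return {edge: (beta if edge in prev_edges and edge in curr_edges else alpha)
            for edge in prev_edges | curr_edges}


def adjacency(weights):
    """Turn {frozenset({u, v}): w} into a symmetric neighbour map {u: {v: w}}."""
    adj = {}
    for edge, w in weights.items():
        u, v = tuple(edge)
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def pick_sources(adj, k):
    """Assumed selection rule: the k proteins with the largest weighted degree (Eq. 2)."""
    return sorted(adj, key=lambda u: sum(adj[u].values()), reverse=True)[:k]


def flow_profile(adj, source, theta=0.02, max_rounds=10):
    """Cumulative influence received by each protein from one source protein."""
    received = {source: 1.0}      # F(s) = 1
    frontier = {source: 1.0}      # influence still to be passed on
    for _ in range(max_rounds):   # cap on rounds is an assumption of this sketch
        incoming = {}
        for x, fx in frontier.items():
            neighbours = adj.get(x, {})
            total = sum(neighbours.values())        # weighted degree d(x), Eq. 2
            for y, w in neighbours.items():
                flow = w / total * fx               # Eqs. 3 and 5
                if flow >= theta:                   # flows below theta_flow are dropped
                    incoming[y] = incoming.get(y, 0.0) + flow   # Eq. 4
        if not incoming:
            break
        for y, flow in incoming.items():
            received[y] = received.get(y, 0.0) + flow
        frontier = incoming
    return received


def preliminary_module(adj, source, theta=0.02, max_rounds=10):
    """Proteins whose cumulative influence from `source` exceeds theta_flow."""
    profile = flow_profile(adj, source, theta, max_rounds)
    return {p for p, score in profile.items() if score > theta}


# Toy example (made up): A-B and B-C persist across two timestamps, C-D is new.
prev = {frozenset("AB"), frozenset("BC")}
curr = {frozenset("AB"), frozenset("BC"), frozenset("CD")}
weights = weighted_network(prev, curr)
adj = adjacency(weights)
profile = flow_profile(adj, "A", max_rounds=2)
```

For the first timestamp, passing an empty set as `prev_edges` reproduces the special case $WD^1_{uv} = \alpha$. One row of the flow pattern corresponds to one call of `flow_profile`, so the full $|S| \times |V|$ pattern would be obtained by calling it once per source protein.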
It is an important step to merge the similar preliminary modules to generate the final modules [6]. Since these final modules are merged from overlapping preliminary modules, they also overlap. Real functional modules are likely to overlap, since a molecule may generally take part in different biological processes or functions in different environments [26]. In our work, we set $\theta_{flow} = 0.02$.

We use a hierarchical clustering algorithm to merge the preliminary modules based on the Jaccard index between modules [25]. However, one difficult issue in functional module detection is determining the number of clusters. Classic hierarchical clustering algorithms suffer from the limitation that the number of clusters must be specified by the user. It is impractical to expect sufficient domain knowledge to determine the number of modules for each timestamp, and it is unreasonable to assume that the number of clusters is the same at every timestamp. Therefore, we use the L-curve method proposed in [19], which automatically estimates the optimal number of clusters from the knee of the curve, to identify the appropriate number of functional modules. In our method, the number of clusters is therefore unbounded, and an optimal number is determined automatically.

2.3 Evolutionary Events

Recently, a few approaches have been proposed to characterize the evolution of clusters over consecutive timestamps in social networks. Takaffoli et al. [22] described an event-based framework to track the transitions between clusters at consecutive timestamps, and they improved the event formulae to track the entire observation time in a later work [21]. All these works use a two-stage approach in which the clusters are first detected independently at each timestamp, and then matched to determine the critical evolutionary events. As mentioned before, the functional modules we detect at each timestamp are simultaneously influenced by two consecutive timestamps, which makes our framework different. We believe that analyzing the evolutionary pattern of the functional modules detected at each timestamp, including form, dissolve, continue, merge and split, can help us discover underlying evolutionary trends or behaviors of different diseases or species.

We state the problem of characterizing the evolutionary pattern of the functional modules in dynamic PPI networks in the following way. At a particular timestamp $i$, we detect $k_i$ functional modules from the weighted dynamic PPI network $WD^i$ described in the previous section, denoted as $C^i = \{C^i_1, C^i_2, \ldots, C^i_{k_i}\}$. Note that the modules generated by our method may overlap. The evolutionary patterns of functional modules can be represented as a sequence of key evolutionary events (changes) over consecutive timestamps. These key evolutionary events cover the evolution of functional modules and can be further formulated as a set of rules. We use the definition of transitionary events from [21], but we focus only on tracking the informative events between consecutive timestamps instead of over the entire observation period. Given a module $C^i_x$ from the $i$-th timestamp, the metric which tracks the optimal module, that is, the module with the highest similarity to $C^i_x$ at the $(i+1)$-th timestamp, is defined as:

$$ \mathrm{track}(C^i_x, i+1) = C^{i+1}_y \iff C^{i+1}_y = \arg\max_{C^{i+1}_z \in C^{i+1}} \left\{ \frac{|V^i_x \cap V^{i+1}_z|}{\max(|V^i_x|, |V^{i+1}_z|)} \ge \alpha \right\}, \tag{6} $$

where $V^i_x$ is the set of proteins of $C^i_x$, and the overlap threshold $\alpha$ defines whether two modules are matched; it is also used in the definitions of the evolutionary events below. Thus $\mathrm{track}(C^i_x, i+1)$ denotes the optimal matching module for $C^i_x$ at the $(i+1)$-th timestamp. If none of the modules in $C^{i+1}$ has an overlap ratio of at least $\alpha$, then $\mathrm{track}(C^i_x, i+1) = \emptyset$ ($\emptyset$ denotes an empty matching result). It is worth mentioning that this metric can also be used in the reverse direction with a simple revision.
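The tracking metric of Eq. 6 can be sketched in Python together with a simplified reading of the events that the following paragraphs define on top of it. This is a minimal sketch under assumptions: the function names are hypothetical, modules are plain non-empty Python sets of protein identifiers, `None` stands in for the empty match $\emptyset$, and ties between equally similar candidates are broken in favour of the first one. The helper `contained_modules` reflects one interpretation of the containment ratios used for split and merge, not a confirmed transcription of the formulas.

```python
def similarity(a, b):
    """Size of the intersection over max(|a|, |b|), the ratio compared with alpha in Eq. 6."""
    return len(a & b) / max(len(a), len(b))


def track(module, candidates, alpha):
    """Index of the most similar module in `candidates`, or None for an empty match."""
    best, best_score = None, 0.0
    for idx, cand in enumerate(candidates):
        score = similarity(module, cand)
        if score >= alpha and score > best_score:
            best, best_score = idx, score
    return best


def basic_events(prev_modules, curr_modules, alpha):
    """Label continue / dissolve (forward tracking) and form (backward tracking)."""
    events = []
    for x, module in enumerate(prev_modules):
        y = track(module, curr_modules, alpha)
        events.append(("dissolve", x) if y is None else ("continue", x, y))
    for y, module in enumerate(curr_modules):
        if track(module, prev_modules, alpha) is None:
            events.append(("form", y))
    return events


def contained_modules(module, others, alpha):
    """Indices of modules in `others` with at least a fraction alpha of their members in `module`.

    With `others` taken from the next timestamp this lists split candidates;
    with `others` taken from the previous timestamp it lists merge candidates.
    """
    return [k for k, other in enumerate(others)
            if len(module & other) / len(other) >= alpha]


# Toy example (made up): three modules at timestamp i, four at timestamp i + 1.
prev = [{"a", "b", "c", "d"}, {"e", "f"}, {"g", "h"}]
curr = [{"a", "b"}, {"c", "d"}, {"e", "f", "x"}, {"y", "z"}]
events = basic_events(prev, curr, alpha=0.5)
split_targets = contained_modules(prev[0], curr, alpha=0.5)
```

In the toy data, the first module at timestamp $i$ is both tracked to one successor and reported as covering two successor modules, which is the situation the split event is meant to capture.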
The formal definitions of the five evolutionary events are as follows.

Form. A particular functional module $C^i_x$ is marked as form if it did not exist in the previous timestamp. More specifically, a form indicates that it is the first time a set of proteins is grouped together to perform some function; examples are modules $C^1_1$, $C^1_2$ and $C^2_4$ in Figure 3. Thus module $C^i_x$ is formed in the $i$-th timestamp iff:

$$ \mathrm{track}(C^i_x, i-1) = \emptyset. \tag{7} $$

Dissolve. A dissolve occurs for a particular functional module $C^i_x$ if no similar module exists in the next timestamp. Specifically, a dissolve indicates that it is the last time a set of proteins is grouped together to perform some function; an example is module $C^1_3$ in Figure 3. Formally, a module $C^i_x$ in the $i$-th timestamp is defined as dissolve iff:

$$ \mathrm{track}(C^i_x, i+1) = \emptyset. \tag{8} $$

Continue. A continue occurs if there is a particular functional module $C^{i+1}_y$ detected at timestamp $i+1$ that is close to a module $C^i_x$ at the previous timestamp $i$. We then say that $C^{i+1}_y$ is the continuation of $C^i_x$ in the next timestamp. It can also be considered as a module which continues its existence over the consecutive timestamps. Note that we do not require the two modules to be exactly the same. In Figure 3, module $C^2_3$ is the continuation of module $C^1_2$. Formally, a module $C^i_x$ in the $i$-th timestamp continues its existence to the $(i+1)$-th timestamp iff:

$$ \exists\, C^{i+1}_y \in C^{i+1} : \mathrm{track}(C^i_x, i+1) = C^{i+1}_y. \tag{9} $$

Split. If a particular functional module $C^i_x$ in the $i$-th timestamp is matched to a set of modules $C^{i+1}_{*} = \{C^{i+1}_1, C^{i+1}_2, \ldots, C^{i+1}_k\}$ in the coming $(i+1)$-th timestamp, then we say $C^i_x$ is split into $C^{i+1}_1, C^{i+1}_2, \ldots, C^{i+1}_k$; note that $C^{i+1}_{*} \subseteq C^{i+1}$. For example, in Figure 3, module $C^1_1$ is split into two modules, $C^2_1$ and $C^2_2$, in the next timestamp. Formally, a module $C^i_x$ in the $i$-th timestamp is split into a set of modules $C^{i+1}_1, C^{i+1}_2, \ldots, C^{i+1}_k$ in the $(i+1)$-th timestamp iff:

$$ \exists\, C^{i+1}_{*} = \{C^{i+1}_1, C^{i+1}_2, \ldots, C^{i+1}_k\} \subseteq C^{i+1} : \forall\, C^{i+1}_y \in C^{i+1}_{*} : \frac{|V^i_x \cap V^{i+1}_y|}{|V^{i+1}_y|} \ge \alpha. \tag{10} $$

Merge. If a particular functional module $C^{i+1}_x$ in the $(i+1)$-th timestamp is matched to a set of modules $C^{i}_{*} = \{C^i_1, C^i_2, \ldots, C^i_k\}$ in the previous $i$-th timestamp, then we say $C^{i+1}_x$ is merged from $C^i_1, C^i_2, \ldots, C^i_k$, where $C^{i}_{*} \subseteq C^i$. For example, in Figure 3, module $C^3_2$ is merged from three modules, $C^2_2$, $C^2_3$ and $C^2_4$, in the previous timestamp. Formally, a set of modules $C^i_1, C^i_2, \ldots, C^i_k$ in the $i$-th timestamp is merged into a module in the $(i+1)$-th timestamp iff:

$$ \exists\, C^{i+1}_x \in C^{i+1} : \forall\, C^i_y \in C^{i}_{*} : \frac{|V^i_y \cap V^{i+1}_x|}{|V^i_y|} \ge \alpha. \tag{11} $$

Figure 3: An example of the evolution of functional modules over three timestamps, covering the five evolutionary events: form, dissolve, continue, split and merge.

3. EXPERIMENTS

In this section, we show the experimental results of our proposed framework.

3.1 Dataset

To construct the dynamic PPI networks, we used two data sources: a static PPI network and time course gene expression data.

Time Course Gene Expression Data. We use a time course gene expression dataset that records the response of S. cerevisiae during a 15-day wine fermentation, the process in which S. cerevisiae converts the sugar of crushed grapes into alcohol. The dataset consists of seven timestamps (0, 12, 24, 48, 60, 120, and 340 hours, which correspond to different ethanol concentrations), with one gene expression matrix per timestamp. To obtain a high coverage of the PPI network, we used the top 1285 genes with the most known interactions in DIP's PPI dataset¹. For each of the 1285 genes, the primary data consist of three independent biological samples at each of the seven timestamps. The raw microarray data were published on Apr. 17, 2008 and are available from the National Center for Biotechnology Information² (NCBI) database under accession number GSE8536 [12]. In our experiments, we set the cutoff thresholds for the correlation matrices of the seven timestamps to 0.76, 0.76, 0.83, 0.79, 0.73, 0.76 and 0.70, respectively, corresponding to their average correlation similarity.

PPI Network. We used the S. cerevisiae data from the Database of Interacting Proteins³ (DIP), as updated on Feb. 28, 2012. The S. cerevisiae PPI dataset contains 22,418 interactions in total.

3.2 Similarity between Functional Modules over Timestamps

As mentioned before, a real cellular system evolves gradually over time; we therefore expect the functional modules detected at successive timestamps to change smoothly rather than dramatically. We assessed the similarity of functional modules across timestamps by comparing the proposed method with several classical clustering methods: K-means, Hierarchical clustering, Fuzzy c-means clustering (FCM) and Spectral clustering.
In addition, since these baseline algorithms require the number of clusters K to be set in advance, we tested each algorithm with both K = 15 and K = 30. Note that among these baselines, K-means, Hierarchical clustering and Spectral clustering are non-overlapping clustering algorithms, whereas Fuzzy c-means is an overlapping clustering algorithm in which each node has a membership value for each cluster. In our experiments, if the membership value of a node x for a cluster $C_j^i$ is larger than 0.1, we assign x to $C_j^i$. We also report the performance of our proposed method without considering module equivalence across consecutive timestamps.

To measure the similarity between functional modules, we use the Jaccard index, which is defined as:

$$J(C_x^i, C_y^{i+1}) = \frac{|V_x^i \cap V_y^{i+1}|}{|V_x^i \cup V_y^{i+1}|}, \quad (12)$$

which lies between 0 and 1. We then took the maximal Jaccard value of each module at a given timestamp and averaged these values to obtain the final result; a high value indicates that the modules detected at the two timestamps are similar, and a low value that they are dissimilar. The results of all the methods are shown in Table 1. As can be seen, our proposed method yields higher module similarity than the other methods over all timestamps, since the baseline algorithms only consider the PPI network at the current timestamp. This demonstrates that our proposed framework properly captures the smooth evolution of functional modules.

¹ As listed at www.acsu.buffalo.edu/~nandu/genenames.docx
² www.ncbi.nlm.nih.gov/
³ http://dip.doe-mbi.ucla.edu/dip/

3.3 Functional Module Identification

To evaluate the effectiveness of our proposed framework, we used FunCat from the MIPS database as the functional annotation. The MIPS Functional Catalogue (FunCat) [18] is an annotation scheme for the functional description of proteins of prokaryotic and eukaryotic origin, and we used the top four levels of FunCat for validation.
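To make the Section 3.2 similarity score concrete, the following minimal Python sketch computes the Jaccard index of Eq. (12) and the averaged best-match score between two consecutive timestamps. It assumes that the average runs over the modules of the earlier timestamp, which the text does not state explicitly; the function names and the toy protein sets are hypothetical.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Eq. (12): |a ∩ b| / |a ∪ b|; taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def avg_max_jaccard(current: List[Set[str]], following: List[Set[str]]) -> float:
    """Average, over modules at timestamp i, of the best Jaccard match at i+1.

    Assumption: the average is taken over the modules of the earlier timestamp.
    """
    if not current or not following:
        return 0.0
    best = [max(jaccard(c, n) for n in following) for c in current]
    return sum(best) / len(best)


# Toy modules (hypothetical protein names, not taken from the dataset).
modules_t0 = [{"A", "B", "C"}, {"D", "E"}]
modules_t1 = [{"A", "B"}, {"D", "E", "F"}, {"G"}]
score = avg_max_jaccard(modules_t0, modules_t1)
print(round(score, 3))
```

In this toy case both modules at the earlier timestamp have a best match with Jaccard value 2/3, so the averaged score is also 2/3.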
For statistical evaluation of the detected modules, we used the p-value from the hypergeometric distribution, which is defined as:

$$p = 1 - \sum_{i=0}^{m-1} \frac{\binom{X}{i}\binom{V-X}{n-i}}{\binom{V}{n}}, \quad (13)$$

where V is the number of proteins in the PPI network, X is the number of proteins in a reference function, n is the size of the module, and m is the number of proteins shared by the function and the module. It is the probability that at least m proteins in a module of size n belong to a reference function of size X. A low p-value indicates that the module closely corresponds to the function, since the network is unlikely to produce the module by chance. As before, we assessed the performance of the proposed algorithm by comparing it with the baseline algorithms described in Section 3.2. The results are shown in Table 2. As the table shows, our proposed framework clearly outperforms the baseline algorithms at each timestamp. This result indicates two things: 1) by following the principle of module equivalence, our functional influence based method provides more robust functional modules that are not sensitive to noise; and 2) our functional influence based overlapping functional module detection algorithm is more effective.

3.4 Informative Module Identification

In this part, we used the evolutionary events defined in Section 2.3 to track informative behavioral patterns in the evolving graph. We define a core-module as the intersection of a series of modules that are linked into a connected graph by the evolutionary events at different timestamps; it represents the evolution of its constituent modules, ordered by time, over the entire set of timestamps. More specifically, the core-module is denoted as $M = \{C_{k_1}^{t_1} \cap C_{k_2}^{t_2} \cap \ldots \cap C_{k_m}^{t_m}\}$, where $t_1 < t_2 < \ldots < t_m$. By tracking the critical evolutionary events between timestamps, we found some interesting results.
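As a worked illustration of the enrichment p-value in Eq. (13) of Section 3.3, the sketch below evaluates it with exact integer binomial coefficients. It sums the upper tail directly, which is mathematically equivalent to one minus the lower tail in Eq. (13) but avoids cancellation for very small p-values. The function name, the toy numbers and the choice of a base-10 logarithm are assumptions made for illustration only.

```python
from math import comb, log10


def enrichment_p_value(V: int, X: int, n: int, m: int) -> float:
    """Probability that at least m of the n module proteins fall in a
    reference function of size X, given V proteins in the network.

    Equivalent to Eq. (13): the upper tail equals 1 minus the lower tail.
    """
    upper_tail = sum(
        comb(X, i) * comb(V - X, n - i)   # comb returns 0 when n - i > V - X
        for i in range(m, min(n, X) + 1)
    )
    return upper_tail / comb(V, n)


# Toy numbers (hypothetical): 10 proteins, a function with 4 members,
# a module of size 3 that shares 2 proteins with the function.
p = enrichment_p_value(V=10, X=4, n=3, m=2)
print(p, -log10(p))  # the p-value and its negative log (base 10 assumed)
```

For these toy numbers the upper tail is (6·6 + 4·1)/120 = 1/3, which matches 1 − 80/120 from the form in Eq. (13).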
Table 1: Comparison of module similarity across timestamps

| Method | t=0-12 | t=12-24 | t=24-48 | t=48-60 | t=60-120 | t=120-340 | Ave |
|---|---|---|---|---|---|---|---|
| Evolution Flow | 0.49 | 0.53 | 0.55 | 0.53 | 0.51 | 0.51 | 0.52 |
| Evolution Flow (Without Smoothness) | 0.24 | 0.29 | 0.32 | 0.29 | 0.3 | 0.28 | 0.28 |
| K-means (K=15) | 0.10 | 0.13 | 0.07 | 0.09 | 0.09 | 0.10 | 0.10 |
| K-means (K=30) | 0.19 | 0.23 | 0.24 | 0.21 | 0.21 | 0.2 | 0.21 |
| FCM (K=15) | 0.22 | 0.21 | 0.22 | 0.22 | 0.22 | 0.25 | 0.21 |
| FCM (K=30) | 0.16 | 0.15 | 0.22 | 0.24 | 0.14 | 0.15 | 0.17 |
| Spectral Clustering (K=15) | 0.24 | 0.27 | 0.30 | 0.30 | 0.26 | 0.21 | 0.26 |
| Spectral Clustering (K=30) | 0.2 | 0.16 | 0.21 | 0.17 | 0.17 | 0.22 | 0.18 |

Table 2: Comparison of -log(p-value)

| Method | t=0 | t=12 | t=24 | t=48 | t=60 | t=120 | t=340 | Ave |
|---|---|---|---|---|---|---|---|---|
| Evolution Flow | 7.51 | 10.64 | 9.03 | 11.71 | 8.99 | 9.56 | 10.46 | 9.7 |
| K-means (K=15) | 4.66 | 3.64 | 3.79 | 4.63 | 4.48 | 3.92 | 4.21 | 4.19 |
| K-means (K=30) | 4.21 | 4.34 | 4.13 | 3.84 | 3.82 | 4.01 | 3.64 | 3.99 |
| FCM (K=15) | 6.69 | 8.26 | 9.09 | 6.77 | 8.03 | 5.43 | 8.4 | 7.52 |
| FCM (K=30) | 5.18 | 6.8 | 6.5 | 5.79 | 6.27 | 5.52 | 7.39 | 6.85 |
| Spectral Clustering (K=15) | 6.25 | 7.97 | 8.57 | 9.57 | 8.14 | 8.17 | 7.21 | 7.98 |
| Spectral Clustering (K=30) | 5.56 | 5.33 | 5.52 | 5.29 | 5.32 | 4.75 | 5.31 | 5.29 |

Figure 4 shows the evolving graphs for four α values: 0.6, 0.7, 0.8 and 0.9. In the evolving graph, each node is a functional module detected at a particular timestamp, and each edge is an evolutionary event linking modules in two consecutive timestamps. Figure 4 shows that, as α increases, fewer and fewer evolutionary events are detected, and the backbone of the evolution becomes clearer. Finally, when α = 0.9, we can detect a module that is consistent over all timestamps. To make this clearer, we extracted this module and marked it with dashed lines in Figure 4(d). The core-module is $M = \{C_1^1 \cap C_2^2 \cap C_2^3 \cap C_1^4 \cap C_1^5 \cap C_3^6 \cap C_1^7\}$, which includes 25 core proteins: POL30, RAD1, PIN3, RAD23, HRT1, YOL087C, RAD7, UBA1, MET30, MGT1, RVS167, HSE1, CDC48, SAN1, PRP8, RPL40A, SNF1, CLB2, KSS1, SWD1, RPL40B, MUS81, SWI5, GRR1 and GPA1. This consistency shows that the proteins in the core-module interact strongly over the entire observation period. This is not surprising, since this functional module is essentially involved in cell growth and cell death, as well as in the response to changing ethanol concentrations. Such consistency in the evolutionary pattern of this module may provide clues about how proteins respond to external stimuli as the wine fermentation progresses.
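The effect of α on the evolving graph, and the extraction of a core-module from a chain of linked modules, can be sketched as follows. This is a simplified illustration rather than the paper's procedure: it assumes that an edge is drawn whenever either overlap ratio from Eqs. (10) and (11) reaches α, it ignores the track function used for the continue, form and dissolve events, and all names and protein sets are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Module = Set[str]  # a module is modelled as a set of protein names


def passes_overlap_test(v_prev: Module, v_next: Module, alpha: float) -> bool:
    """True if either overlap ratio from Eqs. (10)-(11) reaches alpha."""
    shared = len(v_prev & v_next)
    if shared == 0:
        return False  # no common protein, so neither ratio can reach alpha > 0
    split_ratio = shared / len(v_next)  # Eq. (10): share of the later module
    merge_ratio = shared / len(v_prev)  # Eq. (11): share of the earlier module
    return split_ratio >= alpha or merge_ratio >= alpha


def event_edges(prev: Dict[str, Module], nxt: Dict[str, Module],
                alpha: float) -> List[Tuple[str, str]]:
    """Edges of the evolving graph between two consecutive timestamps."""
    return [(x, y) for x, vx in prev.items() for y, vy in nxt.items()
            if passes_overlap_test(vx, vy, alpha)]


def core_module(chain: List[Module]) -> Module:
    """Intersection of the modules along one chain of linked modules."""
    return set.intersection(*chain) if chain else set()


# Toy modules at two consecutive timestamps (hypothetical protein names).
prev_modules = {"C1": {"A", "B", "C", "D"}, "C2": {"E", "F", "G"}}
next_modules = {"C1": {"A", "B"}, "C2": {"C", "D", "X"}, "C3": {"E", "F", "G", "H"}}

for alpha in (0.6, 0.9):
    print(alpha, event_edges(prev_modules, next_modules, alpha))
print(core_module([prev_modules["C2"], next_modules["C3"]]))
```

In this toy example the link between the earlier module C1 and the later module C2 has overlap ratios 2/3 and 1/2, so it is present at α = 0.6 and disappears at α = 0.9, mirroring the thinning of the evolving graph as α grows.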
The top 10 biological process annotations of the core-module M, all with very low p-values, are shown in Table 3; they were computed with BiNGO [11]. Several functional keywords, such as protein ubiquitination, protein conjugation, post-translational modification, response to stimulus and catabolic process, have been shown to play an important role in S. cerevisiae fermentation [15, 14, 4].

Table 3: Top 10 biological process annotations for the core-module M

| GO-ID | p-value | Description |
|---|---|---|
| 16567 | 7.92E-10 | protein ubiquitination |
| 32446 | 5.48E-09 | protein modification by small protein conjugation |
| 70647 | 3.42E-08 | protein modification by small protein conjugation or removal |
| 43687 | 4.13E-07 | post-translational protein modification |
| 51716 | 8.25E-07 | cellular response to stimulus |
| 43412 | 1.15E-06 | macromolecule modification |
| 42787 | 1.17E-06 | protein ubiquitination involved in ubiquitin-dependent protein catabolic process |
| 6974 | 1.78E-06 | response to DNA damage stimulus |
| 6464 | 1.88E-06 | protein modification process |
| 50896 | 3.64E-06 | response to stimulus |

Figure 4: Plot of the evolving graph with varying α values.

4. CONCLUSIONS

In this paper, we proposed a framework for analyzing the evolutionary patterns of functional modules in dynamic PPI networks. Because this framework takes into account the inherent dynamic characteristics of PPI networks, it may provide novel insights into the underlying behavior of the molecular system. To the best of our knowledge, this is the first evolutionary analysis of functional modules in dynamic PPI networks. Using the S. cerevisiae wine fermentation dataset over consecutive timestamps, we demonstrated with high confidence the gene annotation enrichment of the identified functional modules, that is, the sets of proteins that participate in the same biological function. In addition, the results of the experiment in Section 3.4 lead to the conclusion that the proposed framework can categorize and track the evolutionary events of functional modules effectively, and that it yields an informative functional module which plays an important role over the entire observation time. By analyzing in depth the gene annotations of functional modules whose evolutionary patterns are distinctive, we may gain important insights into various diseases or organisms.

5. REFERENCES

[1] K. Basso et al. Reverse engineering of regulatory networks in human B cells. Nature Genetics, 37(4):382-390, 2005.
[2] Y. Chi et al. On evolutionary spectral clustering. ACM Transactions on Knowledge Discovery from Data, 3(4):1-30, 2009.
[3] Y.-R. Cho, L. Shi, and A. Zhang. flownet: Flow-based approach for efficient analysis of complex biological networks. 2009 Ninth IEEE International Conference on Data Mining, pages 91-100, 2009.
[4] J. Ding et al. Tolerance and stress response to ethanol in the yeast Saccharomyces cerevisiae. Applied Microbiology and Biotechnology, 74(2):253-263, 2010.
[5] A. J. Enright, S. Van Dongen, and C. A. Ouzounis. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research, 30(7):1575-1584, 2002.
[6] L. Getoor and C. P. Diehl. Link mining: a survey. SIGKDD Explor. Newsl., 7(2):3-12, Dec. 2005.
[7] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. PNAS, pages 1-9, 2002.
[8] J.-D. J. Han et al. Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature, 430(6995):88-93, 2004.
[9] H. Jeong, S. P. Mason, A.-L. Barabási, and Z. N. Oltvai. Lethality and centrality in protein networks. Nature, 411(6833):41-42, 2001.
[10] H. K. Lee et al. Coexpression analysis of human genes across many microarray data sets. Genome Research, 14(6):1085-1094, 2004.
[11] S. Maere, K. Heymans, and M. Kuiper. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics, 21(16):3448-3449, 2005.
[12] V. Marks et al. Dynamics of the yeast transcriptome during wine fermentation reveals a novel fermentation stress response. FEMS Yeast Research, 8(1):35-52, 2008.
[13] E. Nabieva et al. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics, 21 Suppl 1:302-310, 2005.
[14] S. Ostergaard, L. Olsson, and J. Nielsen. Metabolic engineering of Saccharomyces cerevisiae. Microbiology and Molecular Biology Reviews, 64(1):34-50, 2000.
[15] N. Piggott, M. Cook, M. Tyers, and V. Measday. Genome-wide fitness profiles reveal a requirement for autophagy during yeast fermentation. G3 (Bethesda), 1(5):353-67, 2011.
[16] T. M. Przytycka, M. Singh, and D. K. Slonim. Toward the dynamic interactome: it's about time. Access, 11(1), 2010.
[17] Y. Qi and H. Ge. Modularity and dynamics of cellular networks. PLoS Computational Biology, 2(12):9, 2006.
[18] A. Ruepp et al. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Research, 32(18):5539-5545, 2004.
[19] S. Salvador and P. Chan. Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. 16th IEEE International Conference on Tools with Artificial Intelligence, 1(Ictai):576-584, 2004.
[20] L. Shi, Y.-R. Cho, and A. Zhang. Functional flow simulation based analysis of protein interaction network. BIBE 10, pages 144-149, 2010.
[21] M. Takaffoli, F. Sangi, J. Fagnan, and O. R. Za. Modec - modeling and detecting evolutions of communities. Artificial Intelligence, pages 626-629, 2010.
[22] M. Takaffoli, F. Sangi, J. Fagnan, and O. R. Zaiane. A framework for analyzing dynamic social networks. Science, 2010.
[23] X. Tang, J. Wang, B. Liu, M. Li, G. Chen, and Y. Pan. A comparison of the functional modules identified from time course and static PPI network data. BMC Bioinformatics, 12(1):339, 2011.
[24] S. White and P. Smyth. A spectral clustering approach to finding communities in graphs. Proceedings of the Fifth SIAM International Conference on Data Mining, 119:274, 2005.
[25] A. Zhang. Protein Interaction Networks: Computational Analysis. 2009.
[26] S. Zhang, H.-W. Liu, X.-M. Ning, and X.-S. Zhang. A hybrid graph-theoretic method for mining overlapping functional modules in large sparse protein interaction networks. International Journal of Data Mining and Bioinformatics, 3(1):68-84, 2009.