Building gene networks with time-delayed regulations

Iti Chaturvedi (a,*), Jagath C. Rajapakse (a,b,c)

(a) Bioinformatics Research Center, School of Computer Engineering, Nanyang Technological University, Singapore
(b) Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
(c) Singapore-MIT Alliance, Singapore
* Corresponding author. E-mail address: iti_c@hotmail.com (I. Chaturvedi).

Pattern Recognition Letters 31 (2010). Available online 6 March 2010. © 2010 Elsevier B.V. All rights reserved.

Keywords: Dynamic Bayesian networks; Gene regulatory networks; Viterbi algorithm; Skip-chain model; Genetic algorithms

Abstract

We propose a method to build gene regulatory networks (GRN) capable of representing time-delayed regulations. The gene expression data are represented in two types of graphical models: a linear model using a dynamic Bayesian network (DBN) and a skip model using a hidden Markov model. The linear model is designed to find short delays, and the skip model long delays. The algorithm was tested on time-series data of the yeast cell cycle and validated against protein-protein interaction data. The proposed method fits expression profiles better than the classical higher-order DBN and found core genes that are crucial in cell-cycle regulation.

1. Introduction

Gene expressions, when collected over a sufficiently large number of time points, can be used to derive a gene regulatory network (GRN). GRN represent causal interactions of genes and gene products in biological systems and provide a basis for signal transduction in biological pathways. Regulatory events among genes do not necessarily happen at the same time scale, and several time-delayed interactions are known to exist in biological systems (Wagner and Stolovitzky, 2008). Since signal transduction is transient, the study of its dynamics is essential.

The existing methods of deriving GRN from gene expression time-series can be broadly classified into three categories: networks built by using (a) boolean rules (Li et al., 2007); (b) differential equations (Liu et al., 2009); and (c) stochastic modeling (Gebert et al., 2008). Bayesian networks (BN) were introduced for building gene regulatory networks in the stochastic framework (Friedman et al., 2000). Boolean networks are not causal and are built upon mutual information among nodes; BN are causal and therefore more biologically plausible and accurate than boolean networks. Ordinary differential equations (ODE) can model complex regulatory dynamics, but BN can assist in building such models by finding the underlying structure.

Pathways have a natural representation in BN: genes sit at the nodes of the network, and the edges represent causal interactions among them. The causal dependencies are expressed as conditional probabilities, which infer cause-and-effect relationships among the genes in the network. However, BN are acyclic and cannot track time-delayed, feedback, and self-regulatory events. Dynamic Bayesian networks (DBN) can model the temporal dynamics, where the parents from the previous time instant are assumed to regulate the genes (Friedman et al., 1998). This first-order assumption allows feedbacks but still deprives DBN of the ability to represent time-delayed interactions. When extended to higher order, a DBN is capable of representing delayed interactions.
Mutual information has been used to determine the best time delay (Zhengzheng and Dan, 2006). However, these generative models become computationally intractable at very high orders. We therefore propose a skip model that can handle long delays in regulatory interactions. The skip model is represented by two types of features: (a) linear features modeling short delays; and (b) skip features modeling long delays. In our model, skip features are modeled by a hidden Markov model (HMM) in which the log-likelihood of the network decomposes into a sum of conditional probabilities between consecutive pairs of genes, so the maximum likelihood estimate of a regulation can be found by the Viterbi algorithm (VA).

Our approach consists of two stages: (a) identification of time-delayed interaction features and computation of Viterbi scores; and (b) prediction of the optimal GRN by a genetic algorithm (GA), whose fitness function includes the Viterbi scores of the skip model. We demonstrate the method on a long time-series of yeast cell-cycle data and find core genes known to have regulatory effects on the cell cycle with differing time delays. Earlier, we used skip models to find time-delayed interactions in Mycobacterium tuberculosis (Chaturvedi and Rajapakse, 2009). In this paper, we detail our formulation of skip

models and describe an implementation using a GA. Experiments were performed on larger sets of genes and a higher number of time points obtained on cell-cycle regulation. Further, we validate our results by employing protein-protein interaction (PPI) data: a validation against the BioGrid PPI database (Breitkreutz et al., 2008) for higher-order interactions shows that the method is more effective than existing techniques.

2. Methods

Consider a set of n genes G = {g_i : i = 1, 2, ..., n} and time-series of gene expressions gathered over T time points for all the genes. Let the gene expression data be x = {x_{i,t}}_{n×T}, in which the row vector x_i = (x_{i,t} : t = 1, 2, ..., T) is the expression time-series of gene g_i. Suppose that gene expressions are discretized into a set C of d levels: C = {1, 2, ..., d}. Let the set of parent genes regulating gene g_i be denoted a_i, and let q_i be the number of states that the parent set a_i can take.

2.1. Bayesian networks

The Bayesian network (BN) decomposes the likelihood of gene expressions into a product of conditional probabilities by assuming independence of non-descendant genes, given their parents:

    p(x) = \prod_{i=1}^{n} p(x_i | a_i, \theta_i)    (1)

where x = (x_1, x_2, ..., x_n), p(x_i | a_i, \theta_i) is the conditional probability of gene expression x_i given its parents a_i, and \theta_i denotes the parameters of the conditional probabilities. Given the set of conditional distributions with parameters \theta = {\theta_i : i = 1, 2, ..., n} and a network structure S, the likelihood can be written as

    p(x) = \int p(x | S, \theta) \, p(\theta | S) \, d\theta    (2)

Let \theta_{ijk} = p(x_{i,t} = k | a_i = j), and let N_{ijk} be the number of instances of this event in the training data. Using the property of decomposability (Friedman et al., 1998),

    p(x) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{d} \theta_{ijk}^{N_{ijk}}    (3)

The model parameters \theta are given by the maximum likelihood estimates

    \hat{\theta}_{ijk} = \frac{N_{ijk}}{\sum_{k'=1}^{d} N_{ijk'}}    (4)

and the log-likelihood of the data is then

    \log p(x) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{d} N_{ijk} \log \frac{N_{ijk}}{\sum_{k'=1}^{d} N_{ijk'}}    (5)

This likelihood approximation is known to be good when a large number of data points is available (Friedman et al., 1998).

2.2. Dynamic Bayesian networks (DBN)

The acyclicity of BN does not allow self- and feedback-regulation of genes, which are essential characteristics of GRN. Dynamic Bayesian networks (DBN) overcome this by modeling the regulatory network from one time point to the next. A first-order DBN is defined by a transition network of interactions between a pair of structures (S_t, S_{t+1}) corresponding to time instances t and t + 1: at time t + 1, the parents of the genes are those specified at time t. The gene regulations are obtained by unrolling the transition network over time and assuming first-order stationary behaviour. From Eq. (3), the likelihood of the data is given by

    P(x) = \prod_{t=1}^{T-1} \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{d} \theta_{ijk}^{N^{(t,t+1)}_{ijk}}    (6)

where N^{(t,t+1)}_{ijk} is the number of instances where x_{i,t+1} = k while a_{i,t} = j. A first-order DBN has two layers and therefore 2n nodes.

2.3. Hidden Markov models (HMM)

The classical DBN is unable to capture complex time-dependencies and is therefore extended to an o-order (o >= 2) Markov chain, which predicts the expression levels of a set of genes from the expressions of up to o previous time points using frequency statistics. Such higher-order dynamic Bayesian networks (HDBN) have been proposed to study time-delayed interactions. However, as the order o increases, predicting the delays becomes infeasible because of the difficulty of estimating the growing number of parameters.
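To make the count-based score concrete, the following is a minimal sketch (not the authors' implementation) of how the log-likelihood of Eqs. (5)-(6) can be evaluated for a first-order DBN on a discretized expression matrix; the function and variable names are our own.

```python
import numpy as np
from collections import defaultdict

def dbn_log_likelihood(x, parents):
    """Count-based log-likelihood of a first-order DBN (cf. Eqs. (5)-(6)).

    x       : (n_genes, T) integer array of discretized expression levels
    parents : dict mapping gene index i -> list of parent gene indices a_i
    """
    n, T = x.shape
    loglik = 0.0
    for i in range(n):
        # counts[j][k] is N_ijk: parents of gene i were in joint state j
        # at time t while gene i was at level k at time t + 1.
        counts = defaultdict(lambda: defaultdict(int))
        for t in range(T - 1):
            j = tuple(x[p, t] for p in parents.get(i, []))
            counts[j][x[i, t + 1]] += 1
        for j, row in counts.items():
            total = sum(row.values())
            for k, njk in row.items():
                # ML estimate theta_ijk = N_ijk / sum_k' N_ijk' (Eq. (4))
                loglik += njk * np.log(njk / total)
    return loglik
```

With x discretized into levels {0, ..., d-1} and, for instance, parents = {1: [0]}, the score measures how well the level of gene 0 at time t predicts the level of gene 1 at time t + 1.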
Therefore, we resort to a first-order hidden Markov model (HMM) to determine delayed interactions. It determines the probability of expression of a gene g_j at time point t, given that g_i was observed at s_t, where s_t < t - 1, within a section of the time-series of length t - s_t. Let a sequence of hidden states from time point s_t to t be denoted by y_{s_t:t} = (y_{s_t}, y_{s_t+1}, ..., y_t), where y_{t'} denotes the gene expressed at time point t' on the path. Correspondingly, we have the observed data x_{s_t:t} = (x_{i',s_t}, x_{i',s_t+1}, ..., x_{i',t}), where x_{i',t'} = k is the discretized gene expression state; the states x_{i',t'} in {-1, 0, 1} represent down-, un-, and up-regulated genes. Given the microarray data, maximum likelihood estimation can be used to estimate the state transition and emission probabilities, defined as follows (Cappé et al., 2005):

    a_{l,m} = \frac{M_{l,m}}{\sum_{m'=1}^{n} M_{l,m'}}, \quad \forall y_{t'} = g_l, \; y_{t'+1} = g_m \in G    (7)

    b_l(k) = \frac{M^k_l}{\sum_{k'=1}^{d} M^{k'}_l}, \quad \forall k \in C, \; t' \in \{s_t, s_t+1, ..., t\}, \; g_l \in G    (8)

where M_{l,m} denotes the number of occurrences with x_{l,t'} = x_{m,t'+1} = 1 over t' in {s_t, s_t + 1, ..., t}, and M^k_l denotes the number of occurrences where gene g_l is at discrete state level k over the same window.

2.4. Viterbi algorithm

When the expression time-series are modeled with an HMM, the maximum a posteriori (MAP) estimate can be used to find the time-delayed interaction of a pair of genes. The path begins and ends at the known states of the genes: say, y_{s_t} = g_i and y_t = g_j. We assume that t - s_t is not very large and that the feature vectors are conditionally independent. For a sequence over a set of genes, the most probable path is given by the MAP estimate:

    \arg\max_{y_{s_t:t}} p(y_{s_t:t} | x) = \arg\max_{y_{s_t:t}} p(x | y_{s_t:t}) \, p(y_{s_t:t})    (9)

The Viterbi algorithm (VA) is a dynamic programming procedure that determines the best path incrementally. Let \delta_m(t') be the probability of the most probable path ending at gene g_m with observation x_{m,t'} at time t'. Then the best path at the next iteration is found as

    \delta_m(t'+1) = b_m(k) \max_l \delta_l(t') \, a_{l,m}    (10)
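A minimal log-space sketch of this recursion (again, not the authors' code) is given below; it assumes the transition matrix a and emission matrix b of Eqs. (7)-(8) have already been estimated and that observed states have been remapped from {-1, 0, 1} to column indices 0..d-1.

```python
import numpy as np

def viterbi(a, b, obs, start, end):
    """Most probable hidden-gene path (Eq. (10)), computed in log-space.

    a     : (n, n) transition matrix a[l, m] (Eq. (7))
    b     : (n, d) emission matrix b[l, k]   (Eq. (8))
    obs   : observed discrete levels (remapped to 0..d-1) for t' = s_t..t
    start : gene index pinned at the start of the path (y_{s_t} = g_i)
    end   : gene index pinned at the end of the path   (y_t = g_j)
    """
    n, L = a.shape[0], len(obs)
    logA, logB = np.log(a + 1e-12), np.log(b + 1e-12)
    delta = np.full((L, n), -np.inf)   # delta[t', m] as in Eq. (10)
    psi = np.zeros((L, n), dtype=int)  # backpointers
    delta[0, start] = logB[start, obs[0]]      # path is pinned at g_i
    for t in range(1, L):
        for m in range(n):
            scores = delta[t - 1] + logA[:, m]
            psi[t, m] = int(np.argmax(scores))
            delta[t, m] = logB[m, obs[t]] + scores[psi[t, m]]
    path = [end]                               # backtrace from pinned g_j
    for t in range(L - 1, 0, -1):
        path.append(psi[t, path[-1]])
    return delta[L - 1, end], path[::-1]
```

Dividing the returned log-probability by the path length t - s_t yields the normalized skip-edge score of Eq. (11) in the next section.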

We can divide the path probability by the length of the path to obtain a first-order probability as a goodness of fit of the path. Hence, the skip-edge score is the normalized MAP interaction:

    h(x_i, a_i, s_t, t) = \frac{1}{t - s_t} \log \max_{y_{s_t:t}} p(y_{s_t:t} | x)    (11)

where the parent set a_i has only one gene, at time point s_t. Finally, for any pair of genes, the best time-delayed interaction is the one of highest probability:

    \hat{h}(x_i, a_i, s_t, t) = \max_{(s_t, t) : t - s_t > o} h(x_i, a_i, s_t, t)    (12)

where g_j in a_i and o is the predefined linear order. The corresponding delay is accepted if \hat{h}(x_i, a_i, s_t, t) exceeds a predetermined threshold.

2.5. Linear and skip features

In order to handle both short- and long-delay interactions, we model gene regulations with a linear model and a skip model. The skip-chain model is illustrated in Fig. 1, where linear features are drawn as dashed lines and skip features as solid bold lines. The two types of features enter the likelihood of a gene expression x_i as a weighted sum of linear and skip-edge scores. For gene g_i:

    \log p(x_i | a_i, \theta_i) \propto \lambda f(x_i, a_{i(t-o:t)}, t) + (1 - \lambda) h(x_i, a_i, s_t, t)    (13)

where f(x_i, a_{i(t-o:t)}, t) and h(x_i, a_i, s_t, t) are the linear- and skip-feature functions, and \lambda is a weight determined heuristically. Linear-chain feature functions f represent local dependencies consistent with an o-order Markov assumption on gene expressions. Skip-chain feature functions h exploit dependencies between genes that are arbitrarily distant in time, at instances s_t and t respectively (Galley, 2006); they can model a variable-length Markov chain of order up to T - 1. We use a DBN to implement the linear-chain model and a first-order HMM to implement the skip-chain model.

The optimal delays can be found by using a GA to optimize the likelihood. For an o-order HDBN there are o^{|a_i|} structural possibilities for each gene, where |a_i| is the cardinality of the parent set, so the search space, and hence the complexity of finding the delays, is very large. Skip models, in contrast, use the VA to find the optimal delay and its associated probability; the complexity of the VA is only quadratic in the length of the delay (Cappé et al., 2005), much smaller than that of a GA search.

3. Implementation using a genetic algorithm

A genetic algorithm (GA) is used to find the optimal network structure. The connectivity structure is given by the connection matrix {c_{i,j}}_{n×n}, where c_{i,j} denotes the delay with which gene g_i regulates gene g_j. The network is initialized using mutual information (Zhengzheng and Dan, 2006), and the order of genes is randomized during initialization for each individual. The cost function for finding the delays is the linear combination of linear and skip features given in Eq. (13), evaluated as sketched below. The Bayesian score of the graph at low orders is calculated using Eq. (5). To account for longer delays, for any two genes g_i and g_j with c_{i,j} > 1, we choose the highest Viterbi score among all possible interaction features \hat{h}(x_i, a_i, s_t, t); if the highest score is below a threshold, no skip-edge is added. The GA optimizes the structure search by keeping a population of candidate connectivity structures, and a random interpolation weight \lambda < 1 is appended to each individual.
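As an illustration, the combined cost of Eq. (13) for one GA individual might be evaluated as in the following minimal sketch. The helper functions linear_score and best_skip_score (standing in for Eq. (5) and Eq. (12)) and all other names are our own assumptions, not the authors' implementation.

```python
def individual_cost(x, C, lam, linear_score, best_skip_score, threshold):
    """Weighted linear + skip cost of Eq. (13) for one GA individual.

    x    : (n, T) discretized expression matrix
    C    : (n, n) delay matrix; C[i, j] = delay of g_i regulating g_j (0 = no edge)
    lam  : interpolation weight lambda, 0 <= lam < 1
    """
    n = C.shape[0]
    total = 0.0
    for j in range(n):
        linear_parents = [i for i in range(n) if C[i, j] == 1]
        skip_parents = [i for i in range(n) if C[i, j] > 1]
        # Linear-chain term f(.): count-based DBN score, cf. Eq. (5)
        total += lam * linear_score(x, j, linear_parents)
        for i in skip_parents:
            # Skip-chain term h(.): best normalized Viterbi score, cf. Eq. (12)
            h = best_skip_score(x, i, j, C[i, j])
            if h > threshold:  # keep only sufficiently strong skip-edges
                total += (1.0 - lam) * h
    return total
```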
The GA then finds the best structure, with the highest posterior probability, over different combinations of the linear score, the skip score, and \lambda. Crossover and mutation introduce changes in the structure: a crossover swaps several rows and the weights between two parents, possibly resulting in lower-energy structures, while a mutation alters a single cell. We run the GA for Q generations, or stop early if the change in score is less than q_1 for 20 consecutive generations. As low-lying structures can easily dominate the others, leading to premature convergence of the search, a minimum similarity threshold of p_s > 0.7 is maintained in each generation. The parameters of the GA were found heuristically for the maximum likelihood of the structure.

4. Experiments and results

We evaluated our method on time-series gene expressions of yeast cell-cycle data obtained from Chou et al. (1998) (17 time points) and Spellman et al. (1998) (24 time points, cdc-15 cell-cycle arrest). The yeast cell-division cycle consists of four main phases: genome duplication (S phase) and nuclear division (M phase), separated by two gap phases (G1 and G2); S-G2-M-G1-S thus forms a cycle of cell duplication. The expression values were normalized and discretized into 1 for up-regulation, 0 for un-regulation, and -1 for down-regulation, using an approach described earlier (Shmulevich and Zhang, 2002).

Fig. 1. State transition diagram illustrating six time points and four genes in a DBN. The states {-1, 0, 1} represent down-, un-, and up-regulated genes. The dashed edges are linear edges of order o = 1, 2 found by linear features; the solid edge represents a skip-edge over four time points.
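As a toy illustration of this discretization step, the sketch below assumes a simple symmetric threshold tau on normalized values; the paper itself uses the optimization-based normalization of Shmulevich and Zhang (2002), which this does not reproduce.

```python
import numpy as np

def discretize(x, tau=0.5):
    """Map normalized expression values to {-1, 0, 1}.

    A value above +tau is called up-regulated (1), below -tau
    down-regulated (-1), and un-regulated (0) otherwise. This simple
    symmetric threshold stands in for the optimization-based scheme
    of Shmulevich and Zhang (2002) used in the paper.
    """
    return np.where(x > tau, 1, np.where(x < -tau, -1, 0))
```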

We use the Chou dataset on nine genes to demonstrate the control of the sequential activation of cyclins and other cell-cycle regulators (Zhengzheng and Dan, 2006). A small subset of nine genes was used because validation against protein-protein interaction data is practically impossible for large datasets. Similarly, the Spellman data was used on subsets of genes in different phases.

A GA is used to find the optimal structure. Simulations were run with an HDBN of up to order four and a skip model with a maximum skip-edge length of 10 time points. We plotted the histogram of mutual information for each pair of genes at different time delays; the peak of the histogram was taken as the threshold for the presence of regulatory connections. The parameters of the GA and the skip-edge weight were determined empirically for the best structure, as given by the likelihood. Simulations were run at different numbers of individuals (N = 200/300/400) and generations (Q = 300/400/500) for both the HDBN and the skip-chain model. The GA stops when the maximum number of generations is reached, or when the score difference stays below a set value for 20 consecutive generations. Edges with probability over 0.7 across 20 runs of the GA are kept in the final network; means and standard deviations were reported. We observe that the optimal \lambda found by the GA is larger for small networks, where the probability of a skip-edge is low, and smaller for large networks, where skip interactions become more probable owing to longer cascades of genes.

As seen from Table 1, the HDBN of order four and the skip-chain of order 1:10 have the highest likelihood on all datasets, confirming that the network fits the expression data well. The HDBN shows a peak of interactions at delays 1 and 4, indicating that most interactions are first-order or instantaneous and that fourth order may be insufficient to capture all higher-order interactions. To validate our results further, we examine cascades of genes in the GRN, which correspond to interactions in the PPI network. On a subset of 19 S-phase genes for which interactions are available in BioGrid, our model clearly yields more true positives than the DBN or HDBN. True positives are relatively low overall because not all PPI are present in the database; they nevertheless serve as an indicator for comparing models. We also see that the method is robust to an increase in the number of genes, as most predicted interactions tend to show delays; larger networks, such as S (36) with 52 interactions, had several delays of eight time points.

We also investigated hub nodes with a high degree of connectivity, which usually represent important nodes in causal networks. Table 2 lists the top 10 hubs of the networks derived by the different methods for the 19 genes in S phase, together with the corresponding hubs in the BioGrid target network. The top core genes produced by all methods appear to be the same, and the core genes produced by the DBN and our method were quite similar. Further comparison of the top 10 hubs predicted by the DBN, the HDBN, and the skip-chain model (Table 2) against the Saccharomyces Genome Database (SGD) showed that while the DBN had hubs involved in instantaneous events such as initialization and silencing, the time-delayed hubs in the HDBN were mostly regulatory or feedback associated. For example,
KIP1 (mitotic spindle assembly), MET6 (methionine synthesis), and MSB1 (suppressor of budding) emerge in the DBN.

Table 1. First- and higher-order regulations predicted by the DBN (d), HDBN (h), and skip-chain (d:h) models built on the datasets of Chou et al. (nine genes) and Spellman et al. (cycle S-G2-M). The table reports, for each dataset, the number of genes, the number of connections at different delays, the total number of edges, the true positives against protein-protein interaction data, and the likelihood, for models d (1), h (4), and d:h (1:10) over datasets Chou (9), S (19), S (36), G2 (33), and M (60). [The numeric entries of this table did not survive transcription.]

Table 2. Top 10 hubs obtained for the 19 genes in the S phase of the yeast cell cycle, ranked in order of their connectivity. The hubs obtained from BioGrid are given for comparison.

Model (o)     Genes ranked by connectivity
BioGrid       HHF1, HTA1, HHT1, HTB2, HTA2, HTB1, HHT2, HHF2, ADA2, CIN8
d (1)         HTB1, HHF2, HTA1, HHT2, HHT1, MET6, HTB2, KIP1, MSB1, TOF2
h (4)         TOF2, HTB1, HHT2, HHF2, HHF1, HTA1, ADA2
d:h (1:10)    HTB1, HHF2, KAR9, HHT1, MET6, HHT2, HTA2, DFG5, HTB2, TOF2
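Hub lists such as those in Table 2 can be read off a predicted network by ranking genes by their degree. A minimal sketch, assuming the delay matrix {c_{i,j}} of Section 3 (all names here are ours):

```python
import numpy as np

def top_hubs(C, gene_names, k=10):
    """Rank genes by connectivity in the predicted network.

    C : (n, n) connection matrix with C[i, j] > 0 iff g_i regulates g_j
        (entries hold the predicted delay, as in Section 3).
    """
    A = (C > 0).astype(int)
    degree = A.sum(axis=0) + A.sum(axis=1)   # in-degree + out-degree
    order = np.argsort(degree)[::-1]
    return [(gene_names[i], int(degree[i])) for i in order[:k]]
```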

In contrast, TOF2 (stimulates phosphatase activity) and ADA2 (a coactivator) are seen in the HDBN. Our skip-chain model showed a combination of both types of regulation, such as KAR9 (positioning and orientation of the spindle) and DFG5 (a membrane protein for cell-wall formation in buds). As seen in Table 2, some hubs, such as HHF1, HTB1, HTA2, and HHT1, or their homologs, are conserved across all the models; these are histones required to initiate duplication through chromatin assembly and chromosome function. ADA2 and CIN8 (spindle assembly) were false positives when compared with BioGrid.

5. Discussion and conclusion

Pathways are often triggered by transcription factors, which in turn express genes and produce proteins; the regulatory interactions in molecular pathways can therefore be given by GRN. Gene regulation generally involves dynamic feedback loops, cascaded interactions, intermediary factors, and other underlying biological mechanisms, resulting in different time delays in regulatory interactions. We considered higher-order DBN (HDBN) for representing delays in regulation; however, when large delays are involved, implementation of HDBN becomes intractable. Therefore, we proposed a skip-chain HDBN with two components: a linear model to represent short delays and a skip model to represent long delays. These two components may represent the actions of activators and inhibitors involved in regulatory interactions. Our method was evaluated against earlier approaches and fits the gene expression data better when the GRN is built. To provide a more biologically meaningful validation, we also compared against protein-protein interaction data; that validation likewise showed the superiority of our technique over other methods, although, because of the incompleteness of PPI data sources, such comparisons result in large numbers of false positives.

Skip-chain models address the difficulties of an HDBN by easily incorporating long time-delayed regulations. The skip model is a first-order HMM and captures long-distance dependencies in the input time-course gene expressions. This inference technique lowers computational time without loss of accuracy compared to an HDBN. The forward Viterbi path through the trellis determines the best long-distance time delay and therefore automatically finds the best higher-order interactions between genes. The method gives more accurate and biologically plausible networks than DBN implementations, as short and long time delays are inherent in biological interactions.

References

Breitkreutz, B.-J., Stark, C., Reguly, T., Boucher, L., Breitkreutz, A., Livstone, M., Oughtred, R., Lackner, D.H., Bähler, J., Wood, V., Dolinski, K., Tyers, M., 2008. The BioGRID interaction database: 2008 update. Nucl. Acids Res. 36 (Suppl. 1), D637-D640.
Cappé, O., Moulines, E., Rydén, T., 2005. Inference in Hidden Markov Models. Springer.
Chaturvedi, I., Rajapakse, J., 2009. Detecting robust time-delayed regulation in Mycobacterium tuberculosis. BMC Genomics 10 (Suppl. 3), S28.
Chou, R., Campbell, M., Winzeler, E., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T., et al., 1998. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell 2.
Friedman, N., Linial, M., Nachman, I., Pe'er, D., 2000. Using Bayesian networks to analyze expression data. J. Comput. Biol. 7 (3-4).
Friedman, N., Murphy, K., Russell, S., 1998. Learning the structure of dynamic probabilistic networks. In: Proc. 14th Annual Conf. on Uncertainty in Artificial Intelligence (UAI-98).
Galley, M., 2006. A skip-chain conditional random field for ranking meeting utterances by importance. In: Proc. 2006 Conf. on Empirical Methods in Natural Language Processing (EMNLP 2006).
Gebert, J., Motameny, S., Faigle, U., Forst, C.V., Schrader, R., 2008. Identifying genes of gene regulatory networks using formal concept analysis. J. Comput. Biol. 15 (2).
Li, P., Zhang, C., Perkins, E.J., Gong, P., Deng, Y., 2007. Comparison of probabilistic boolean network and dynamic bayesian network approaches for inferring gene regulatory networks. BMC Bioinform. 8, S13-S20.
Liu, B., Thiagarajan, P., Hsu, D., 2009. Probabilistic approximations of signaling pathway dynamics. In: Computational Methods in Systems Biology.
Shmulevich, I., Zhang, W., 2002. Binary analysis and optimization-based normalization of gene expression data. Bioinformatics 18 (4).
Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., Futcher, B., 1998. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 9 (12).
Wagner, J., Stolovitzky, G., 2008. Stability and time-delay modeling of negative feedback loops. Proc. IEEE 96 (8).
Zhengzheng, X., Dan, W., 2006. Modeling multiple time units delayed gene regulatory network using dynamic Bayesian network. In: 6th IEEE Internat. Conf. on Data Mining Workshops (ICDM Workshops 2006).
