Computation-Based Discovery of Cis-Regulatory Modules by Hidden Markov Model

Jing Wu and Jun Xie
Department of Statistics, Purdue University
150 N. University Street, West Lafayette, IN 47907
Tel: 765-494-6032
Fax: 765-494-0558
Email: junxie@stat.purdue.edu
June 13, 2006

Running head: Discovery of CRM by HMM

Key Words: Bayesian inference; cis-regulatory module; hidden Markov model; transcription factor binding site.

Abstract

A key component of genome sequence analysis is the identification of regions of the genome that contain regulatory information. In higher eukaryotes, this information is organized into modular units called cis-regulatory modules. Each module contains multiple binding sites for a specific combination of several transcription factors. In this article, we propose a hidden Markov model (HMM) to identify transcription factor binding sites (TFBSs) and cis-regulatory modules (CRMs). For a given genomic sequence, we first select potential TFBSs from a large database (e.g., TRANSFAC), then construct an HMM in which TFBSs are counted only when they occur within a specialized CRM state. A novel feature of the proposed method is that it does not assume a small set of TFBSs for a given gene. Instead, the method uses information from a large collection of well-characterized TFBSs and is therefore computationally more efficient and more robust than de novo methods. Our approach is applied to three data sets with experimentally evaluated TFBSs. The method shows better specificity and sensitivity than other similar computational tools in identifying CRMs and TFBSs. The executable code of our programs and module predictions across the Drosophila genome are available at http://www.stat.purdue.edu/~jingwu/module/.

1 Introduction

Gene expression is regulated through the interaction of transcription factors (TFs) with specific cis-regulatory DNA sequences. Recent studies suggest a modular organization of binding sites for transcription factors (Yuh et al. 1998). More specifically, the expression level of a gene is determined by the cooperation of several TFs, whose sites are organized in a modular unit called a cis-regulatory module (CRM). A major challenge in analyzing genomic sequences is characterizing functional combinations of TF binding sites. Identifying clusters of TF binding sites will aid our understanding of the rules governing gene regulation.

A TF binding site motif is commonly modeled by a position weight matrix (PWM). A PWM is constructed from an alignment of known sites, and its columns define the probabilities of observing each nucleotide at each position. Using PWMs to scan genomic sequences, several computational and statistical methods have been developed to identify pairs of TFs that tend to co-occur in close proximity (Elkon et al. 2003). In addition, approaches for predicting clusters of TFBSs have been proposed, including the logistic regression method of Wasserman and Fickett (1998) and a program called CREME (Sharan et al. 2003). These programs treat motifs as a cluster if they lie within a window of a certain length. They search for PWMs one by one and therefore do not attempt to model the modular structure of TFBSs.

More recently, hidden Markov models (HMMs) have been employed to detect CRMs (Crowley et al. 1997; Frith et al. 2001; Bailey and Noble 2003; Rajewsky et al. 2002; Sinha et al. 2003). HMMs, which were popularized by speech recognition, provide a natural stochastic mechanism for generating the observed sequences from an underlying model with clusters of binding site motifs. While numerous algorithms already exist, they have several limitations. First, most algorithms focus on a small number of known TFBSs whose PWMs must be clearly defined before the search. Second, all the algorithms require the user to specify a module window size for the search. Formally, a genomic sequence is parsed into a series of overlapping windows of length $l$, whose starting positions differ by a specified step size $\delta$. The windows with high scores are predicted as CRMs. Since CRMs are known to vary greatly in size, these HMM algorithms could be improved by allowing for variable window lengths. We also want to eliminate the extra parameter $\delta$, which could affect predictions.

There is another set of de novo approaches, including the Gibbs motif sampler (Lawrence et al. 1993), AlignACE (Roth et al. 1998), BioProspector (Liu et al. 2001) and CisModule (Zhou and Wong 2004). These programs do not use any known transcription factor PWMs but try to discover conserved sequences de novo. Though novel motifs may be found, these programs depend on a good collection of multiple genomic sequences that are assumed to contain a common TF binding site motif. Moreover, the identified sequence motifs may not match any known TFBS and therefore do not provide direct binding information. In this sense, the de novo approaches could be complemented by using available TFBS patterns. Standard Gibbs sampling approaches were developed for aligning single TF motifs (Lawrence et al. 1993; Liu et al. 2001). Zhou and Wong (2004) extended the method to obtain a Bayesian hierarchical model of cis-regulatory modules. Their program CisModule can provide simultaneous inference of modules and TFBSs for appropriate sequence sets. Nevertheless, without the prior information of known PWMs, CisModule estimates a large set of parameters and therefore often encounters convergence problems. Moreover, CisModule searches for TF motifs within a window of a given module size, e.g. 100 or 150 base pairs. Determining the best module length for a data set can be difficult. CisModule offers the alternative of using a Metropolis update to determine the module length. However, once decided, the length is fixed for all modules.

In this paper we develop a method that combines the good features of HMMs with the flexibility of Bayesian inference for identifying CRMs and individual TFBSs. We attempt to improve on both the existing HMM methods and the de novo methods through an intermediate approach. First, we select potential TFBSs for each genomic sequence from a large pool of available PWMs, for instance, the TRANSFAC database (Wingender et al. 2000).
TRANSFAC is the most complete database of carefully evaluated binding sites, containing over 400 experimentally derived PWMs. Using all PWMs in TRANSFAC, we systematically scan each genomic sequence in the data set. The statistically significant PWMs are then used to construct the HMM architecture.

Another feature of our HMM is a hidden state for modules, in addition to the states for TFBS motifs. A module consists of several TFBSs and the background sequence connecting them. Because of the module state, only TFBSs that occur within a module are counted. In addition, the module locations and lengths are determined automatically, so each module has the length most appropriate for the genomic sequence. Furthermore, our HMM is estimated by Bayesian inference through sampling. The emission probabilities of the HMM are assumed known from the selected PWMs. The remaining unknown parameters are a small set of transition probabilities, which are estimated from the posterior distribution conditional on the observed genomic sequences. Using the simulated samples, we detect the CRMs and TFBSs that have high posterior probabilities.

A cis-regulatory module is typically located in the genomic sequence near the genes it regulates. Therefore, we focus our applications on promoter sequences, which lie within several hundred bp of the transcription start site (TSS), for instance, 700 bp upstream and 200 bp downstream of the TSS. The proposed approach also works well on long genomic sequences. One of our examples predicts CRMs in the whole genome of the fly Drosophila. We searched for embryo developmental cis-regulatory elements in genomic sequences 24 kb long. The extra module state defined in our HMM automatically predicts CRMs as islands in the long sequences. Our HMM structure thus shows an advantage over existing HMMs, which consist of TFBS states only. Our HMM approach was applied in two additional examples: promoter sequences of E2F target genes and promoter sequences of muscle-specific regulatory regions.
The applications showed that our method improved predictions of motifs and CRMs and was computationally efficient.

2 Method

2.1 Selecting candidate PWMs

The structure of an HMM depends on a set of specified PWMs that correspond to all possible TFBSs for a given sequence. In our approach, the possible TFBSs are systematically selected from a database of PWMs (e.g. TRANSFAC). Specifically, for a given sequence, we choose the PWMs whose scores are significant compared with the scores obtained by chance. For a single sequence $X = \{x_1, \ldots, x_L\}$, we consider both the forward strand and its reverse complement when selecting PWMs. Denote the reverse complement of $X$ as $\{x_{L+1}, \ldots, x_{2L}\}$. Let $P$ be a given PWM with width $w$. At position $i$ in the extended sequence, the log odds score of the PWM is defined as

\[ s(i) = \log \frac{P(x_i \ldots x_{i+w-1})}{Q(x_i \ldots x_{i+w-1})}, \]

where $Q$ is a probability matrix for the background. We then define the score $S$ as the maximum value of the log odds score of the PWM:

\[ S = \max_{i = 1, \ldots, 2(L-w+1)} s(i). \tag{1} \]

Under the null hypothesis that there is no instance of the transcription factor binding site $P$, the null distribution of the score $S$ can be estimated by Monte Carlo simulation from a set of randomly selected sequences of length $L$. More specifically, for the given PWM $P$, we first calculate the maximum score $S_0$ in the observed sequence $X$ by Equation (1). Then, for a set of randomly selected sequences, say $B = 200$ random sequences, each with the same length as the observed sequence, the maximum scores of the PWM are calculated as $S_b$, $b = 1, \ldots, B$. The p-value of the score $S_0$ is estimated by

\[ p = \frac{\sum_{b} I(S_b > S_0)}{B}, \]

where $I(\cdot)$ is the indicator function. The PWMs from a database (e.g. TRANSFAC) with p-values less than 0.05 are selected for each genomic sequence. This typically results in $K \approx 20$ PWMs for a sequence of length 900.

2.2 Hidden Markov model

A genomic sequence can be considered as an observation from an HMM whose hidden path indicates the modules and the within-module TFBSs. We introduce the hidden states, the transition probabilities, the emission probabilities, and the likelihood in turn.

First, the module and the outside background are defined by two hidden states in the model, i.e. the two circles in Figure 1. Suppose there are at most $K$ possible different TFs in a given sequence. Then $K$ additional hidden states are used to indicate the TFBSs. As a simple example, Figure 1 illustrates an HMM with $K = 3$ possible TFBS states, denoted by PWMs, while states $K+1$ and 0 indicate the background inside and outside of modules, respectively. The hidden path of a module consists of several TFBSs from the set of $K$ possible PWMs, connected by background sequence in state $K+1$. The modules are in turn connected by sequences in state 0 to give the hidden path of the whole genomic sequence.

The transition probabilities between the hidden states are treated as unknown parameters. The notation is as follows. At each position of the observed sequence, given that the previous state is the background (state 0), there is a probability $r$ of initiating a module and a probability $1 - r$ of staying in the background. If a module starts at a position, its hidden state becomes $K+1$, i.e. the background inside the module, and a series of transitions then happens within the module. From state $K+1$, there is a probability $q_k$ of initiating a site of the $k$th motif, in which case the following $w_k$ positions have the hidden state PWM$_k$, where $w_k$ is the width of the $k$th motif. In addition, there is a probability $q_{K+1}$ of staying in the module background state $K+1$ and a probability $q_0$ of terminating the module, that is, changing from the module to the outside background state 0. The transition probabilities satisfy $\sum_{k=0}^{K+1} q_k = 1$. Figure 1 shows the structure of the hidden Markov chain with the transition probabilities indicated at the arrows.

Figure 1: A simplified hidden Markov model with K = 3 possible TFBSs. State 0: background outside of modules; State K + 1: background inside a module. Transition probabilities are indicated at the arrows.

The emission probabilities are assumed known. Each PWM$_k$ inside a module is selected from a database (e.g. TRANSFAC), so its position-specific probabilities $\Theta_k$ and motif width $w_k$ are given. For the regions outside the modules and the background regions within the modules, i.e. states 0 and $K+1$ respectively, the emission probability comes from a first-order Markov chain with parameter $\theta_0$. This parameter can be easily estimated from the observed sequence data.

Under the HMM, the complete sequence likelihood is defined as follows. For notational simplicity, we consider only a single observed sequence. For a group of sequences, we assume the sequences are independent. The joint probability is therefore the product of the probabilities of the individual sequences, and the estimation procedure is carried out one sequence at a time. Let $X$ denote a single sequence. Let $M$ be the module indicators and $A = \{A_1, \ldots, A_K, A_{K+1}\}$ the TFBS indicators, where $A_k$, $k = 1, \ldots, K$, is the indicator of the binding sites for the $k$th motif and $A_{K+1}$ is the indicator of the non-site background sequence inside the modules. We use $X(M^c)$ to denote the background sequence outside of the modules. Denote $\Theta = \{\theta_0, \Theta_1, \ldots, \Theta_K\}$, where $\theta_0$ is the background parameter of the first-order Markov model and $\Theta_k$ is the $k$th PWM. Then, given $\Theta$, $q = \{q_0, q_1, \ldots, q_K, q_{K+1}\}$ and $r$, the complete sequence likelihood with $M$ and $A$ given is

\[ P(X, M, A \mid \Theta, q, r) = P(M, A \mid r, q)\, P(X(M^c) \mid \theta_0, M)\, P(X(M) \mid M, A, \Theta). \tag{2} \]

Note that the parameters $q$ and $r$ are unknown but the emission probability parameters $\Theta$ are given.

2.3 Bayesian inference

The HMM with its unknown parameters $r$ and $q$ can be estimated by the standard Baum-Welch algorithm (Baum 1972). We take an alternative approach, however, through a Bayesian framework. Combining Equation (2) with prior distributions for the parameters $q$ and $r$ gives the joint posterior distribution

\[ P(M, A, q, r \mid X, \Theta) \propto P(X, M, A \mid \Theta, q, r)\, \pi(q)\, \pi(r), \tag{3} \]

where $\pi(q)$ is the conjugate prior distribution for $q$, a Dirichlet distribution with parameter $\alpha = \{\alpha_0, \ldots, \alpha_K, \alpha_{K+1}\}$, and $\pi(r)$ is Beta$(a, b)$. By simulating from the posterior distribution, we obtain the estimated parameters $\hat{q}$, $\hat{r}$ and consequently the hidden path $\hat{M}$, $\hat{A}$. The most probable hidden path provides the optimal locations of the modules and the binding sites.
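To make the conjugacy concrete, the path probability $P(M, A \mid r, q)$ in Equation (2) can be written in terms of counts along the hidden path. The expression below is only a sketch: $|M|$ denotes the number of modules, $l_j$ the length of module $j$, $|A_k|$ the number of sites of the $k$th motif, and $|A_{K+1}|$ the number of non-site background positions inside modules, and it assumes that boundary effects at the ends of the sequence and the exact bookkeeping of within-module background transitions can be ignored. It is not taken from the authors' derivation.

```latex
\[
P(M, A \mid r, q) \;\approx\;
r^{|M|}\,(1-r)^{\,L-\sum_{j=1}^{|M|} l_j}\;
q_0^{|M|}\,\prod_{k=1}^{K+1} q_k^{|A_k|}
\]
% Multiplying by the Dirichlet(\alpha) kernel \prod_{k=0}^{K+1} q_k^{\alpha_k - 1}
% and the Beta(a, b) kernel r^{a-1}(1-r)^{b-1} keeps the same functional form:
\[
q \mid M, A \sim \mathrm{Dirichlet}\bigl(|M|+\alpha_0,\ |A_1|+\alpha_1,\ \ldots,\ |A_{K+1}|+\alpha_{K+1}\bigr),
\qquad
r \mid M \sim \mathrm{Beta}\Bigl(|M|+a,\ L-\sum_{j=1}^{|M|} l_j+b\Bigr).
\]
```

Under these assumptions, each module contributes one factor $r$ (initiation) and one factor $q_0$ (termination), which is why $|M|$ appears in the first parameter of both conditional distributions.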

The Bayesian inference of the unknown parameters and the hidden path is carried out by iterative sampling. First, the initial values of $q$ and $r$ are generated from the prior distributions, Dirichlet $\pi(q)$ and Beta $\pi(r)$ respectively. Then the following two steps repeatedly update the parameters.

1. Sampling M and A given q and r. To simulate the hidden path of $M$ and $A$, the forward algorithm for HMMs (Durbin et al. 1998) is first used to calculate the marginal probability of $X$, summing over all possible hidden paths. Consider a sequence $X = \{x_1, \ldots, x_L\}$. For a partial sequence $\{x_1, \ldots, x_m\}$, $m \le L$, let $h_m(k)$ be the probability of the observed sequence, requiring that the last residue $x_m$ is in state $k$:

\[ h_m(k) = P(\{x_1, \ldots, x_m\},\ \text{state of } x_m = k \mid \Theta, q, r), \quad k = 0, 1, \ldots, K, K+1. \]

If $k$ is a motif, then the last residue is the last position of the motif. Note that $P(X \mid \Theta, q, r) = \sum_{k=0}^{K+1} h_L(k)$. With the HMM shown in Figure 1, the recursion formulae are given by

\[
\begin{aligned}
h_m(0) &= (1-r)\, P(x_m \mid x_{m-1}, \theta_0)\, h_{m-1}(0) + q_0\, P(x_m \mid x_{m-1}, \theta_0)\, h_{m-1}(K+1), \\
h_m(K+1) &= r\, P(x_m \mid x_{m-1}, \theta_0)\, h_{m-1}(0) + \sum_{k=1}^{K} h_{m-1}(k) + q_{K+1}\, P(x_m \mid x_{m-1}, \theta_0)\, h_{m-1}(K+1), \\
h_m(k) &= q_k\, P(\{x_{m-w_k+1}, \ldots, x_m\} \mid \Theta_k)\, h_{m-w_k}(K+1), \quad k = 1, \ldots, K,
\end{aligned}
\tag{4}
\]

where $P(\{x_{m-w_k+1}, \ldots, x_m\} \mid \Theta_k)$ is the probability of generating the segment from the $k$th motif model PWM$_k$. The initial conditions of Formula (4) are $h_0(0) = 1$ and $h_m(k) = 0$ for $k = 1, \ldots, K+1$ and $m \le 0$.
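The forward recursion above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' program: the function name `forward`, the integer encoding of nucleotides (0 to 3), the list-of-rows PWM layout, and the `first` argument (the emission probability used for the first residue, which has no predecessor) are all assumptions made here. The sketch also multiplies the motif-exit term by the background emission probability of $x_m$, which is an assumption on our part; Formula (4) as printed shows that sum without an emission factor.

```python
from math import prod


def forward(x, pwms, q, r, bg, first=0.25):
    """Forward table h[m][k] for the module HMM (illustrative sketch).

    x    : sequence as a list of ints in 0..3
    pwms : list of K PWMs; pwms[k-1][j][b] = P(base b at motif position j)
    q    : list of length K+2; q[0] ends a module, q[k] starts motif k,
           q[K+1] stays in the within-module background
    r    : probability of starting a module from state 0
    bg   : 4x4 first-order background matrix, bg[prev][cur]
    first: assumed emission probability for the first residue
    """
    K, L = len(pwms), len(x)
    h = [[0.0] * (K + 2) for _ in range(L + 1)]
    h[0][0] = 1.0  # initial condition: h_0(0) = 1, all other entries 0
    for m in range(1, L + 1):
        # background emission of x_m given x_{m-1}
        e = first if m == 1 else bg[x[m - 2]][x[m - 1]]
        prev = h[m - 1]
        h[m][0] = (1 - r) * e * prev[0] + q[0] * e * prev[K + 1]
        # motif-exit term; the factor e is an assumption of this sketch
        motif_exit = e * sum(prev[1:K + 1])
        h[m][K + 1] = r * e * prev[0] + motif_exit + q[K + 1] * e * prev[K + 1]
        for k in range(1, K + 1):
            pwm = pwms[k - 1]
            w = len(pwm)
            if m >= w:
                # probability of the segment x_{m-w+1..m} under PWM k
                seg = prod(pwm[j][x[m - w + j]] for j in range(w))
                h[m][k] = q[k] * seg * h[m - w][K + 1]
    return h
```

The marginal probability of the sequence is `sum(h[-1])`. Raw probabilities underflow on sequences of realistic length (hundreds of bp or more), so a practical implementation would rescale each row or work with logarithms; that is omitted here to keep the correspondence with Formula (4) visible.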

With all the values $h_m(k)$ calculated, backward sampling is used to sample $M$ and $A$ as follows. Starting from $m = L$, at position $m$, we generate the hidden state of $x_m$ as either the background outside of modules, i.e. state 0, or the last position of a module, i.e. one of the states $k = 1, \ldots, K, K+1$, with probabilities proportional to the corresponding $h_m(0)$, $h_m(k)$ and $h_m(K+1)$ in Formula (4). Once we have generated the last position of a module, suppose we are at position $m$ in the current module while moving backward. The sequence segment $\{x_{m-w_k+1}, \ldots, x_m\}$, $k = 1, \ldots, K, K+1$, is drawn as a background letter ($w_{K+1} = 1$) or as a site for one of the $K$ motifs with probability proportional to $h_m(k)$ and $h_m(K+1)$ in Formula (4). Depending on the generated state, we move to the preceding position and repeat the sampling procedure until we reach the first position of the sequence.

2. Sampling q and r given M and A. Given the current sample $M$ and $A$, denote the total number of modules by $|M|$, the length of each module by $l_j$, $j = 1, \ldots, |M|$, and the number of sites for the $k$th motif by $|A_k|$. Then $|A_{K+1}| = \sum_{j=1}^{|M|} l_j - \sum_{k=1}^{K} |A_k| w_k$ is the total length of the non-site background segments within the modules. It follows that $[q \mid M, A]$ is Dirichlet$(|M| + \alpha_0, |A_1| + \alpha_1, \ldots, |A_K| + \alpha_K, |A_{K+1}| + \alpha_{K+1})$. Similarly, $[r \mid M]$ is Beta$(|M| + a,\ L - \sum_{j=1}^{|M|} l_j + b)$, where $L$ is the total length of $X$.

We infer the modules from the marginal probability of $M$, i.e. the probability that each sequence position is sampled as being within a module. Specifically, the sampling procedure is repeated 1000 times. The positions where the frequency of being in a module is greater than 0.5 are predicted as modules. Similarly, the positions where the frequency of being PWM$_k$, $k = 1, \ldots, K$, is greater than a cutoff are predicted as TFBSs. Overlapping motifs are resolved by keeping the one that has the larger frequency.

3 Results

We compare our HMM with several algorithms, including the de novo methods, another program that uses the TRANSFAC database (CREME by Sharan et al. 2003), and the existing HMM approach. For clarity, we call our HMM Module HMM.

3.1 E2F Binding Sites

The E2F family of transcription factors plays a crucial and well-established role in cell cycle progression. Experimental identification of E2F-regulated genes is time consuming and expensive. Computational approaches such as our HMM method can therefore provide useful guidance for detecting E2F binding sites and hence for identifying new target genes bound by E2F. A set of experimentally proven E2F binding sites has been reported by Kel et al. (see Table 1 in Kel et al. 2001). Using the gene names provided there, we collected 17 human promoter sequences from the genome database at UC Santa Cruz (genome.ucsc.edu). Each promoter sequence is 900 bp long, with 700 bp upstream and 200 bp downstream of the transcription start site. These 17 human promoter sequences contain 21 experimentally verified E2F sites according to Kel et al. (2001).

The Module HMM method was applied to each of the 17 sequences. First we determined the set of possible TFBSs for each sequence. All the PWMs in TRANSFAC were used to scan a sequence. The most significant PWMs, with p-values less than 0.10, were selected, which gives 20 to 30 PWMs for constructing an HMM. Based on experience, the prior parameters of the Dirichlet distribution of $q$ are set as $\alpha_0 = \alpha_1 = \ldots = \alpha_K = 1$ and $\alpha_{K+1} = 100$. The prior parameters of the Beta distribution for $r$ are $a = 100$ and $b = 800$. The sampling procedure was applied 1000 times to each sequence. The regions that were sampled as modules more than 50% of the time were predicted as modules. Within modules, the most frequently occurring PWMs (out of 1000 simulations) were reported as the TFBSs.

We evaluated Module HMM by comparing it with other programs. The existing HMM algorithms (Frith et al. 2001; the Ahab algorithm in Rajewsky et al. 2002; Sinha et al. 2003) require a small set of known TFBSs contained in the sequences, which is not assumed in this example. We therefore compared our HMM approach with CREME (Sharan et al. 2003), BioProspector (BP) (Liu et al. 2001), and CisModule (Zhou and Wong 2004). Table 1 compares the true positives (a measure of sensitivity) and false positives (a measure of specificity) of Module HMM with those of each of the three programs. The true positives are the number of correctly predicted E2F sites (out of 21 true E2F sites). The false positives were obtained by applying the programs to data sets of randomly generated sequences. The random sequences were generated by a first-order Markov chain whose probability matrix was estimated from another set of randomly selected human promoter sequences in the UCSC genome database (genome.ucsc.edu). Each random sequence has the same length of 900 bp.

In Table 1, the first comparison shows that Module HMM identified 12 of the 21 true E2F sites in the known E2F binding sequences while predicting only 5 E2F sites in 100 random sequences. In contrast, CREME identified no sites at the 5% significance level. In the second comparison, of Module HMM with BioProspector, we did not assume that a CRM must contain E2F sites but considered CRMs with any type of TFBS motif.
Module HMM predicted 612 TFBSs in the 100 random sequences with a 57% true positive rate in the known E2F binding sequences, whereas BioProspector predicted 623 TFBSs in the 100 random sequences with 3 true positives. The third comparison, of Module HMM with CisModule, was on a combined data set of 37 sequences, of which 17 were the known E2F sequences and 20 were random sequences. Again, Module HMM had a false positive rate similar to that of CisModule in the random sequences (15 vs. 13 predictions of any motif in 20 random sequences) but more true positives in the known E2F sequences (7 vs. 0).

              True E2F sites predicted    Prediction rate of E2F
              in the E2F set              in 100 random sequences
Module HMM    12/21 (57%)                 5
CREME         0                           < 5%

              True E2F sites predicted    Prediction rate of any motif
              in the E2F set              in 100 random sequences
Module HMM    12/21 (57%)                 612
BP            3/21 (14%)                  623

              True E2F sites predicted    Prediction rate of any motif
              in the E2F set              in 20 random sequences
Module HMM    7/21 (33%)                  15
CisModule     0                           13

Table 1: Comparison of Module HMM, CREME, BioProspector, and CisModule. The first column shows true positives in the 17 known E2F binding sequences. The last column gives false positives in random sequences. The value 5% is the significance level of CREME.

Besides identifying the E2F sites in the data set, our HMM provides cis-regulatory module predictions. One of the predictions was confirmed by the well-studied promoter sequence of human CDC2, whose proximal TFBSs have been annotated in the upstream region of about 300 bp (Tommasi and Pfeifer 1995). There are a total of 12 annotated TFBSs (see Figure 6 in Tommasi and Pfeifer 1995), including an E2F binding site. Our HMM approach identified 10 sites, clustered into 3 modules. The result is shown in Figure 2. In the remaining regions of the sequence, more binding sites were predicted (not shown in Figure 2); these are putative TFBSs.

[Figure 2 shows the human CDC2 promoter sequence, from about 700 bp upstream to 200 bp downstream of the transcription initiation site, with the annotated sites labeled Sp1, E2F, ets-2 and CCAAT.]

Figure 2: The HMM method correctly identified TFBSs in the human CDC2 promoter. The modules are marked in square brackets and the TFBSs are in ovals. The arrow marks the transcription initiation site. Compared to Figure 6 in (Tommasi and Pfeifer 1995), ten out of twelve TFBSs were identified by the HMM method.

3.2 Muscle-Specific Regulatory Modules

Extensive studies of genes expressed in skeletal muscle have identified specific transcription factors that bind to regulatory elements to control gene expression.
The transcription factors that contribute to skeletal muscle-specific expression include Myf, Mef-2, TEF, SRF, and Sp-1 (Wasserman and Fickett 1998). Analysis of experimentally determined muscle regulatory sequences indicates that the binding sites of these TFs occur in close proximity as regulatory modules. Zhou and Wong (2004) collected a data set with 29 known regulatory sequences for skeletal muscle-specific expression and 10 random upstream sequences from the ENSEMBL database (www.ensembl.org), each 200 bp long. We applied the Module HMM method to the same data set and compared it with CisModule.

Of the 29 known regulatory sequences for muscle-specific gene expression, 22 are annotated in a database of muscle-specific regulation of transcription (http://www.cbil.upenn.edu/mtir/homepage.html). The database summarizes the published experimental evidence about muscle-specific transcriptional regulation, including the TFBSs. We used this information to verify the findings of the Module HMM method. Table 2 shows a comparison of Module HMM with CisModule. The number of true positives for each TFBS is given in the format $n_1/n_2$, where $n_2$ is the number of experimentally annotated (true) TFBSs and $n_1$ is the number of true sites predicted by a program. The false positive numbers are the numbers of sites of the muscle-specific TFBSs predicted in the 10 random sequences in the data set. A major improvement of the Module HMM method over CisModule is that it found all Myf binding sites, 16 out of 16 annotated ones, whereas CisModule (as well as BioProspector and MEME) failed to identify any. Alongside this high true positive rate, Module HMM predicted 5 (false) Myf sites in the random sequences. For the other TFs, the Module HMM results were better than the results of CisModule as reported in Zhou and Wong (2004), with higher true positive and lower false positive rates, except for TEF, where Module HMM had more false positives (5 vs. 0). Moreover, the Module HMM method produced similar results in repeated simulations.

In contrast, CisModule gave different results in different simulations.

                               Myf      Mef-2    TEF      SRF
True positive    Module HMM    16/16    10/10    9/10     13/13
                 CisModule     0        9/10     7/10     6/13
False positive   Module HMM    5        2        5        1
                 CisModule     5 0 0

Table 2: Comparison of our HMM method with CisModule on identifying muscle-specific regulatory regions. The results of CisModule are cited from (Zhou and Wong 2004).

3.3 Regulatory sequences of Drosophila melanogaster

As a comparison with the existing HMM algorithms (Rajewsky et al. 2002; Sinha et al. 2003), we tested our HMM method on the same gene sequences of the fly Drosophila melanogaster as in Rajewsky et al. (2002). We collected upstream sequences of 14 Drosophila developmental genes, each 12 kb long. Our goal is to identify regulatory elements in these genomic sequences. From the literature, there are eight well-studied transcription factors, i.e., Bicoid (Bcd), Caudal (Cad), Dorsal (Dl), Hunchback (Hb), Kruppel (Kr), Knirps (Kni), Tailless (Tll) and the torso response element (torre), acting at very early stages of Drosophila embryo development (Niessing et al. 1997).

Note that the HMM algorithm of Sinha et al. (2003) and the first algorithm (Ahab) in Rajewsky et al. (2002) require known TFBSs. To make a direct comparison, we first built our HMM using only the 8 PWMs provided in Rajewsky et al. (2002). Figure 3 illustrates the cis-regulatory module predictions for four genes: even-skipped, giant, hairy, and hunchback. The estimated probability of being sampled as within a module is plotted as a function of the position in the sequence. We chose a probability of 0.6 as the cutoff for predicting a CRM. The known module locations from the literature (Table 1 in Sinha et al. 2003 and the collection at http://webdisk.berkeley.edu/~dap5/data_04/appendix2.htm#crms) are marked by the shaded area in Figure 3. Our Module HMM identified 25 of the 27 true modules, whereas the HMM algorithms of Rajewsky et al. (2002) and Sinha et al. (2003) identified 15 and 16 respectively. Two additional modules were predicted by Module HMM with probability larger than 0.6. One was in the gene knirps, which was the same prediction as in Sinha et al. (2003). The other was in the gene even-skipped, different from Sinha et al.'s extra prediction in the gene runt. These new predictions may be putative modules that merit further investigation.

It is more interesting to search for CRMs without assuming known information about the eight Drosophila transcription factors. The HMM algorithm of Sinha et al. (2003) requires known PWMs and is therefore not applicable in this case. However, the algorithm Argos (Rajewsky et al. 2002) can predict CRMs from raw genomic sequences. To compare Module HMM with Argos, we started by scanning each 12 kb genomic sequence using all 43 available Drosophila PWMs in TRANSFAC and then predicted CRMs through the forward algorithm and backward sampling. The results were comparable with those of Argos, which has a 50% true positive rate. For seven well-studied genes (even-skipped, giant, hairy, hunchback, knirps, Kruppel, and tailless), with 15 known modules, Argos identified 7, whereas we identified 8 in the 12 kb upstream sequences. The false positive rate was assessed by applying Module HMM to sequences randomly selected from the fly genome.
In a random set of ten 24 kb genomic sequences, each with 12 kb upstream and 12 kb downstream, Module HMM predicted 0.8 modules per 5 kb, which is lower than the false positive rate of Argos, one module per 5 kb.
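The cutoff rule used in this section (a position is called part of a CRM when its sampled within-module frequency exceeds 0.6) and the "modules per 5 kb" rate can be illustrated with a short Python sketch. The function names `call_modules` and `modules_per_5kb` are hypothetical, and merging consecutive above-cutoff positions into one module is an assumption made here; the text does not specify how positions are grouped into modules.

```python
def call_modules(freq, cutoff=0.6):
    """Group consecutive positions with frequency > cutoff into modules.

    freq   : per-position frequency of being sampled within a module
    cutoff : probability threshold (0.6 in the Drosophila example)
    Returns a list of (start, end) intervals, 0-based and half-open.
    """
    modules = []
    start = None
    for i, f in enumerate(freq):
        if f > cutoff and start is None:
            start = i  # a new above-cutoff run begins
        elif f <= cutoff and start is not None:
            modules.append((start, i))  # the run ended at position i - 1
            start = None
    if start is not None:
        modules.append((start, len(freq)))  # run extends to the sequence end
    return modules


def modules_per_5kb(n_modules, total_bp):
    """Rate of predicted modules per 5 kb of scanned sequence."""
    return n_modules * 5000 / total_bp
```

For example, `call_modules([0.1, 0.7, 0.9, 0.2, 0.65, 0.61])` returns two intervals, `(1, 3)` and `(4, 6)`, and `modules_per_5kb(4, 25000)` gives 0.8.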

[Figure 3 has four panels (even-skipped, giant, hairy, hunchback), each plotting Frequency (0.0 to 0.8) against sequence position (0 to about 12000).]

Figure 3: Module prediction in four Drosophila genes. The probability of being sampled as within modules is plotted as a function of the position in the sequences. The four genes are even-skipped, giant, hairy, and hunchback respectively. The true module locations are indicated by the vertical lines.

Careful investigation of the 43 TRANSFAC PWMs shows that this collection contains only 5 (Bcd, Dl, Hb, Kr, Tll) of the 8 known transcription factors. When we extended the current TRANSFAC collection of Drosophila transcription factors to include all 8 TFBSs, Module HMM identified 16 of the 27 true CRMs, a true positive rate of about 60%, compared with the 50% true positive rate of Argos (Rajewsky et al. 2002) in the 14 genomic sequences. For the seven well-studied genes, Module HMM with the extended collection of PWMs had a higher true positive rate and a lower false positive rate than Argos. Namely, Module HMM predicted 10 of the 15 true modules, and its false prediction rate in random sequences was 0.81 modules per 5 kb. Table 3 summarizes the comparisons between our HMM approach and the HMM algorithms of Rajewsky et al. (2002) and Sinha et al. (2003). The new features that we introduced in Module HMM increase the sensitivity of CRM prediction. In particular, a larger collection of PWMs provides better predictions in Module HMM.

The next application examined Module HMM in identifying certain CRMs in a whole-genome search. We focused on CRMs that consist of TFBSs from the 8 transcription factors active in the early Drosophila embryo. Our goal was to predict developmental genes with this type of CRM present near the transcription start site (TSS). The fly genome sequences were downloaded from the UCSC genome database, D. melanogaster genome April 2004 (dm2), http://hgdownload.cse.ucsc.edu/downloads.html#fruitfly. There are a total of 4368 non-redundant genes in our collection, each with 12 kb upstream and 12 kb downstream of the TSS. Module HMM with the 8 PWMs was run on each 24 kb sequence. We found 479 CRMs located in 327 genes. This suggests a low false positive rate for Module HMM. The predictions are available at http://www.stat.purdue.edu/~jingwu/module/fly/flygenomepredict.txt.
This list provides potential Drosophila developmental genes and their regulatory regions. 20
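
The 24 kb input sequences for the genome-wide scan can be cut out as follows. This is a hedged Python sketch, assuming 0-based coordinates and a chromosome held as a plain string: the function names and the toy chromosome are hypothetical, and reverse-complementing minus-strand genes is an assumption about orientation that the text does not specify.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def tss_window(chrom_seq: str, tss: int, strand: str, flank: int = 12_000) -> str:
    """Sequence spanning `flank` bases on each side of a 0-based TSS.

    The window is clipped at the chromosome ends. For minus-strand genes the
    window is reverse-complemented so that upstream sequence comes first
    (an assumed convention).
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    start = max(0, tss - flank)
    end = min(len(chrom_seq), tss + flank)
    window = chrom_seq[start:end]
    return reverse_complement(window) if strand == "-" else window


# Toy chromosome with a 3-base flank instead of 12 kb.
chrom = "AACCGGTTAC"
plus_window = tss_window(chrom, tss=5, strand="+", flank=3)
minus_window = tss_window(chrom, tss=5, strand="-", flank=3)
clipped_window = tss_window(chrom, tss=1, strand="+", flank=3)
print(plus_window, minus_window, clipped_window)
```

With the default flank of 12,000 bases, each gene away from the chromosome ends yields the 24 kb sequence described above.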

                              Module HMM        Sinha              Rajewsky
True positive in 14 genes
  8 known PWMs                25/27 (93%)       16/27 (59%)        15/27 (56%)
  PWMs unknown                16/27 (60%)       n/a                50%

                              Module HMM        Module HMM         Rajewsky
                              (TRANSFAC)        (TRANSFAC+8PWMs)
True positive in 7 genes      8/15 (53%)        10/15 (67%)        7/15 (47%)
False positive                0.8 module/5kb    0.81 module/5kb    1 module/5kb

Table 3: Comparison of the Module HMM method with the HMMs of Rajewsky et al. (2002) and Sinha et al. (2003). The true positive numbers are the number of predicted modules over the total number of known modules. The false positives are the predictions of any module type in random sequences. PWMs are selected from TRANSFAC plus the 8 known TFBSs.

4 Discussion

The HMM developed here models a CRM state in addition to individual TFBSs. To construct the HMM, we first select potential TFBSs for a given sequence from all available PWMs (PWM databases plus the literature), then fit the HMM using the stochastic sampling procedure. These new features improve on the performance of existing HMM algorithms. Compared with other algorithms that do not assume a given set of TFBSs, i.e., Argos (Rajewsky et al. 2002) and CisModule (Zhou and Wong 2004), our method uses a large database of PWMs to decide on potential TFBSs. The available PWMs provide useful information and would therefore improve searches in general. More importantly, the annotated TFBSs in our predictions indicate direct relationships between transcription factors and genes, which is an advantage over the de novo methods. On the other hand, as noted by other

researchers (Rahmann et al. 2003), the TRANSFAC database is limited by the number and precision of its PWMs. The Drosophila example in Section 3 shows that Module HMM depends on the catalog of known PWMs. Module HMM works best when all potential TFBSs of a genomic sequence are contained in the PWM pool. It is therefore better to include annotated PWMs beyond TRANSFAC in the pre-selection procedure.

Our HMM method searches for modules and motifs one sequence at a time. It therefore does not require a good collection of correlated genes, and it works on data sets of any size. In addition, because the emission probabilities for the PWMs are obtained from the pre-selection procedure, the algorithm is computationally efficient and does not suffer from the convergence difficulties that are common in Gibbs sampling approaches. When applying the Module HMM method in Section 3, we found that multiple runs of the sampling procedure produced similar results, with similar numbers of true positives and the same binding site locations. In contrast, both BioProspector and CisModule gave different results in different runs.

The Module HMM method provides a joint probability of modules and within-module motifs, which models the dependence between motifs to a certain extent. The current version of the HMM can be extended to more complex models that describe the order or precise spacing of the cooperating TFBSs. Future work could connect the results of promoter sequence analysis with gene expression data. It would be beneficial to identify modules of TFBSs that confer temporal and spatial expression patterns for a group of co-expressed genes.
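
The pre-selection of potential TFBSs from a PWM collection can be pictured as a log-odds scan. The Python sketch below is a generic illustration under stated assumptions, not the selection rule used by Module HMM: the three-column PWM, the uniform background, the score threshold of 2.0, and the function names are hypothetical, and only the forward strand is scanned.

```python
import math
from typing import Dict, List, Tuple

Column = Dict[str, float]  # base -> probability at one PWM position


def window_score(window: str, pwm: List[Column], background: Column) -> float:
    """Log-odds score of one window: sum over positions of log(p_pwm / p_background)."""
    return sum(math.log(col[base] / background[base]) for base, col in zip(window, pwm))


def scan(seq: str, pwm: List[Column], background: Column,
         threshold: float) -> List[Tuple[int, float]]:
    """Return (start, score) for every forward-strand window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        score = window_score(seq[i:i + width], pwm, background)
        if score >= threshold:
            hits.append((i, score))
    return hits


# Hypothetical PWM with consensus TGA (0.7 for the consensus base, 0.1 otherwise).
pwm = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

hits = scan("ACTGACGTGAT", pwm, background, threshold=2.0)
print([(pos, round(score, 3)) for pos, score in hits])
```

A PWM with at least one hit in a sequence would be kept as a candidate for that sequence. In practice both strands would be scanned and the threshold would be set per PWM, for example from scores in random sequences.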

Acknowledgement

We are grateful to James Fleet for helping collect the E2F sequence data and to Ker-Chau Li for helpful discussions on the work. We also thank the two anonymous referees for useful comments that improved the paper.

References

Bailey, T.L. and Noble, W.S. 2003. Searching for statistically significant regulatory modules. Bioinformatics 19 Suppl. 2, ii16-ii25.
Baum, L.E. 1972. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3, 1-8.
Crowley, E.M., Roeder, K. and Bina, M. 1997. A statistical model for locating regulatory regions in genomic DNA. J. Mol. Biol. 268, 8-14.
Dempster, A.P., Laird, N.M. and Rubin, D.B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39, 1-38.
Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. 1998. Biological Sequence Analysis. Cambridge University Press, Cambridge, United Kingdom.
Elkon, R., Linhart, C., Sharan, R., Shamir, R. and Shiloh, Y. 2003. Genome-wide in silico identification of transcriptional regulators controlling the cell cycle in human cells. Genome Res. 13, 773-780.
Frith, M.C., Hansen, U. and Weng, Z. 2001. Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics 17, 878-889.
Kel, A.E., Kel-Margoulis, O.V., Farnham, P.J., Bartley, S.M., Wingender, E. and Zhang, M.Q. 2001. Computer-assisted identification of cell cycle-related genes: New targets for E2F transcription factors. J. Mol. Biol. 309, 99-120.
Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and Wootton, J.C. 1993. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262, 208-214.
Liu, X., Brutlag, D.L. and Liu, J.S. 2001. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput., 127-138.
Niessing, D., Rivera-Pomar, R., La Rosee, A., Hader, T., Schock, F., Purnell, B.A. and Jackle, H. 1997. A cascade of transcriptional control leading to axis determination in Drosophila. J. Cell. Physiol. 173, 162-167.
Rahmann, S., Muller, T. and Vingron, M. 2003. On the power of profiles for transcription factor binding site detection. Statistical Applications in Genetics and Molecular Biology 2, 1-27.
Rajewsky, N., Vergassola, M., Gaul, U. and Siggia, E.D. 2002. Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics 3.
Roth, F.R., Hughes, J.D., Estep, P.E. and Church, G.M. 1998. Finding DNA regulatory motifs within unaligned non-coding sequences clustered by whole-genome mRNA quantitation. Nature Biotechnology 16, 939-945.
Sharan, R., Ovcharenko, I., Ben-Hur, A. and Karp, R.M. 2003. CREME: A framework for identifying cis-regulatory modules in human-mouse conserved segments. Bioinformatics 19, 1-9.
Sinha, S., van Nimwegen, E. and Siggia, E.D. 2003. A probabilistic method to detect regulatory modules. Bioinformatics 19 Suppl. 1, i292-i301.
Thompson, W., Palumbo, M.J., Wasserman, W.W., Liu, J.S. and Lawrence, C.E. 2004. Decoding human regulatory circuits. Genome Research 14, 1967-1974.
Tommasi, S. and Pfeifer, G.P. 1995. In vivo structure of the human cdc2 promoter: Release of a p130-E2F-4 complex from sequences immediately upstream of the transcription initiation site coincides with induction of cdc2 expression. Mol. and Cell. Biol. 15, 6901-6913.
Wasserman, W.W. and Fickett, J.W. 1998. Identification of regulatory regions which confer muscle-specific gene expression. J. Mol. Biol. 278, 167-181.
Wingender, E., Chen, X., Hehl, R., et al. 2000. TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res. 28, 316-319.
Yuh, C.H., Bolouri, H. and Davidson, E.H. 1998. Genomic cis-regulatory logic: Experimental and computational analysis of a sea urchin gene. Science 279, 1896-1902.
Zhou, Q. and Wong, W.H. 2004. CisModule: De novo discovery of cis-regulatory modules by hierarchical mixture modeling. PNAS 101, 12114-12119.