BMC 2002, Bioinformatics - PDF Free Download

BMC Bioinformatics BioMed Central BMC, Bioinformatics 3 Methodoloy article Correlatin overrepresented upstream motifs to ene expression: a computational approach to reulatory element discovery in eukaryotes Michele Caselle 1, Ferdinando Di Cunto and Paolo Provero* 3,1 Address: 1 Dipartimento di Fisica Teorica, Università di Torino, and INFN, Sezione di Torino, Torino, Italy, Dipartimento di Genetica, Bioloia e Biochimica, Università di Torino, Torino, Italy and 3 Dipartimento di Scienze e Tecnoloie Avanzate, Università del Piemonte Orientale, Alessandria, Italy E-mail: Michele Caselle - caselle@to.infn.it; Ferdinando Di Cunto - ferdinando.dicunto@unito.it; Paolo Provero* - provero@to.infn.it *Correspondin author Published: 14 February BMC Bioinformatics, 3:7 Received: 13 December 1 Accepted: 14 February This article is available from: Caselle et al; licensee BioMed Central Ltd. Verbatim copyin and redistribution of this article are permitted in any medium for any purpose, provided this notice is preserved alon with the article's oriinal URL. Abstract Backround: Gene reulation in eukaryotes is mainly effected throuh transcription factors bindin to rather short reconition motifs enerally located upstream of the codin reion. We present a novel computational method to identify reulatory elements in the upstream reion of eukaryotic enes. The enes are rouped in sets sharin an overrepresented short motif in their upstream sequence. For each set, the averae expression level from a microarray experiment is determined: If this level is sinificantly hiher or lower than the averae taken over the whole enome, then the overerpresented motif shared by the enes in the set is likely to play a role in their reulation. Results: The method was tested by applyin it to the enome of Saccharomyces cerevisiae, usin the publicly available results of a DNA microarray experiment, in which expression levels for virtually all the enes were measured durin the diauxic shift from fermentation to respiration. Several known motifs were correctly identified, and a new candidate reulatory sequence was determined. Conclusions: We have described and successfully tested a simple computational method to identify upstream motifs relevant to ene reulation in eukaryotes by studyin the statistical correlation between overepresented upstream motifs and ene expression levels. Introduction One of the biest challenes of modern enetics is to extract bioloically meaninful information from the hue mass of raw data that is becomin available. In particular, the availability of complete enome sequences on one hand, and of enome-wide microarray data on the other, provide invaluable tools to elucidate the mechanisms underlyin transcriptional reulation. The sheer amount of available data and the complexity of the mechanisms at work require the development of specific data analysis techniques to identify statistical patterns and reularities, that can then be the subject of experimental investiation. The reulation of ene expression in eukaryotes is known to be mainly effected throuh transcription factors bindin to rather short reconition motifs enerally located upstream of the codin reion. One of the main problems in studyin reulation of ene expression is to identify the Pae 1 of 1

motifs that have transcriptional meanin, and the enes each motif reulates. The usual approach to this kind of analysis beins by identifyin roups of co-reulated enes, for example by applyin clusterin techniques to the expression profiles obtained from microarray experiments. One then studies the upstream sequences of a set of coreulated enes lookin for shared motifs. Examples of this approach as applied to S. cerevisiae are Refs. [1,,4]. In this paper we suest an alternative method which somehow follows the inverse route: enes are rouped into (non-disjoint) sets, each set bein characterized by a short motif which is overrepresented in the upstream sequence. For each set, the averae expression is computed for a certain microarray experiment, and compared to the enome-wide averae expression from the same experiment. If a statistically sinificant diffrence is found, then the motif that defines the set of enes is a candidate reulatory sequence. The rationale for lookin for overrepresented motifs is that, in many instances, reulatory motifs are known to appear repeated many times within a relatively short upstream sequence [,3], so that the number of repetitions turns out to be much bier than what would be expected from chance alone. A somehow related approach, which does not require any previous roupin of enes based on their expression profiles, was presented in Ref. [5], where the effect of upstream motifs on ene expression levels is modeled by a sum of activatin and inhibitory terms. Experimental expression levels are then fitted to the model, and statistically sinificant motifs are identified. Our approach differs in the importance iven to overrepresented motifs, thus considerin activation and inhibition as an effect that depends on a threshold number of repetitions of a motif rather than on additive contributions from all motifs. Clearly the two mechanisms are far from bein mutually exclusive, therefore we expect the candidate reulatory sites found with the two methods to sinificantly overlap. However it is important to notice that the kind of statistical correlation between upstream motifs and expression that our alorithm identifies does not depend on any special assumption on the functional dependence of expression levels on the number of motif repetitions, as lon as this dependence is stron enouh to provide a sinificant deviation from the averae expression when enouh copies of the motif are present. A comparison of our results with those obtained in Ref. [5] is provided in the "Results and discussion" section. The method In eneral the motifs with known reulatory function are not identified with a fixed nucleotide sequence, but rather with sequences where substitutions are allowed, or spaced dyads of fixed sequences, etc. However in this study, in order to test the method while keepin the technical complications to a minimum, we will limit ourselves to fixed short nucleotide sequences, that we call words. While previous studies (see e. []) show that even this simple analysis can ive interestin results, the method we present can easily be eneralized to include variable sequences and other more complicated patterns. The computational method we propose has two main steps: first the open readin frames (ORFs) of an eukaryote enome are rouped in (overlappin) sets based on words that are overrepresented in their upstream reion, compared to their frequencies in the reference sample made of all the upstream reions of the whole enome. Each set is labelled by a word. Then for each of these sets the averae expression in one or more microarray experiments are compared to the enome-wide averae: if a statistically sinificant difference is found, the word that labels the set is a candidate reulatory site for the enes in the set, either enhancin or inhibitin their expression. It is worth stressin that the roupin of the enes into sets depends only on the upstream sequences and not on the microarray experiment considered: It needs to be done only once for each oranism, and can then be used to analyse an arbitrary number of microarray experiments. It is precisely this fact that should allow the extension of the method to patterns more complex than fixed sequences, while keepin the required computational resources within reasonable limits. Constructin the sets We consider the upstream reion of each open readin frame (ORF), and we fix the maximum lenth K of the upstream sequence to be considered. The choice of K depends on the typical location of most reulatory sites: in eneral K is a number between several hundred and a few thousand. For each ORF, the actual lenth of the sequence we consider is K defined as the minimum between K and the available upstream sequence before the codin reion of the previous ene. For each word w of lenth l (6 l 8 in this study), and for each ORF we compute the number m (w) of occurrences of w in the upstream reion of. Non palindromic words are counted on both strands: therefore we define the effective number of occurrences n (w) as Pae of 1

n w m w m w if w w n w m w if w w ( )= ( )+ ( ) ( 1) ( )= ( ) = ( ) where is the reverse complement of w. w We define the lobal frequency p(w) of each word w as p( w)= n L ( w) ( w) ( 3) where, in order to count correctly the available space for palindromic and non palindromic words, L( w)= K l+ 1 if w w 4 L ( w)= K l+ 1 if w w 5 ( ) ( ) ( ) = ( ) p(w) is therefore the frequency with which the word w appears in the upstream reions of the whole enome: it is the "backround frequency" aainst which occurrences in the upstream reions of the individual enes are compared to determine which words are overrepresented. For each ORF and each word w we compute the probability b (w) of findin n (w) or more occurrences of w based on the lobal frequency p(w): b ( w)= L ( ) w L w n L ( w) n p( w) 1 p( w) 6 w n n= n ( ) ( ) ( ) Table 1: Number of data, averae and standard deviation for the 7 time-points. i N(i) R(i) σ (i) 1 68 -.888.59 654 -.378.81 3 6.113.315 4 671 -.1957.3433 5 658 -.43.389 6 684.944.86 7 61 -.8.8886 We define a maximum probability P, dependin in eneral on the lenth l of the words under consideration, and consider, for each w, the set { } ( ) ( ) = ( ) < S w : b w P 7 of the ORFs in which the word w is overepresented compared to the frequency of w in the upstream reions of the whole enome. That is, w is considered overrepresented in the upstream reion of if the probability of findin n (w) or more instances of w based on the lobal frequency is less than P. This completes the construction of the sets S(w). Two free parameters have to be fixed: the lenth K of the upstream reion to be considered and the probability cutoff P for each lenth l of words considered. A result in Ref. [] suests suitable choices of these two numbers: the authors list the 34 ORFs of S. cerevisiae that have 3 or more occurrences of the word GATAAG in their 5 bp upstream reion. 3 out of these 34 ORFs correspond to a ene with known function, and out of these 3 are reulated by nitroen. This result suests to choose K = 5 for the upstream lenth, and a value of the probability cutoff such that three or more instances of GATAAG in the 5 bp upstream reion of an ORF are considered sinificant. Any choice of P between.18 and.1 would satisfy this criterion, and we chose P = :. Tentatively, we kept the same value of P for all values of l. With this choice, the number of instances of a word that are necessary to be considered overrepresented in a 5 bp upstream sequence can be as hih as six for common 6-letter words and as low as one for rare 8-letter words. In particular, our set S(GATAAG) almost 1 coincides with the one discussed in []. However the word GATAAG will not turn out to be sinificant in our study. As noted above, it would be natural to make the probability cutoff P depend on the word lenth, simply because the number of possible words increases with their lenth: For example one could take the cutoff for each word lenth to be inversely proportional to the number of independent words of such lenth. However it turns out that this procedure tends to construct sets that are less sinificant when tested for correlation with expression. Therefore we chose to fix the cutoff at. for all word lenths. It is important to keep in mind that no statistical sinificance whatsoever is attributed to the sets per se: The only sets that are retained at the end of the analysis are the ones that show sinificant correlation with expression. Therefore the choice of the cutoff in the construction of the sets can be based on such a pramatic approach without jeopardizin the statistical relevance of the final result. Pae 3 of 1

Studyin the averae expression level in each set The second step of our procedure consists in studyin, for each set S(w) defined as above, the expression profiles of the ORFs belonin to S(w) in DNA microarray experiments. The idea is that if the averae expression profile in the set S(w) for a certain experiment is sinificantly different from the averae expression for the same experiment computed on the whole enome, then it is likely that some of the ORFs in S(w) are coreulated and that the word w is a bindin site for the common reulatin factor. To look for such instances we consider the ene expression profiles durin the diauxic shift, i.e. the metabolic shift from fermentation to respiration, as measured with DNA microarrays techniques in Ref. [1]. In the experiment ene expression levels were measured for virtually all the enes of S. Cerevisiae at seven time-points while such metabolic shift took place. The experimental results are publicly available from the web supplement to Ref. [1]. We considered each time-point as a sinle experiment, and for each ene we defined the quantity r (i) (1 = 1,..., 7) as the lo of the ratio between the mrna levels for the ene at time-point i and the initial mrna level. Therefore e.. r (i) = 1 means a two-fold increase in expression at timepoint i compared to initial expression. For each time-point i we computed the enome-wide averae expression R(i) and its standard deviation σ (i). These are reported in Tab. 1, where N(i) is the number of enes with available expression value for each timepoint. Then for each word w we compute the averae expression in the subset of S w iven by the enes for which an experimental result is available at timepoint i (in most cases this coincides with S w ): Rw 1 ( i) = N i, w ( ) r S w ( i) ( 8 ) where N(i, w) is the number of ORFs in S w for which an experimental result at timepoint i is available, and the difference Rw( i) = Rw( i) R( i) ( 9) R w (i) is the discrepancy between the enome-wide averae expression at time-point i and the averae expression at the same time-point of the ORFs that share an abundance of the word w in their upstream reion. A sinificance index si(i, w) is defined as ( ) ( ) Rw i si ( iw, ) = N i, w s i ( ) ( 1 ) and the word w is considered sinificantly correlated with expression at time point i if si ( iw, ) > Λ ( 11) In this work we chose Λ = 6: this means that we consider meaninful a deviation of R w (i) by six s.d.'s from its expected value. The sin of si(i, w) indicates whether w acts as an enhancer or an inhibitor of ene expression. Table : Sinificant words related to the PAC motif. word enes timepoints score GATGAG 4 - - - -6.7 - - - 1. GATGAGAT 35 - - - -8. -6.6-6.18-7.86.94 GATGAGA 6 - - - -7.6 - - -6.64.93 GAGATGAG 36 - - - -6.96 - - -6.5.9 AGATGAG 33 - - - -6.17 - - -6.44.91 GAGATGA 4 - - - -6. - - -.83 ATGAGATG 3 - - - -6.96 - - -6.33.8 GAGATG 31 - - - -6.4 - - -.75 TGAGATG 47 - - - -6.6 - - -6.1.7 Pae 4 of 1

. expression (lo ) -. -.4 -.6 -.8 GENOME S(GATGAG) si - -4 GATGAG -1-6 -1. time Fiure 1 expression of the enes in the set S(GATGAG)The averae expression of the enes in the set S(GATGAG) (solid red line) compared to the enome-wide averae expression (dashed reen line) at the seven time points of the diauxic shift experiment. The expression data are the lo of the ratio between mrna levels at each timepoint and the initial mrna level. Results and discussion We found a total of 9 words of lenth between 6 and 8 above our sinificance threshold si > 6. Most of them are related to known reulatory motifs; two words turned out to be false positives due to the presence, in their sets, of families of identical ORF's. Finally, one word does not match any known motif and is a candidate new bindin site. The comparison between our sinificant words and known motifs was performed usin the database of reulatory motifs made publicly available by the authors of Ref. [6], and the CompareACE software [7] available from the same web source. This packae allowed us to compute the Pearson correlation coeffcient of the best alinment between each of our sinificant words and each known reulatory motif (expressed as a set of nucleotide frequencies). We used the followin criterion to associate our sinificant words to known motifs: a motif is considered as identified if at least one sinificant word scores better than.8 when compared to it. A probability value for this choice of the cutoff can be estimated to be a few percent: out of all the 8 independent 6-letter words, 66 (that is 3.17%) score better than.8 with at least one motif. For 7- and 8- letter words we have respectively.1% and 1.51%. Once a motif has been identified, all words which score best with the motif are attributed to it, independently of the score, provided their expression pattern is consistent with the word(s) scorin better than.8. time Fiure statistical sinificance of the set S(GATGAG) The statistical sinificance si(i, w) as defined in Eq. (1) for the word w = GATGAG and timepoints i = 1,..., 7 in the diauxic shift experiment. The dashed line is the sinificance threshold si = 6. PAC and RRPE motifs Nine sinificant words can be associated to the PAC motif [8,4,7], all of them with rather hih scores. They are shown in Tab., where, as in all the followin tables, sinificativity indices are shown only for those timepoints where they exceed our threshold jsij > 6. Given the perfect alinment of these words, it is not surprisin that these sets larely overlap each other: The union af all the nine sets contains a total of 96 enes. As an example, in Fi. 1 we show the averae expression for the enes associated with the word GATGAG as a function of the time, compared to the averae expression computed over the whole enome. Fi. shows the sinificance index for the same set. The enes in this set are shown in Tab. 3 toather with their expression profiles. Two words can be associated with confidence to the motif RRPE [4,7], and are shown in Tab. 4. The union of the two sets contains 76 enes. We see that enes containin the motifs PAC and RRPE are repressed at the late stae of the diauxic shift compared to the early staes. This result is in areement with the expression coherence score data available from the web supplement to Ref. [6]: There one can see that (1) of all known reulatory motifs, PAC and RRPE show the hihest expression coherence for the diauxic shift and () viceversa, of the eiht experimental conditions considered in Ref. [6], the diauxic shift is the one in which both the PAC and RRPE motif show the hihest expression coherence score. STRE and MIG1 motifs A total of ten sinificant words can be associated to the motifs STRE [9,1] and MIG1 [11,1]. It is well known Pae 5 of 1

Table 3: The ORFs in the set S(GATGAG) with their expression profiles. ORF ene timepoints YBL54W.1 -.1 -.18-1.56-1.5 -.79-1.47 YCL59C KRR1.36.6.45 -.69 -.71 -.34-1.69 YDL63C -.3 -.3 -.7 -.9-1.6-1.51 -.6 YDL153C SAS1.41.19.36 -.76 -.97-1.43-1.79 YDR365C.3.6.1 -.38 -.6-1.64-1.94 YGRC -.17 -.6.14.4 -.15.54.86 YGR1C -.3 -.3 -.7 -.3.3 1.43.84 YGR13W NOP7.15 -.6.3 -.9-1.9-1.64 -.56 YGR18C.3.6.38 -.81 -.76 -.89-1.47 YGR19W SYF -.18 -.54.11 -.1 -.3.74.14 YGR145W. -.3.5 -.9-1.9-1.69 -.18 YJL33W HCA4 -.6.1.1 -.94 -.36 -.67 -.6 YKL78W -.4 -.1.4-1.1 -.97 -.71-1.89 YKL17W EBP.1.1.3 -.74 -.56 -.4-1.4 YLR76C DBP9.3.14.3 -.6 -.86 -.67-1.64 YLR41C -.6 -.7.7 -.71 -.71 -.84-1.3 YLR4W -.18 -.3 -.3 -.47 -.51 -. -.7 YML13C PHO84.5.5.54 -.56 -.67 -.3-1.69 YNL61W NOP -.3 -.51 -.4-1.9-1.36 -.5.1 YNL6C GCD1 -.1..1 -.47 -.64-1.1-1.6 YOL141W PPM -.1.1.4 -.84 -.54.4 -. YPL68C -.6 -.1 -.18 -.84-1.9.8 -.89 YPR11C MRD1 -.17 -.3 -.17 -.54 -.6-1.1-1.51 YPR113W PIS1 -.4..6.5.56 1.1-1.3 set averae.5 -.36.14 -.666 -.676 -.679-1.16 enome averae -.89 -.38.113 -.196 -.4.9 -.3 sinificance 1.83.3.17-6.71-5.47-4.6-4.98 Table 4: Sinificant words related to the RRPE motif. word enes timepoints score AAAATTT 5 - - - - - -7.9-8.58.91 AAAATTTT 6 - - - -6.59 - -8.73-1.6.89 that these play an important role in lucose repression (see e..[1,13] and references therein). Most of these words show comparable scores for the two motifs (due to their similarity) so we decided to show them toether in Tab. 5 which shows the two scores for each word. A total of 1 enes belon to the union of all these sets. The UME6 motif Two words are associated to the known UME6 motif, a.k.a. URS1 [14,15], known to be a pleiotropic reulator implicated in lucose repression [16]. They are shown in Tab. 6. The two sets do not overlap, so that a total of 56 enes are associated to this motif. Pae 6 of 1

Table 5: Sinificant words related to the STRE and MIG1 motifs. The words marked * actually score better with the variant STRE' motif (.6 and.55 respectively). word enes timepoints score STRE MIG1 CCACCCCC 35 - - - - - 6.39 -.8.53 CCCCCCCT 8 - - - - - 6.1 -.79.71 CCCCTG 8 - - - 7.6 6.9 7. -.59.54 CAGCCCCT 3 - - - - - 6.4 -.59.4 GCCCCT * 4 - - - - - 7.5 -.59.56 GCCCCCTG * 17 - - - - 6.7 - -.47.46 TACCCC 5 - - - - - 6.9 -.55.85 CCCCCC 56 - - - - 6.48 6.55 6.1.7.8 ACCCCT 9 - - - - - 7.4 -.63.65 GGCCCC 16 - - - - 6.71 - -.5.56 Table 6: Sinificant words related to the UME6 motif. word enes timepoints score GCCGCC 7 - - - - - - 6.3.8 AGCCGCGC 9 - - - - - - 6.63.6 Table 7: Sinificant words of uncertain attribution. word enes timepoints ACTTTC - - - 6. - - - CCCCTGAA 4 - - - 6.5 - - - GCCCCTGA - - - 6.9 - - - Other sinificant words Three words, shown in Tab. 7, are of uncertain status: for the first one, the set S(ACTTTC) contains only enes, makin the statistical sinificance of the result questionable. The word CCCCTGAA scores best with the PDR motif (.58): iven the low sinificance of this score, and the fact that PDR does not seem to be relevant for any other word, this is most likely accidental. The word should probably be considered as belonin to the STRE/MIG1 motif (the scores are STRE:.46, MIG1:.49). Finally the word GCCCCTGA scores best with UME6 (.55), but its expression pattern is more similar to the STRE/MIG1 motifs (scores: STRE:.44, MIG1:.46). False positives due to families of identical or nearly identical ORF's The enome of S. cerevisiae contains a few families of enes whose codin and upstream reions are identical or nearly identical. Consider for example the COS1 ene (YNL336W): the seven enes COS-COS8 have both cod- Pae 7 of 1

7 expression (lo ) 1.5 1.5 S(ATAAGGG) GENOME si 6 5 4 3 1 ATAAGGG -1 time Fiure 3 expression of the enes in the set S(ATAAGGG) Same as Fi. 1 for the enes in the set S(ATAAGGG), our new candidate reulatory motif. time Fiure 4 statistical sinificance of the set S(ATAAGGG) Same as Fi. for the enes in the set S(ATAAGGG). in sequence and 5 kb upstream sequence coincidin better than 8% with the COS1 sequence. Therefore if the upstream sequence of COS1 contains over-erpresented words, they will likely appear in all of the upstream reions. On the other hand, the expression profiles of all the enes in the family will be the same when measured by a microarray experiment, simply because the experimental apparatus cannot distinuish between the mrna produced by the various members of the family, due to crosshybridization between their mrna. Therefore all of the enes of the family are likely to occur in the sets of the words that are overrepresented in their upstream reion, and even a small deviation from the enome-averaed expression acquires a statistical sinificance. We found two instances of this in our analysis: the words GACGTAGC and GGTCGCAC appear to be associated to sinificant enhancement of the correspondin sets of enes at late timepoints in the diauxic shift: however the two sets contain respectively seven out of eiht and all of the COS1-COS8 enes. Since the COS enes are mildly overexpressed, this creates a false statistical sinificance. When one corrects for this, by keepin only one representative of the family, the statistical sinificance of the two sets disappears. A candidate new motif Finally, the word ATAAGGG/CCCTTAT is a candidate new bindin site, since it does not have ood comparison scores with any of the known motifs. It scores best with the AFT1 motif, with a.5 score which is practically meaninless since 84.9% of all independent 7-letter words score the same or better with at least one motif. It is associated with 13 enes, as shown in Tab. 8, which are overexpressed at late timepoints. The averae expression levels for the set and the sinificance index are shown as a function of time in Fis. 3 and 4. Comparison with the results of Ref. [5] As stated in the introduction, the method proposed in Ref.[5] also allows one to identify reulatory motifs without any previous clusterin of ene expression data: a linear dependence of the loarithm of the expression levels on the number of repetitions of each reulatory motifs is postulated, and motifs are ranked accordin to the reduction in χ obtained when such dependence is subtracted from the experimental expression levels. Iteration of the procedure produces a model, that is a set of relevant reulatory motifs, for each expression data set. We can conclude that the two methods tend to find motifs with a different effect on ene expression: probably the best results can be obtained by usin them both on the same data set. In Ref. [5] such a model is presented for the 14 min. time point in the α-synchronized cell-cycle experiment of Spellmann et al., Ref. [17]. We used our alorithm on the same data set to compare the findins. Let us concentrate on the 7-letter words (the lonest considered in [5]). We found 9 sinificant words, reported in Tab. 9. Of these, five coincide with or are very similar to words found by the authors of Ref.[5] (see their Tab. ). The remainin four (AGGCTAA, GGCTAAG, GCTAAGC and CTAAGCG, whose similarity clearly suests the existence of a loner motif) are of particular interest for the purpose of comparin the two methods: If one looks at the dependence of the expression levels on the number of occurrences of these words in the 5 bp upstream reion, one clearly sees the existence of an activation threshold (see Fi. 5, where such dependence is shown for GGC-TAAG). On the Pae 8 of 1

Table 8: The ORFs in the set S(ATAAGGG) with their expression profiles. ORF ene timepoints YBR7W HSP6 -.1.4.36 1. 1.43 3.47.84 YDL133W -.4.3 -.34 -.5 -.56 -. -.3 YDL4W -.36.9 -.51.6.8 4.5 3.6 YIL136W OM45 -.97 -.7.1 -.5 1.3 3.47 1.79 YLR163C MAS1.4 -.1.11 -.1.8.3 -.3 YLR164W -.3 N/A -.7.6 -.18.19 1.69 YLR453C RIF -.7 -.7.3 -.1 -.71.69.8 YML17W RSC9.1.14.8 -.18 -.7 -.3-1.6 YML18C MSC1 -.1..97 1.56 1.36 4.3 3.47 YNL117W MLS1 -.3 -.4.71 -.3 -.7.76 3.18 YPR5C CCL1 -.18 -.36 -.3 -.5 -.4.36. YPR6W ATH1 -.6 -.4.11...6 1.56 YPR17W.9.3 -.7 -.7 -. 1.43.9 set averae -.159.85.16.1.143 1.65 1.337 enome averae -.89 -.38.113 -.196 -.4.9 -.3 sinificance -1.1 1.5 -.8 3.3 3.57 6.7 6.5 Table 9: Sinificant 7-letter words for the 14-minute timepoint in the α-synchronized cell-cycle experiment word enes si.7.6.5 (17) AAAATTT 5-7.63 ACGCGTC 8 6.46 AGATGAG 33-6.96 GATGAGA 5-6.47 GAGATGA 41-6.6 expression (lo ).4.3..1 GCTAAGC (5779) (16) GGCTAAG 17 7.3 AGGCTAA 6.65 CTAAGCG 16 6.89 GCTAAGC 17 6.77 Fiure 5 expression as a function of occurrences of the word GGCTAAG The averae expression of enes presentin n occurrences of the word GGC-TAAG as a function of n in the 14 min. time point of the α-synchronized cell-cycle experiment of Spellmann et al., Ref. [17]. In parentheses is other hand, by lookin at these data one hardly expects a sinificant reduction in χ when tryin to describe this dependence with a straiht line. This should be compared to the same dependence for the word AAAATTT, shown in Fi. 6, which is found by both alorithms. On the other hand, there are two 7-word motifs found in [5] that do not pass our sinificativity threshold, that is CCTCGAC and TAAACAA. -.1 1 occurrences the number of enes with n occurrences of GGCTAAG in the upstream reion. The horizontal line represents the averae expression for the whole enome. Conclusions We have presented a new computational method to identify reulatory motifs in eukaryotes, suitable to identify those motifs that are effective when repeated many times in the upstream sequence of a ene. The main feature that differentiates our method from existin alorithms for Pae 9 of 1

expression (lo ).1 -.1 -. -.3 -.4 -.5 (411) (1) AAAATTT (6) (13) occurrences Fiure 6 expression as a function of occurrences of the word AAAATTT Same as Fi. 5 for AAAATTT. motif discovery is the fact that enes are rouped a priori based on similarities in their upstream sequences. Most of the sinificant words the alorithm finds can be associated to five known reulatory motifs: This fact consitutes a stron validation of the method. Three of them (STRE, MIG1 and UME6) were previously known to be implicated in lucose suppression, while the fact that PAC and RRPE sites are relevant to reulation durin the diauxic shift is in areement with expression coherence data as reported in the web supplement to Ref. [6]. One of the sinificant words we find (ATAAGGG) cannot be identified with any known motif, and is a candidate new bindin site. It is easy, at least in principle, to extend the method to a larer class of reulatory sites. Accordin to our knowlede of ene reulation, this should be done at least in two directions: (1) the analysis should not be restricted to fixed sequences, but extended to motifs with controlled variability; in particular the extension to spaced dyads [18] should be straihtforward; () the combinatorial analysis of bindin sites [6] could also be performed alon the same lines, that is first roupin enes accordin to which combinations of motifs appear in their upstream reion, and then analysin expression profiles within each roup. 1 Our set is smaller that the one reported in Ref. [] because we do not allow the upstream sequence to overlap with the previous ene: this eliminates 7 enes form the set. (7) (14) () (1) References 1. DeRisi JL, Iyer VR, Brown PO: Explorin the metabolic and enetic control of ene expression on a enomic scale. Science 1997, 78:68-686 [http://cmm.stanford.edu/pbrown/explore/]. van Helden J, André B, Collado-Vides J: Extractin reulatory sites from the upstream reion of yeast enes by computational analysis of olionucleotide frequencies. J Mol Biol 1998, 81:87-84 3. Waner A: A computational enomics approach to the identification of ene networks. Nucleic Acids Research 1997, 5:3594-364 4. Tavazoie S, Huhes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of enetic network architecture. Nature Genetics 1999, :81-85 5. Bussemaker HJ, Li H, Siia ED: Reulatory element detection usin correlation with expression. Nature Genetics 1, 7:167-171 6. Pilpel Y, Sudarsanam P, Church GM: Identifyin reulatory networks by combinatorial analysis of promoter elements. Nature Genetics 1, 9:153-159 [http://enetics.med.harvard.edu/ ~tpilpel/motcomb.html] 7. Huhes JD, Estep PW, Tavazoie S, Church GM: Computational identification of cis-reulatory elements associated with roups of functionally related enes in Saccharomyces cerevisiae, 96:15-114 8. Dequard-Chablat M, Riva M, Carles C, Sentenac A: RPC19, the ene for a subunit common to yeast RNA polymerases A (I) and C (III). J Biol Chem 1991, 66:153-1537 9. Kobayashi N, McEntee K: Identification of cis and trans components of a novel heat shock stress reulatory pathway in Saccharomyces cerevisiae. Mol Cell Biol 1993, 13:48-56 1. Martinez-Pastor MT, Marchler G, Schuller C, Marchler-Bauer A, Ruis H, Estruch F: The Saccharomyces cerevisiae zinc-finer proteins Msnp and Msn4p are required for transcriptional induction throuh the stress-response element (STRE). EMBO J 1996, 15:7-35 11. Nehlin JO, Ronne H: Yeast MIG1 repressor is related to mammalian early rowth response and Wilm's tumour finer proteins. EMBO J 199, 9:891-898 1. Ostlin J, Carlber M, Ronne H: Functional domains in the Mi1 repressor. Mol Cell Biol 1996, 16:753-761 13. Johnston M: Feastin, fastin and fermentin. Glucose sensin in yeast and other cells. Trends in Genetics 1999, 15:9-33 14. Sumrada MA, Cooper TG: Ubiquitous upstream repression sequences control activation of the inducible arinase ene in yeast. Proc Natl Acad Sci USA 1987, 84:3997-41 15. Gailus-Durner V, Chintamaneni C, Wilson R, Brill SJ, Vershon AK: Analysis of a meiosis-specific URS1 site: sequence requirements and involvement of replication protein A. Mol Cell Biol 1997, 17:3536-3546 16. Kratzer S, Schuller HJ: Transcriptional control of the yeast acetyl-coa synthetase ene, ACS1, by the positive reulators CAT8 and ADR1 and the pleiotropic repressor UME6. Mol Microbiol 1997, 6:631-641 17. Spellmann PT, et al: Comprehensive identification of cell cyclereulated enes of the yeast Saccahromyces Cerevisiae by mi-croarray hybridization. Mol Biol Cell 1998, 9:373-397 18. van Helden J, Rios AF, Collado-Vides J: Discoverin reulatory elements in non-codin sequences by analysis of spaced dyads. Nucleic Acids Research, 8:188-1818 19. van Helden J, André B, Collado-Vides J: A web site for the computational analysis of yeast reulatory sequences. Yeast, 16:177-187 Acknowledement Upstream sequences were downloaded, and many cross-checks were performed, usin the impressive collection of packaes available from the Reulatory Sequence Analysis Tools (RSAT) [19] at [http:// www.ucmb.ulb.ac.be/bioinformatics/rsa-tools/] or... [http://embnet.cifn.unam.mx/rsa-tools/]. Pae 1 of 1