The Repetitive Sequence Database and Mining Putative Regulatory Elements in Gene Promoter Regions


JOURNAL OF COMPUTATIONAL BIOLOGY, Volume 9, Number 4, 2002. Mary Ann Liebert, Inc.

JORNG-TZONG HORNG,1,2 HSIEN-DA HUANG,1 MING-HUI JIN,1 LI-CHENG WU,1 and SHIR-LY HUANG2

1 Department of Computer Science and Information Engineering, National Central University, Taiwan.
2 Department of Life Science, National Central University, Taiwan.

ABSTRACT

At least 43% of the human genome is occupied by repetitive elements. Moreover, around 51% of the rice genome is occupied by repetitive elements. The analysis of repetitive elements suggests that they may have been very important in evolutionary genomics. The first part of this study describes RSDB, a database of repetitive elements. RSDB contains repetitive elements classified into the following categories: exact, tandem, and similar. Interfaces are provided to query and display the results and statistical data, such as the relationship between repetitive elements and genes and the cross-references of repetitive elements among different organisms. The second part of this study mines putative binding sites by examining how combinations of known regulatory sites and overrepresented repetitive elements in RSDB are distributed in the promoter regions of groups of functionally related genes. The overrepresented repetitive elements appearing in the associations are possible transcription factor binding sites. Our proposed approach is applied to Saccharomyces cerevisiae and the promoter regions of yeast ORFs. The complete contents of RSDB and partial putative binding sites are available to the public. Readers may download partial query results.

Key words: database, DNA, data mining, repetitive elements, genes.

1. INTRODUCTION

Li et al. (2001) estimated that at least 43% of the human genome is occupied by four major classes of interspersed repetitive elements, i.e., LINEs, SINEs, LTR elements, and DNA transposons. Their analysis provided some insights into the evolutionary genomics of the human genome. There are many repetitive elements in our genome, and they may have been very important in evolutionary genomics.

RepBase (Jurka, 1998) and STRBase are two representative databases of repeat sequences. RepBase is a collection of repeat sequences, and its goal is to collect, review, and systematically organize representative examples or consensus sequences of repetitive elements from all eukaryotic species. Analysis of the repetitive DNA sequences in that work (Jurka, 1998), combined with experimental research, reveals a history of complex intracellular ecosystems of transposable elements that are inseparably associated with genomic evolution. STRBase stores microsatellites (short tandem repeats, STRs), which contain repeats of 2 to 5 bp (base pairs). These repeats are widespread throughout the human genome and show sufficient variability among individuals in a population that they have become important in several fields, including genetic mapping, linkage analysis, and human identity testing.

Identification of transcriptional regulatory elements within promoter regions is of great interest to biologists because these elements govern the regulation of gene expression. Transcription factors, which are proteins, play a major role in gene regulation in eukaryotic organisms. The factors can bind to specific sites (transcription factor binding sites, or regulatory sites) in the promoter region of particular genes and interact with RNA polymerase and other factors to regulate transcription. The regulatory profiles of known and unknown genes can be determined experimentally at a genomic scale, thanks to new technologies such as DNA microarrays (van Helden, 1998). De Risi et al. (1997) studied the diauxic shift in yeast and found several distinct gene groups by clustering the gene expression profiles. They showed the presence of several regulatory sites in the promoter regions of the respective genes. Van Helden et al. (1998) also studied the dataset constructed by De Risi et al. (1997) and systematically searched the promoter regions of potentially co-regulated genes for overrepresented oligonucleotides, which are possible transcription factor binding sites and may be involved in gene regulation. A systematic analysis of overrepresented sequence patterns in clusters of promoter regions, obtained by clustering the diauxic shift expression profiles, was carried out by Brazma et al. (1997, 1998). They showed that the occurrence of overrepresented patterns in promoter regions for genes from expression profile clusters of at least 25 genes cannot be explained by statistical chance. Many experimentally identified transcription regulatory sites have been collected in TRANSFAC (Heinemeyer, 1998, 1999), which is the most complete and well-maintained database on transcription factors, their genomic binding sites, and DNA-binding profiles. Brazma et al. (1997) developed a general software tool to find and analyze combinations of transcription factor binding sites that occur often in upstream gene regions in the yeast genome. In addition to analyzing the association rules in the combinations, their work compared how often the combinations appear in promoter regions and in random regions.

To handle large amounts of data, data mining plays a prominent role in knowledge extraction. The enormous number of sequenced genomes, gene identification data, gene expression profiles, and genes categorized in functional classes allows the use of computational techniques to investigate transcriptional regulatory elements in gene promoter regions and to decipher the mechanisms of gene transcriptional regulation. Agrawal et al. (1993, 1994) introduced the problem of mining association rules.

The first part of this study describes RSDB, a database of repetitive elements. RSDB contains repetitive elements classified into three categories: exact, tandem, and similar. Interfaces are provided to query and display the results and statistical data, such as the relationship between repetitive elements and genes and the cross-references of repetitive elements among different organisms. The second part of this study first identifies the combinations of known regulatory sites from TRANSFAC and overrepresented repetitive oligomers from RSDB in the promoter regions of a selected set of genes. Association rule mining is then applied to these combinations of overrepresented repetitive elements and known regulatory sites. We prune the association rules statistically by chi-square testing to remove the insignificant ones. The repetitive sequences in the significant rules are candidates for putative regulatory sites.

2. SYSTEM ARCHITECTURE

RSDB is built on a personal computer with two Intel Pentium III 800 processors and 1 GB of physical memory. The operating system is Red Hat Linux 6.2, and the database management system is Oracle.


FIG. 1. The relationship between repeats data.

Table 1. The Amount of Repetitive Sequences in RSDB
The contents of RSDB (24 organisms), length ≥ 7
Identical repeats    Copies of repeats    (Repeat) Sequences
71 million           218 million          59 million

FIG. 2. System architecture of RSDB.

The relationship between the REPEAT, REP_DETAIL, and SEQ tables in RSDB is illustrated in Fig. 1, so that readers can quickly understand what kinds of data are stored in RSDB and how they relate to one another. The amount of repetitive sequences in RSDB is given in Table 1. The details of the organisms are given in Appendix A. The system architecture is shown in Fig. 2, and the main components of RSDB are as follows.

1. Thin client (web browser).
2. Oracle DBMS.
3. Apache web server + PHP scripting engine.
4. Suffix array daemon with the organisms' suffix array data structures.
5. Delta daemon with related hash tables.
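The suffix array daemon listed as component 4 is not described in detail here. As a rough illustration of why a suffix array suits pattern-position queries, the following minimal Python sketch builds a naive suffix array and looks up every occurrence of a pattern by binary search. The function names `build_suffix_array` and `find_positions` are hypothetical and chosen for illustration only; nothing here is taken from the RSDB implementation, and a genome-scale system would need a far more efficient construction.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, sorted lexicographically.

    Naive construction (sorting full suffixes); fine for a short illustration,
    far too slow for genome-sized input.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_positions(text, suffix_array, pattern):
    """Return all 0-based positions where pattern occurs in text."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the k-th smallest suffix.
        start = suffix_array[k]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    genome = "AAATCGAAGGCATCGAAGAG"  # toy sequence, not real data
    sa = build_suffix_array(genome)
    print(find_positions(genome, sa, "ATCGAA"))
```

Because all suffixes that start with the pattern are adjacent in the sorted order, two binary searches suffice to delimit them, which is what makes position lookups fast once the array has been built.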

3. TOOLS, FUNCTIONS, AND STATISTICS ON RSDB

3.1. Tools

RSDB currently provides the following tools:

• Search By Feature
• Search By Range
• Search By Repeats Pattern
• Search By Nonrepeats Pattern
• Search By Accession Number or Sequence ID
• Search By Tandem Repeats
• Personal Query History

The query Search By Feature allows users to view repeats selected by their features. These features are length, copy, AT ratio, way, type, distribution, and scope. With the query Search By Range, users can find repeats in a certain range of a chromosome by specifying the beginning and end positions. Upon receiving this query, RSDB provides the number of identical repeats and the total number of copies located in the specified range of the chromosome. RSDB also shows the statistics for each 4k interval in the specified range. For example, when users want to know the number of repeats within the range from the first bp to the 50,000th bp in the first chromosome of C. elegans, RSDB can also provide the statistics for the ranges 1st to 4,000th, 4,001st to 8,000th, and so on. The query Search By Repeats Pattern searches for patterns that are identical to the sequences of the repeats. The query Search By Accession Number or Sequence ID is used to view or download the flat files of repeats or sequences. That is, users can input either accession numbers or sequence IDs; the display format can also be customized. The query Search By Tandem Repeats allows users to choose the features of tandem repeats and find the matching ones. The functionality of this tool is similar to that of Search By Feature.

3.2. Queries on repetitive elements containing a given sequence

In RSDB, all the tools provided on the web are based on a variety of queries, but not all of these queries are suitable for execution in the DBMS. For example, if users submit a query to the Search By Repeats Pattern tool using a regular expression, such as ATT{3,10}T, to describe the search condition, the DBMS alone cannot handle it. In addition, the repeat AAA is removed if another repeat AAAT is found (that is, the former repeat is a substring of the latter) and AAAT occurs wherever AAA occurs. Because of this definition, users cannot find shorter repeats, such as AAA, in this case. Of course, we could search all the repeat sequences in RSDB for this short pattern, but that would be time consuming. To solve these problems, RSDB immediately returns all the positions on the organism's chromosome where the pattern occurs. Users can then use these positions to query RSDB again and find the relationship between this pattern and other repeats. The processing is shown in Fig. 3. In this example, the user would like to find all repetitive elements in RSDB containing the sequence ATCGAA. RSDB locates the repetitive elements AAATCGAA, ATCGAAGGC, and ATCGAAGAG, which contain the user's query pattern ATCGAA, in a very short time. The user gets the positions immediately after submitting the query and can then see how these positions are related to repeats by clicking Show Related Repeats. As shown in Fig. 4 and Fig. 5, the 25 positions where the pattern occurs are included in 176 repeat copies, with the latter consisting of 172 identical repeats.

3.3. Querying repetitive elements in a given sequence

In addition to searching for a nonrepeat pattern, users may want to input a long sequence to find out how many repeats exist within it. Intuitively, this can be done by searching the whole of RSDB for every subsequence of the given long sequence, but that approach takes a lot of time.


FIG. 3. Processing flow of Search By Nonrepeat Pattern.

Thus, we have designed a mechanism to solve this problem more efficiently. With this mechanism, RSDB provides a tool similar to the well-known BLAST (Altschul, 1990). It allows the user to find how many repetitive elements exist in RSDB for a given sequence within a shorter period of time. A naive approach must search RSDB for every subsequence of the given sequence, which is time consuming; RSDB can answer such queries in a very short time. A user may submit two types of questions. First, a user may want to know whether repetitive elements exist within a given sequence. Our algorithm for finding repetitive elements can answer this type of question. Second, a user may ask how many repetitive elements in RSDB match a given sequence. This question amounts to searching for the repeat patterns that are identical to any subsequence of the given sequence. As shown in Fig. 6, a biologist asks how many repetitive elements can be found in RSDB for the sequence ATCGGA...ATAAC. The biologist only needs to enter the sequence, and RSDB uses this mechanism to return the matching repetitive elements (two entries with CE-prefixed accession numbers in this example) much faster.

FIG. 4. Positions returned in Search By Nonrepeat Pattern.
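The mechanism behind the query of Section 3.3 is not specified in the text. One simple way to avoid scanning every stored repeat against every subsequence is to keep the repeats in a hash table and test only subsequences whose length matches some stored repeat. The Python sketch below illustrates that idea under this assumption; the function name `repeats_in_sequence`, the dictionary layout, and the identifiers `rep1` to `rep3` are hypothetical and not part of RSDB.

```python
def repeats_in_sequence(query, repeat_index):
    """Find stored repeats that occur as exact subsequences of query.

    repeat_index is an assumed structure: a dict mapping each repeat
    sequence to an identifier. Returns (position, repeat, identifier)
    tuples ordered by position and then by repeat length.
    """
    # Only substring lengths that actually occur among the repeats are tried.
    lengths = sorted({len(repeat) for repeat in repeat_index})
    hits = []
    for start in range(len(query)):
        for n in lengths:
            if start + n > len(query):
                break  # longer lengths cannot fit either
            piece = query[start:start + n]
            if piece in repeat_index:  # average O(1) hash lookup
                hits.append((start, piece, repeat_index[piece]))
    return hits


if __name__ == "__main__":
    # Toy data for illustration only.
    index = {"ATCGGAT": "rep1", "GGATAAC": "rep2", "CCCCCCC": "rep3"}
    for hit in repeats_in_sequence("TTATCGGATAACTT", index):
        print(hit)
```

The cost grows with the query length times the number of distinct repeat lengths rather than with the number of stored repeats, which is the property a BLAST-like lookup over a very large repeat collection needs.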


FIG. 5. Relationship between positions and repeats.

FIG. 6. Querying repetitive elements from a given sequence ATCGGA...ATAAC.

3.4. Relationship between repeats and genes

Depending on user requirements, the relationship between repeats and genes (or proteins) can be defined as one of 10 types, i.e., Form 1 to Form 10, and the relationships are illustrated in Fig. 7. With this tool, RSDB shows how many genes and proteins in the organism each repeat relates to, and with which relationship (Forms 1 to 10). One example is given in Fig. 8. The interface that shows the relationship between repetitive elements and genes is given in Figs.

FIG. 7. All forms between genes and repeats.


FIG. 8. Explanation of relationship between genes, proteins, and repeats.

FIG. 9. Web pages for related genes and proteins for repeats (1/2).

FIG. 10. Web pages for related genes and proteins for repeats (2/2).

FIG. 11. Information about related genes and proteins in flat files.

As shown in Figs. 9 to 11, RSDB shows the number of genes and proteins related to a certain repeat and how they correlate with each other. For performance reasons, however, RSDB can find the relationship between genes (or proteins) and repeats only for repeats with fewer than 100 copies. In addition to flat files, and more importantly, RSDB can show users the relationship between genes (or proteins) and repeats, as illustrated in Fig.

3.5. Statistics on RSDB

Table 2 shows partial statistics for the number of identical repetitive elements, the number of copies, and the ratio, for ranges of different lengths. The lengths are divided into five ranges: 7-10, 11-30, 31-100, 101-1,000, and more than 1,000. The definitions of identical repetitive elements and repetitive element copies are given in Fig. 1. The ratio in the third column is similar to the definition of "percent of genome" in Li et al. (2001), which means the percentage of repetitive elements in an organism. The complete statistical data are available from RSDB. Table 3 shows partial results of the cross-references of repetitive elements in different organisms. For example, there is a surprisingly large number of repetitive elements, i.e., 1,287,275, shared by the organisms Caenorhabditis elegans and Homo sapiens. Table 4 gives the ID and species of each organism. We believe these statistics reveal an interesting finding. Table 5 shows the longest length of repetitive elements and the maximum number of copies of repetitive elements in the organisms.

Table 2. Repetitive Elements in the Organisms Listed by Different Length Regions
Columns: Name; then Identical, Copies, and % for each length range
Bacillus subtilis
Haemophilus influenzae Rd
Helicobacter pylori J99
Helicobacter pylori
Mycoplasma genitalium
Mycobacterium tuberculosis H37Rv
Escherichia coli
Aquifex aeolicus

Table 3. Cross-References of Repetitive Elements in Different Organisms

Table 4. The Abbreviation of the Organisms
ID   Name
1    Caenorhabditis elegans
2    Homo sapiens
3    Saccharomyces cerevisiae
4    Bacillus subtilis
5    Haemophilus influenzae Rd
6    Helicobacter pylori J99
7    Helicobacter pylori
     Mycoplasma genitalium

Table 5. The Longest Length and the Maximum Copies of Repetitive Elements
Columns: Name; Longest length; Max copies
Caenorhabditis elegans
Homo sapiens
Saccharomyces cerevisiae
Bacillus subtilis
Haemophilus influenzae Rd
Helicobacter pylori J99
Helicobacter pylori
Mycoplasma genitalium
Mycobacterium tuberculosis H37Rv
Escherichia coli
Aquifex aeolicus
Aeropyrum pernix K

4. APPLICATIONS OF REPETITIVE ELEMENTS IN RSDB

Li et al. (2001) analyzed the draft human genome sequence for data related to evolutionary genomics and revealed new information about repetitive elements (i.e., SINEs and LINEs), domain sharing and conservation, and gene duplication in the human genome. Their analysis has provided some evolutionary insights into the human genome. There are many repetitive elements in the complete genomes stored in RSDB, and they are likely to be very important features in evolutionary genomics. One application (Horng, 2001a) attempts to discover putative regulatory elements by investigating how combinations of known regulatory sites and overrepresented repetitive elements are distributed in the promoter regions of groups of functionally related genes. We present this application in detail in the next section.

We experimented on the C. elegans sequences from Sanger and NCBI to find the longest repetitive element. Theoretically, the longest repetitive element found in the two copies should be identical if the sequences at Sanger and NCBI are assembled correctly. However, we found that this was not the case, and that the difference between them was large. The sequence at Sanger had been corrected when we accessed it again. Similarly, we observed that the C. elegans sequence differs across releases from different years, including 1998, 1999, and 2000. The longer repetitive elements disappeared year by year. This suggests that finding many long repetitive elements in a sequence may indicate errors in fragment assembly.

Another application of the repetitive elements in RSDB is the design of primers and markers to extract fragment sequences from sequenced genomes. The basic idea is that primer and marker sequences must be nonrepetitive, in contrast to the repetitive sequences in genomes. To design primers and markers in a genome, all repetitive elements are first marked in the genome; the regions that are not covered by any repetitive element are nonrepetitive sequences and are candidates for primers or markers. In addition, a pair of candidate primer or marker sequences extracted from those nonrepetitive regions can be used to tailor the different lengths of target sequences such as genes, STRs, SINEs, LINEs, and so on.
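The primer and marker idea above, marking every repetitive element and keeping the uncovered regions, amounts to computing the gaps between a set of intervals. The following Python sketch shows one way to do this, assuming repeats are given as half-open (start, end) coordinate pairs; the function name `nonrepetitive_regions`, the coordinate convention, and the `min_length` parameter are assumptions made for illustration, not details from RSDB.

```python
def nonrepetitive_regions(genome_length, repeat_intervals, min_length=1):
    """Return regions of the genome not covered by any repeat.

    repeat_intervals: iterable of half-open (start, end) pairs, possibly
    overlapping and in any order (assumed representation).
    min_length: shortest uncovered region worth reporting, e.g. the
    desired primer length (hypothetical parameter).
    """
    regions = []
    cursor = 0  # end of the repeat-covered prefix seen so far
    for start, end in sorted(repeat_intervals):
        if start - cursor >= min_length:
            regions.append((cursor, start))  # gap before this repeat
        cursor = max(cursor, end)  # overlapping repeats extend the cover
    if genome_length - cursor >= min_length:
        regions.append((cursor, genome_length))  # tail after the last repeat
    return regions


if __name__ == "__main__":
    # Toy coordinates for illustration only.
    repeats = [(10, 20), (15, 30), (50, 60)]
    print(nonrepetitive_regions(100, repeats, min_length=18))
```

Sorting the intervals first lets a single pass merge overlapping repeats implicitly, so each reported region is guaranteed to be free of every marked repeat.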
5. MINING PUTATIVE BINDING SITES USING REPETITIVE ELEMENTS IN RSDB

Approach

The proposed approach is as follows. We first preprocess the data to find the combinations of known regulatory sites and overrepresented repetitive oligomers located in the promoter regions of groups of functionally related genes. Next, Apriori and AprioriTid (Agrawal, 1993, 1994) are applied to mine association rules that combine the known regulatory sites and the overrepresented repetitive oligomers. A chi-square test is then used to select the significant rules. Finally, the overrepresented repetitive oligomers in the selected association rules are taken as putative regulatory elements.

Materials

Before the associations of known regulatory sites and overrepresented repetitive sequences located in promoter regions are analyzed, the whole sequence of the target genome and the gene annotations are obtained from NCBI. Experimentally identified transcription factor binding sites are obtained from TRANSFAC. The TRANSFAC database (Heinemeyer, 1998, release 4.0) contains 4,965 site sequences and 2,837 factor entries. Most sites are also consensus patterns. The data in TRANSFAC have the following features. A transcription factor binding site accession number may have different consensus sequences. Different binding site accession numbers may have the same consensus sequence. Wildcard characters such as M or W used in TRANSFAC allow one sequence to cover several other sequences. Small consensus sequences may appear within larger ones. The repetitive sequences of the target genome, i.e., yeast, are obtained from the repetitive sequence database (RSDB) (Horng, 2001b). In MIPS, 6,350 yeast genes and ORFs are documented (Mewes, 1999), and 3,529 genes are classified into at least one functional catalogue.

Statistical analysis of overrepresented repetitive oligonucleotides

Nucleotide succession is not random, and some oligonucleotides are clearly overrepresented, notably the poly(A), poly(T), and poly(AT) chains.
An additional bias results from the fact that oligonucleotides are represented differently in coding regions and in noncoding sequences (van Helden et al., 1998). Thus, a specific expected frequency has to be used for each oligonucleotide sequence. Van Helden et al. (1998) proposed a statistical method that uses the binomial formula to estimate the probability of observing exactly n occurrences of the oligonucleotide b within the promoter regions of a gene family. The oligomers with the highest significance coefficients are the most overrepresented. The advantage of the significance coefficient is that its threshold can be selected, and its values interpreted, independently of oligonucleotide size, upstream sequence size, and the number of genes within the family. The overrepresented repetitive sequences of yeast are obtained by applying the statistical method of van Helden et al. (1998). We set the threshold to 0. Repetitive sequences are selected as significantly overrepresented repetitive oligomers if their coefficients are larger than the threshold.

Preprocessing and mapping

The transcription factor binding sites in TRANSFAC and the repetitive elements in RSDB are first prepared. The known regulatory sites and repetitive sequences are then located in the promoter regions of groups of functionally related genes. Combinations of the known regulatory sites and overrepresented repetitive sequences located within the promoter regions are found. For more details about the preprocessing, the reader may refer to Horng (2001a). In the following, we describe how to mine associations from the combinations of transcription factor binding sites and overrepresented repetitive sequences that were found.

Consider a large database of transactions, where each transaction consists of a set of items. An association rule is an expression such as $A \Rightarrow B$, where $A$ and $B$ are sets of items. The meaning of an association rule is that a transaction in the database that contains $A$ also tends to contain $B$. For example, 90% of the people who purchase beer also purchase diapers. Here, 90% is called the confidence of the rule. The support of the rule $A \Rightarrow B$ is the percentage of transactions that contain both $A$ and $B$. The formal statement of the problem is given below. Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of sites, called the item set.
Let $D$ be a set of promoter regions, where each promoter region $P$ corresponds to a transaction and contains a set of items such that $P \subseteq I$. Let $S = \{s_1, s_2, \ldots, s_m\}$ be a set of transcription factor binding sites in TRANSFAC and $R = \{r_1, r_2, \ldots, r_n\}$ be a set of overrepresented repetitive sequences from RSDB. The union of the sets $S$ and $R$ is the item set $I$. Let $G = \{g_1, g_2, \ldots, g_m\}$ be a group of functionally related genes. The promoter region of each gene corresponds to a transaction and contains a set of transcription factor binding sites and overrepresented repeats, which are its items. A promoter region $P$ contains $A$, a set of items of $I$, if $A \subseteq P$. An association rule is an implication of the form $A \Rightarrow B$, where $A \subset I$, $B \subset I$, and $A \cap B = \emptyset$. The rule $A \Rightarrow B$ holds in the set of promoter regions $D$ with confidence $c$ if $c\%$ of the promoter regions in $D$ that contain $A$ also contain $B$. The rule $A \Rightarrow B$ has support $s$ in $D$ if $s\%$ of the promoter regions in $D$ contain $A \cup B$. In our experiments, the minimum support is set to 10%. An association rule is generated if its support and confidence exceed the thresholds specified by the user. Apriori and AprioriTid (Agrawal, 1993, 1994) are then applied to mine the association rules.

Results

We apply the proposed method to Saccharomyces cerevisiae. The genome sequence and gene information are obtained from GenBank. The functional catalogues of yeast ORFs are collected in MIPS (Mewes, 1999). The known regulatory sites are obtained from TRANSFAC (Heinemeyer, 1998), and the repetitive sequences of the yeast genome are obtained from RSDB (Horng, 2001b). Table 6 shows the numbers of known regulatory sites and overrepresented repetitive elements in the promoter regions related to the genes of each functional catalogue. Table 7 shows the associations mined by our proposed approach. The description of the functional catalogues and the ORFs in these catalogues is given in Appendixes B and C.
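To make the definitions of support and confidence in the formal statement concrete, the Python sketch below computes both for a single candidate rule over a list of promoter-region item sets, together with a chi-square statistic. The text does not say how the chi-square test is set up, so the 2x2 test of independence between antecedent and consequent used here is an assumption, as are the function name `rule_statistics` and the toy transactions. This is not an implementation of Apriori or AprioriTid, which additionally enumerate the candidate item sets.

```python
def rule_statistics(transactions, antecedent, consequent):
    """Return (support, confidence, chi_square) for the rule A => B.

    transactions: list of sets of items, one set per promoter region.
    The chi-square statistic comes from a 2x2 contingency table of
    "contains A" versus "contains B" (assumed formulation).
    """
    a, b = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)
    n_b = sum(1 for t in transactions if b <= t)
    n_ab = sum(1 for t in transactions if (a | b) <= t)

    support = n_ab / n
    confidence = n_ab / n_a if n_a else 0.0

    # Observed counts: rows = A present/absent, columns = B present/absent.
    observed = [[n_ab, n_a - n_ab],
                [n_b - n_ab, n - n_a - n_b + n_ab]]
    row_totals = [n_a, n - n_a]
    col_totals = [n_b, n - n_b]

    chi_square = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi_square += (observed[i][j] - expected) ** 2 / expected
    return support, confidence, chi_square


if __name__ == "__main__":
    # Toy promoter regions; uppercase = known site, lowercase = repeat.
    promoters = [{"CTATC", "atgcaa"}, {"CTATC", "atgcaa"},
                 {"CTATC"}, {"GCAAT"}]
    print(rule_statistics(promoters, {"CTATC"}, {"atgcaa"}))
```

A rule would be kept only if its support and confidence exceed the chosen minimums and its chi-square value exceeds the critical value for the chosen significance level.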
The first column in Table 7 is the functional catalogue from MIPS; the second is the numerical format of the functional category; the third is the number of genes in the corresponding family; the fourth is the number of associations discovered by data mining; the fifth is the maximum number of items in a transaction; the sixth is the average number of items in a transaction; and the last is the number of significant associations after applying the chi-square test. The minimum support and confidence are set to 60%. The notation of the associations is given in Fig. 12. Some interesting and significant association rules in each functional catalogue are shown in Table 8. The first column in Table 8 is the functional catalogue from MIPS; the second is the association, with known regulatory sites in uppercase and overrepresented repetitive oligomers in lowercase; the third is the confidence of the association; the fourth is the support value; the fifth is the chi-square value; and the last is the same as the second except that transcription factor names are used.

Table 6. The Amounts of Known Regulatory Sites and Over-Represented Repetitive Elements Related to Groups of Functionally Related Genes
Columns: Functional catalogues in MIPS; Functional catalogues in numeric; No. of related over-represented repetitive elements; No. of related known regulatory sites in TRANSFAC
Glycolysis and gluconeogenesis
DNA synthesis and replication
Ribosomal RNA synthesis
tRNA synthesis
Ion transporters
Transport ATPases
ABC transporters
Drug transporters

Table 7. The Mined Association Rules After Pruning by Chi-Square
Columns: Functional catalogues in MIPS; Functional catalogues in numeric; No. of genes; No. of associations before pruning; Max no. of items; Average no. of items; No. of significant association rules
Glycolysis and gluconeogenesis
DNA synthesis and replication
Ribosomal RNA synthesis
tRNA synthesis
Ion transporters
Transport ATPases
ABC transporters
Drug transporters

FIG. 12. An illustrative example of prediction of putative regulatory sites.

Table 8. Partial Significant Associations Mined in Each Functional Catalogue
Columns: Associations | Associations (transcription factor names)

Glycolysis and gluconeogenesis
AAGGAA => acatat | c-ets-2 => acatat
CTATC => atgcaa | NF-E => atgcaa
atggaa, CTAAT, TCTCC, TTCAAA => aaaaac | atggaa, unknown, ADR1, TFIID => aaaaac
AAGGAA, atggaa, TGGCA, TTCAAA => taagaa | c-ets-2, atggaa, NF-1/L, TFIID => taagaa

DNA synthesis and replication
GCAAT => attgca | unknown => attgca
GCAAT => attatg | unknown => attatg
CTAAT => attcac | unknown => attcac
CTAAT => tgtgaa | unknown => tgtgaa

tRNA synthesis
AGAGG, cttcta => CTGTC | unknown, cttcta => NF-E
CATCC, TATAT, TGGCA => aaatta | GCR1, unknown, NF-1/L => aaatta
AAGGAA, AGAGG, ATTGG, CCAAT, cttcta, GATAA, NGGRGK, TATAT, TTATC => TTCCTT | c-ets-2, unknown, unknown, SRF, cttcta, GAL4, NGGRGK, unknown, DBF-A => c-ets-2
aaaatt, AAGGAA, ATTGG, CATCC, CCAAT, CTATC, NGGRGK, TATAT, ttaaaa => TTCCTT | aaaatt, c-ets-2, unknown, GCR1, SRF, NF-E, NGGRGK, unknown, ttaaaa => c-ets-2

Table 9. The Occurrences of the Association (NF-E => atgcaa) in the Functional Catalogue Glycolysis and Gluconeogenesis
ORF       Association (NF-E => atgcaa)
YBR218c   [-896]-CTATC-[347]-CTATC-[122]-atgcaa-[337]-atgcaa
YBR221c   [-720]-atgcaa-[100]-CTATC-[44]-CTATC-[281]-atgcaa-[29]-CTATC
YCR012w   [-678]-CTATC-[535]-atgcaa-[16]-atgcaa
YDR050c   [-789]-atgcaa-[598]-CTATC
YER178w   [-487]-CTATC-[15]-atgcaa
YGL253w   [-772]-atgcaa-[586]-CTATC-[97]-CTATC
YGR192c   [-751]-CTATC-[53]-atgcaa-[603]-atgcaa-[24]-atgcaa
YGR240c   [-492]-atgcaa-[84]-atgcaa-[208]-CTATC
YGR254w   [-474]-CTATC-[137]-atgcaa
YHR174w   [-777]-CTATC-[166]-CTATC-[226]-atgcaa-[132]-CTATC
YJL052w   [-670]-CTATC-[40]-CTATC-[536]-atgcaa
YJR009c   [-895]-CTATC-[75]-CTATC-[391]-atgcaa
YKL060c   [-239]-atgcaa-[98]-CTATC
YKL152c   [-910]-CTATC-[385]-CTATC-[100]-atgcaa-[94]-CTATC
YKR043c   [-970]-CTATC-[191]-atgcaa-[10]-CTATC-[15]-CTATC
YLR345w   [-663]-CTATC-[545]-CTATC-[68]-atgcaa
YMR280c   [-498]-CTATC-[320]-CTATC-[93]-atgcaa-[38]-CTATC
YMR323w   [-759]-CTATC-[52]-atgcaa-[291]-atgcaa-[282]-atgcaa
YOL056w   [-768]-atgcaa-[16]-CTATC-[149]-atgcaa-[54]-CTATC-[168]-CTATC
YOR283w   [-868]-CTATC-[56]-atgcaa
YOR393w   [-767]-CTATC-[52]-atgcaa
YPL281c   [-292]-atgcaa-[53]-CTATC

The putative regulatory elements of the functional catalogue Glycolysis and gluconeogenesis mined by our proposed method are listed in Appendix D. Table 9 shows an example of the occurrences of the association [NF-E => atgcaa] in the functional catalogue Glycolysis and gluconeogenesis. The first column in Table 9 gives the name of the yeast ORF; the second gives the positions of known regulatory sites or putative regulatory sites in the promoter region.
For example, in the first row of Table 9, YBR218c is the ORF name, and [-896]-CTATC-[347]-CTATC-[122]-atgcaa-[337]-atgcaa is the composition of the association. The first number, [-896], denotes the offset of the site CTATC from the start position of the coding region, and the distance between the first and second occurrences is [347] nucleotides. Table 10 shows an example of the occurrences of the association [unknown, cttcta => NF-E] in the functional catalogue tRNA synthesis.

Observations

We further observe the repetitive property of the transcription factor binding sites in TRANSFAC (Heinemeyer, 1998) for the yeast genome. Table 11 shows the occurrences of some of the transcription factor binding sites in TRANSFAC (Heinemeyer, 1998) in the yeast genome. For example, the first site, TATATAAT, appears 1,691 times in the yeast genome. The site appears 418 times in the upstream regions of genes and 656 times within genes. We further find that 679 genes contain the site; that is, the site appears in the upstream region, in the gene, or in both. Table 11 suggests that the repetitive elements in RSDB are candidates for putative regulatory elements, and the data in Table 11 seem to indicate that our approach is promising.

6. CONCLUSIONS

Around 43% of the human genome is occupied by repetitive elements. Moreover, around 51% of the rice genome is occupied by repetitive elements. The analysis of repetitive elements suggests that they may have been very important in evolutionary genomics.

Table 10. The Occurrences of the Association (Unknown(a), cttcta, and NF-E) in the Functional Catalogue tRNA Synthesis

ORFs      Associations (unknown, cttcta, and NF-E)
YBR123c   [-906]-CTGTC-[61]-AGAGG-[740]-cttcta-[21]-AGAGG-[68]-CTGTC
YDL150w   [-581]-AGAGG-[6]-AGAGG-[81]-cttcta-[32]-CTGTC-[43]-cttcta-[58]-AGAGG-[257]-AGAGG-[4]-CTGTC
YDR362c   [-766]-cttcta-[19]-CTGTC-[37]-CTGTC-[7]-AGAGG-[49]-CTGTC-[422]-cttcta-[47]-AGAGG-[142]-CTGTC
YER148w   [-841]-CTGTC-[55]-cttcta-[374]-AGAGG
YGL019w   [-970]-CTGTC-[20]-AGAGG-[200]-AGAGG-[507]-CTGTC-[207]-cttcta
YHR143w   [-985]-AGAGG-[15]-AGAGG-[3]-cttcta-[54]-cttcta-[78]-CTGTC-[175]-CTGTC-[148]-AGAGG
YKL144c   [-613]-AGAGG-[21]-CTGTC-[308]-AGAGG-[57]-AGAGG-[35]-AGAGG-[17]-cttcta
YNL039w   [-952]-CTGTC-[269]-AGAGG-[113]-cttcta-[388]-AGAGG-[95]-AGAGG-[47]-cttcta
YNR003c   [-493]-cttcta-[123]-AGAGG-[43]-cttcta-[166]-CTGTC
YOR110w   [-890]-CTGTC-[59]-CTGTC-[279]-cttcta-[148]-AGAGG
YOR116c   [-856]-CTGTC-[124]-AGAGG-[151]-cttcta-[423]-AGAGG
YOR210w   [-799]-AGAGG-[136]-CTGTC-[117]-cttcta
YPR186c   [-757]-AGAGG-[26]-CTGTC-[81]-AGAGG-[296]-cttcta-[106]-AGAGG-[15]-CTGTC

(a) The patterns are found in TRANSFAC with unknown name.

The contributions of the first part of this study are briefly stated below. We develop a database for repetitive elements and provide users with not only query tools but also useful and possibly significant statistics. Some of the statistics relate to the entire data in RSDB, and some relate specifically to the data that match users' queries. We hope that these statistics can help biologists discover a whole new world in biology some day. RSDB not only stores data on repeats but also integrates some gene data into the system, which makes RSDB more comprehensive. RSDB also provides useful tools such as BLAST to query repetitive elements in RSDB from a given sequence or to find repetitive elements containing a given sequence.

Table 11. The Occurrences of Partial Transcription Factor Binding Sites in TRANSFAC (Heinemeyer, 1998) in Yeast Genome

                                                  Occurrences (times)           No. of genes
Site in     Accession  Site ID in                 Genome  Upstreams  Genes      containing
TRANSFAC    number     TRANSFAC                                                 the site
TATATAAT    R01732     AD$E1B_
AGAACA      R00973     CHICK$LYS_
TATAAAA     R00047     DROME$AC5C_
AAGGAA      R04338     HS$CDC2_
AAGGAA      R04338     HS$CDC2_
AAGTGA      R00917     HS$IFNB_
GGCGCG      R01772     HSV1$TK_
TAAAAAA     R04010     MOUSE$ADA_
TTCCTC      R04413     MOUSE$FCGR3A_
GCCAAT      R03384     MOUSE$JUND_
TATAAGA     R03388     MOUSE$JUND_
TTCAAA      R02749     MOUSE$MBP_
TTTAAA      R01598     MOUSE$WAP_
ATAAATA     R01371     SV$SV40_
TTTTTTG     R01566     VIV$VISNA_
TTTTTTG     R01566     VIV$VISNA_
AAGTACAT    R01361     Y$STE2_
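Counts of the kind reported in Table 11 (occurrences in the genome, in upstream regions, in genes, and the number of genes containing the site) can be gathered with a simple scan. The sketch below is illustrative only and is not the RSDB implementation. It assumes a single forward-strand sequence, gene coordinates given as 0-based half-open intervals, and an upstream window whose length is a free parameter (the default of 1,000 nt is our placeholder, not a value stated here). The names find_all and count_site are hypothetical.

```python
def find_all(sequence, site):
    """Return the start index of every occurrence, overlapping ones included."""
    hits = []
    index = sequence.find(site)
    while index != -1:
        hits.append(index)
        index = sequence.find(site, index + 1)
    return hits


def count_site(sequence, site, genes, upstream_len=1000):
    """Count a site genome-wide, in upstream windows, and inside genes.

    genes: list of (name, start, end) tuples, 0-based half-open,
           forward strand only (ASSUMPTION made for this sketch).
    upstream_len: size of the upstream window (ASSUMPTION, placeholder).
    """
    hits = find_all(sequence.upper(), site.upper())
    in_upstream = 0
    in_gene = 0
    genes_with_site = set()
    for start in hits:
        end = start + len(site)
        hit_upstream = False
        hit_gene = False
        for name, gene_start, gene_end in genes:
            window_start = max(0, gene_start - upstream_len)
            if window_start <= start and end <= gene_start:
                hit_upstream = True
                genes_with_site.add(name)
            if gene_start <= start and end <= gene_end:
                hit_gene = True
                genes_with_site.add(name)
        in_upstream += int(hit_upstream)
        in_gene += int(hit_gene)
    return {
        "genome": len(hits),
        "upstreams": in_upstream,
        "genes": in_gene,
        "genes_with_site": len(genes_with_site),
    }
```

A gene is counted as containing the site when the site falls in its upstream window, in the gene body, or in both, which mirrors the reading given in the Observations paragraph.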

The contribution of the second part of this study is to find combinations of known regulatory sites and overrepresented repetitive oligomers located within the promoter regions of groups of functionally related genes. Data mining techniques are then applied to mine the associations. Our proposed approach can mine putative regulatory elements of any complete genome, such as the yeast genome used in this study. The parameters for identifying overrepresented repetitive sequences within the promoter regions of genes can be specified by users according to their needs. The discovered associations of known and putative regulatory elements can also provide useful information to researchers studying the mechanisms of gene transcriptional regulation.

APPENDIX A: THE AMOUNT OF REPETITIVE ELEMENTS IN THE ORGANISMS

Organism Id  Organism name                              Identical repeat  Total copies
1            Caenorhabditis elegans
             Homo sapiens
             Saccharomyces cerevisiae
             Bacillus subtilis
             Haemophilus influenzae Rd
             Helicobacter pylori J
             Helicobacter pylori
             Mycoplasma genitalium
             Mycobacterium tuberculosis H37Rv
             Escherichia coli
             Aquifex aeolicus
             Aeropyrum pernix K
             Archaeoglobus fulgidus
             Chlamydia pneumoniae AR
             Chlamydia trachomatis
             Mycoplasma pneumoniae M
             Pyrococcus horikoshii OT
             Rickettsia prowazekii strain Madrid E
             Synechocystis PCC
             Thermotoga maritima
             Treponema pallidum subsp. pallidum
             Ureaplasma urealyticum
             Pyrococcus abyssi
             Arabidopsis thaliana

APPENDIX B: THE FUNCTIONAL CATALOGUES AND THEIR ORFS STUDIED IN THIS WORK (FIRST GROUP)

Functional catalogues    Numerical in MIPS format    ORFs

Glycolysis and gluconeogenesis
YAL038w, YBR196c, YBR218c, YBR221c, YCL040w, YCR012w, YDL021w, YDR050c, YER065c, YER178w, YGL062w, YGL253w, YGR192c, YGR193c, YGR240c, YGR254w, YHR174w, YJL052w, YJR009c, YKL060c, YKL152c, YKR043c, YKR097w, YLR345w, YLR377c, YMR205c, YMR280c, YMR323w, YNL071w, YOL056w, YOR283w, YOR347c, YOR393w, YPL281c

DNA synthesis and replication
YAL040c, YAR007c, YBL023c, YBL035c, YBR060c, YBR087w, YBR088c, YBR160w, YBR195c, YBR202w, YBR278w, YCR028c-a, YCR077c, YDL017w, YDL102w, YDL164c, YDR052c, YDR054c, YDR068w, YDR110w, YDR206w, YDR364c, YDR499w, YEL032w, YER070w, YER176w, YFL036w, YFR027w, YFR028c, YGL058w, YGL201c, YGR109c, YGR132c, YGR180c, YGR231c, YHR118c, YHR164c, YIL036w, YIL066c, YIL139c, YIL150c, YIR008c, YJL026w, YJL065c, YJL090c, YJL173c, YJL194w, YJR006w, YJR043c, YJR046w, YJR068w, YKL045w, YKL108w, YKL112w, YKL113c, YLL004w, YLR103c, YLR182w, YLR233c, YLR274w, YLR369w, YML058w, YML065w, YML102w, YMR001c, YMR072w, YMR241w, YMR284w, YNL088w, YNL102w, YNL213c, YNL218w, YNL261w, YNL262w, YNL290w, YNL312w, YOL006c, YOL094c, YOL095c, YOL115w, YOR217w, YOR330c, YPL167c, YPL256c, YPR018w, YPR019w, YPR120c, YPR135w, YPR162c, YPR175w

rRNA synthesis
YAL001c, YBL014c, YBL025w, YBR049c, YBR123c, YBR154c, YDL150w, YDR156w, YDR362c, YER148w, YGR047c, YGR246c, YHR143wa, YJL025w, YJL148w, YJR063w, YKL125w, YKL144c, YLR039c, YLR141w, YML043c, YMR270c, YNL039w, YNL113w, YNL151c, YNL248c, YNR003c, YOR110w, YOR116c, YOR207c, YOR210w, YOR224c, YOR340c, YOR341w, YPR010c, YPR110c, YPR186c, YPR187w, YPR190c

tRNA synthesis
YAL001c, YBR123c, YBR154c, YDL150w, YDR362c, YER148w, YGL019w, YGR047c, YGR246c, YHR143wa, YKL144c, YNL039w, YNL113w, YNL151c, YNR003c, YOR110w, YOR116c, YOR207c, YOR210w, YOR224c, YPR110c, YPR186c, YPR187w, YPR190c

APPENDIX C: THE FUNCTIONAL CATALOGUES AND THEIR ORFS STUDIED IN THIS WORK (SECOND GROUP)

Functional catalogues    Numerical in MIPS format    ORFs

Ion transporters
YAL026c, YBR127c, YBR207w, YBR235w, YBR291c, YBR294w, YBR295w, YBR296c, YCR024c-a, YCR037c, YDL128w, YDL185w, YDR011w, YDR270w, YDR456w, YEL017c-a, YEL027w, YEL031w, YEL051w, YEL065w, YER053c, YER145c, YFL041w, YFL050c, YGL006w, YGL008c, YGL167c, YGL255w, YGR020c, YGR065c, YGR121c, YGR191w, YHL016c, YHR026w, YHR039c-a, YHR175w, YJL094c, YJL117w, YJL129c, YJL198w, YJR040w, YJR077c, YKL080w, YKL120w, YKR050w, YLR092w, YLR130c, YLR138w, YLR348c, YLR411w, YLR447c, YML123c, YMR054w, YMR058w, YMR243c, YMR319c, YNL142w, YNL259c, YNR013c, YNR039c, YOL122c, YOL130w, YOR153w, YOR270c, YOR316c, YOR332w, YPL036w, YPL176c, YPL224c, YPL234c, YPR003c, YPR036w, YPR124w, YPR138c, YPR201w

Transport ATPases
Q0080, Q0085, Q0130, YAL026c, YBL099w, YBR039w, YBR127c, YBR295w, YCR024c-a, YDL004w, YDL185w, YDR038c, YDR039c, YDR040c, YDR093w, YDR270w, YDR298c, YEL017c-a, YEL027w, YEL031w, YEL051w, YER166w, YGL006w, YGL008c, YGL167c, YGR020c, YHR026w, YHR039c-a, YIL048w, YJR121w, YKL016c, YKL080w, YLR295c, YLR447c, YMR054w, YMR162c, YOR270c, YOR332w, YPL036w, YPL078c, YPL234c, YPL271w, YPR036w

ABC transporters
YCR011c, YDR011w, YDR091c, YDR135c, YDR406w, YER036c, YFR009w, YGR281w, YHL035c, YIL013c, YKL188c, YKL209c, YKR103w, YKR104w, YLL015w, YLL048c, YLR188w, YMR301c, YNL014w, YNR070w, YOL075c, YOR011w, YOR153w, YOR328w, YPL058c, YPL147w, YPL226w, YPL270w

Drug transporters
YBR008c, YBR043c, YBR052c, YBR180w, YBR293w, YCL069w, YCL073c, YDR119w, YEL065w, YGR138c, YGR197c, YGR224w, YGR281w, YHL040c, YHL047c, YHR032w, YHR048w, YIL013c, YIL048w, YIL120w, YIL121w, YKR105c, YKR106w, YLL028w, YML116w, YMR088c, YMR123w, YMR279c, YNL065w, YNR055c, YOL158c, YOR273c, YOR378w, YPR156c, YPR198w

APPENDIX D: THE ASSOCIATION OF REGULATORY SITES MINED IN THE CATALOGUE OF GLYCOLYSIS AND GLUCONEOGENESIS

Functional classification catalogue: Glycolysis and gluconeogenesis

ORFs  Associations                              Conf.  Sup.  χ²  Associations
      AAGGAA ⇒ acatat                                            c-ets-2 ⇒ acatat
      CTATC ⇒ atgcaa                                             NF-E ⇒ atgcaa
      AAGGAA, CCAAT ⇒ acaaaa                                     c-ets-2, srf ⇒ acaaaa
      AGAGG, AGGAAA ⇒ aaaatc                                     unknown, pea3 ⇒ aaaatc
      AGAGG, GTCAC, TGACG ⇒ CTATC                                unknown, unknown, CREB ⇒ NF-E
      AGAGG, TGACG ⇒ CTATC                                       unknown, CREB ⇒ NF-E
      AGAGG, ggaaaa ⇒ aaaatc                                     unknown, ggaaaa ⇒ aaaatc
      ATGAAAA ⇒ aaaaac                                           Pit-1a ⇒ aaaaac
      atgcaa ⇒ CTATC                                             atgcaa ⇒ NF-E
      atgcaa ⇒ GCAAT                                             atgcaa ⇒ unknown
      atggaa, CTAAT, TCTCC, TTCAAA ⇒ aaaaac                      atggaa, unknown, ADR1, TFIID ⇒ aaaaac
      TGGCA ⇒ atggaa                                             NF-1/L ⇒ atggaa
      CTAAT ⇒ TAAAAT                                             unknown ⇒ Pit-1a/F2F
      CTAAT ⇒ TCTCC                                              unknown ⇒ ADR1
      AAGGAA, TGGCA, TTCAAA ⇒ taagaa                             c-ets-2, NF-1/L, TFIID ⇒ taagaa
      AAGGAA, acaaaa ⇒ CCAAT                                     c-ets-2, acaaaa ⇒ SRF
      AAGGAA, atggaa, TGGCA, TTCAAA ⇒ taagaa                     c-ets-2, atggaa, NF-1/L, TFIID ⇒ taagaa
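The Conf., Sup., and χ² columns of Appendix D can be illustrated with a small calculation. The sketch below is not the authors' procedure; their exact definitions are given earlier in the paper and are not restated here. It assumes the standard association-rule definitions (support is the fraction of promoters containing both sides of the rule; confidence is that count divided by the number of promoters containing the left side) and a Pearson chi-square statistic on the 2×2 presence/absence table without continuity correction. The function name rule_metrics and the ORF labels in the example are hypothetical.

```python
def rule_metrics(promoters, antecedent, consequent):
    """Return (support, confidence, chi_square) for antecedent => consequent.

    promoters:  dict mapping an ORF name to the set of sites found in its
                promoter region.
    antecedent: set of sites on the left-hand side of the rule.
    consequent: single site on the right-hand side.

    ASSUMPTION: standard support/confidence definitions and an uncorrected
    Pearson chi-square on the 2x2 table; the paper may define them differently.
    """
    n = len(promoters)
    both = left_only = right_only = neither = 0
    for sites in promoters.values():
        has_left = antecedent <= sites      # every antecedent site is present
        has_right = consequent in sites
        if has_left and has_right:
            both += 1
        elif has_left:
            left_only += 1
        elif has_right:
            right_only += 1
        else:
            neither += 1

    support = both / n if n else 0.0
    left_total = both + left_only
    confidence = both / left_total if left_total else 0.0

    # Pearson chi-square for a 2x2 table: n(ad - bc)^2 / (row and column totals).
    denominator = (
        (both + left_only) * (right_only + neither)
        * (both + right_only) * (left_only + neither)
    )
    if denominator:
        chi_square = n * (both * neither - left_only * right_only) ** 2 / denominator
    else:
        chi_square = 0.0
    return support, confidence, chi_square


# Toy data with made-up ORF labels; the sites are taken from Table 9.
toy = {
    "orf1": {"CTATC", "atgcaa"},
    "orf2": {"CTATC", "atgcaa"},
    "orf3": {"CTATC"},
    "orf4": set(),
}
print(rule_metrics(toy, {"CTATC"}, "atgcaa"))
```

With a multi-site left-hand side such as {AGAGG, TGACG}, the same function applies unchanged, because the antecedent is tested as a subset of each promoter's site set.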

ACKNOWLEDGMENTS

The authors would like to thank the National Science Council of the Republic of China and the Asia Bioinnovation Corporation for financial support of this research. We thank Prof. Ueng-Cheng Yang and Dr. Yu-Chung Chang for their valuable discussions regarding molecular biology, and Prof. Cheng-Yan Kao for his suggestions regarding our database.

REFERENCES

Agrawal, R., Imielinski, T., and Swami, A. Mining associations between sets of items in large databases. Proc. ACM SIGMOD Int. Conf. Management of Data, Washington D.C.,
Agrawal, R., and Srikant, R. Fast algorithms for mining association rules. Proc. 20th Int. Conf. Very Large Databases, Santiago, Chile, Sept. Expanded version available as IBM Research Report RJ9839,
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 215,
Benson, D.A., Boguski, M.S., Lipman, D.J., Ostell, J., Ouellette, B.F., Rapp, B.A., and Wheeler, D.L. GenBank. Nucl. Acids Res. 27,
Brazma, A., Jonassen, I., Vilo, J., and Ukkonen, E. Predicting gene regulatory elements in silico on a genomic scale. Genome Res. 8,
Brazma, A., Vilo, J., Ukkonen, E., and Valtonen, K. Data mining for regulatory elements in yeast genome. Proc. 5th Int. Conf. Intelligent Systems for Molecular Biology. AAAI Press,
De Risi, J.L., Iyer, V.R., and Brown, P.O. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278,
Heinemeyer, T., Chen, X., Karas, H., Kel, A.E., Kel, O.V., Liebich, I., Meinhardt, T., Reuter, I., Schacherer, F., and Wingender, E. Expanding the TRANSFAC database towards an expert system of regulatory molecular mechanisms. Nucl. Acids Res. 27,
Heinemeyer, T., Wingender, E., Reuter, I., Hermjakob, H., Kel, A.E., Kel, O.V., Ignatieva, E.V., Ananko, E.A., Podkolodnaya, O.A., Kolpakov, F.A., Podkolodny, N.L., and Kolchanov, N.A. Databases on transcriptional regulation: TRANSFAC, TRRD and COMPEL. Nucl. Acids Res. 26,
Horng, J.T., and Cho, W.F. Predicting regulatory elements in repetitive sequence using transcription factor binding sites. Electronic J. Biotechnology 3, 3, Dec. 15.
Horng, J.T., Huang, H.D., Huang, C.C., and Kao, C.Y. 2001a. Mining putative regulatory elements in gene promoter regions. Proc. German Conference on Bioinformatics
Horng, J.T., Lin, J.H., and Kao, C.Y. 2001b. RSDB-A database of repetitive elements in complete genomes. Proc. Atlantic Symposium on Computational Biology and Genome Information Systems and Technology, Durham, NC,
Jurka, J. Repeats in genomic DNA: Mining and meaning. Curr. Opin. Struct. Biol. 8,
Li, W.H., Gu, Z., Wang, H., and Nekrutenko, A. Evolutionary analyses of the human genome. Nature 409,
Mewes, H.W., Heumann, K., Kaps, A., Mayer, K., Pfeiffer, F., Stocker, S., and Frishman, D. MIPS: A database for protein sequences and complete genomes. Nucl. Acids Res. 27,
Sargent, R., Fuhrman, D., Critchlow, T., Sera, T.D., Mecklenburg, R., Lindstrom, G., Schuler, G.D., Epstein, J.A., Ohkawa, H., and Kans, J.A., Entrez: Molecular biology database and retrieval system. Methods Enzymol. 266,
van Helden, J., Rios, A.F., and Collado-Vides, J. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol. 281,

Address correspondence to:
Jorng-Tzong Horng
Department of Computer Science and Information Engineering
National Central University
No. 38, Wu-chuan Li
Chung-li, 320, Taiwan
horng@db.csie.ncu.edu.tw

Using graphs to relate expression data and protein-protein interaction data

R. Gentleman and D. Scholtens
October 31, 2017

Introduction

In Ge et al. (2001) the authors consider an interesting question.