The Repetitive Sequence Database and Mining Putative Regulatory Elements in Gene Promoter Regions


JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 9, Number 4, 2002
Mary Ann Liebert, Inc.

The Repetitive Sequence Database and Mining Putative Regulatory Elements in Gene Promoter Regions

JORNG-TZONG HORNG,1,2 HSIEN-DA HUANG,1 MING-HUI JIN,1 LI-CHENG WU,1 and SHIR-LY HUANG2

ABSTRACT

At least 43% of the human genome is occupied by repetitive elements. Moreover, around 51% of the rice genome is occupied by repetitive elements. Analysis of repetitive elements suggests that they may have been very important in evolutionary genomics. The first part of this study describes a database of repetitive elements, RSDB. RSDB contains repetitive elements classified into three categories: exact, tandem, and similar. It provides the interfaces needed to query the data and to show results and statistics, such as the relationship between repetitive elements and genes and the cross-references of repetitive elements among different organisms. The second part of this study attempts to mine putative binding sites by examining how combinations of the known regulatory sites and the overrepresented repetitive elements in RSDB are distributed in the promoter regions of groups of functionally related genes. The overrepresented repetitive elements appearing in the associations are possible transcription factor binding sites. The proposed approach is applied to Saccharomyces cerevisiae and the promoter regions of yeast ORFs. The complete contents of RSDB and partial putative binding sites are available to the public, and readers may download partial query results.

Key words: database, DNA, data mining, repetitive elements, genes.

1. INTRODUCTION

Li et al. (2001) estimated that at least 43% of the human genome is occupied by four major classes of interspersed repetitive elements: LINEs, SINEs, LTR elements, and DNA transposons.
Their analysis provided some insights into the evolutionary genomics of the human genome. There are many repetitive elements in our genome, and they may have been very important in evolutionary genomics.

1 Department of Computer Science and Information Engineering, National Central University, Taiwan.
2 Department of Life Science, National Central University, Taiwan.

RepBase (Jurka, 1998) and STRBase are two representative databases of repeat sequences. RepBase is a collection of repeat sequences; its goal is to collect, review, and systematically organize representative examples or consensus sequences of repetitive elements from all eukaryotic species. Analysis of the repetitive DNA sequences in that work (Jurka, 1998), combined with experimental research, reveals a history of complex intracellular ecosystems of transposable elements that are inseparably associated with genomic evolution. STRBase stores microsatellites (short tandem repeats, STRs), which consist of 2–5 bp (base pair) repeats. These repeats are widespread throughout the human genome and show sufficient variability among individuals in a population that they have become important in several fields, including genetic mapping, linkage analysis, and human identity testing.

Identification of transcriptional regulatory elements within promoter regions is of great interest to biologists because these elements govern the regulation of gene expression. Transcription factors, which are proteins, play a major role in gene regulation in eukaryotic organisms. The factors bind to specific sites (transcription factor binding sites, or regulatory sites) in the promoter regions of particular genes and interact with RNA polymerase and other factors to regulate transcription. Regulatory profiles of known and unknown genes can now be determined experimentally at a genomic scale, thanks to new technologies such as DNA microarrays (van Helden, 1998). De Risi et al. (1997) studied the diauxic shift in yeast and found several distinct gene groups by clustering the gene expression profiles. They also showed the presence of several regulatory sites in the promoter regions of the respective genes. Van Helden et al. (1998) also studied the dataset constructed by De Risi et al.
(1997) and systematically searched the promoter regions of potentially co-regulated genes for overrepresented oligonucleotides, which are possible transcription factor binding sites and may be involved in gene regulation. A systematic analysis of overrepresented sequence patterns in clusters of promoter regions, obtained by clustering the diauxic shift expression profiles, was carried out by Brazma et al. (1997, 1998). They showed that the occurrence of overrepresented patterns in the promoter regions of genes from expression profile clusters of at least 25 genes cannot be explained by statistical chance. Many experimentally identified transcription regulatory sites have been collected in TRANSFAC (Heinemeyer, 1998, 1999), which is the most complete and well-maintained database on transcription factors, their genomic binding sites, and DNA-binding profiles. Brazma et al. (1997) developed a general software tool to find and analyze combinations of transcription factor binding sites that occur often in upstream gene regions in the yeast genome. In addition to analyzing the association rules in the combinations, their work compared how often the combinations appear in promoter regions and in random regions, using the ratio between the two.

To handle large amounts of data, data mining plays a prominent role in knowledge extraction. The enormous number of sequenced genomes, gene identification data, gene expression profiles, and genes categorized in functional classes allows the use of computational techniques to investigate transcriptional regulatory elements in gene promoter regions and to decipher the mechanisms of gene transcriptional regulation. Agrawal et al. (1993, 1994) introduced the problem of mining association rules.

The first part of this study describes a database of repetitive elements, RSDB. RSDB contains repetitive elements classified into three categories: exact, tandem, and similar.
It provides the interfaces needed to query the data and to show results and statistics, such as the relationship between repetitive elements and genes and the cross-references of repetitive elements among different organisms. The second part of this study first identifies the combinations of known regulatory sites from TRANSFAC and overrepresented repetitive oligomers from RSDB in the promoter regions of a selected set of genes. Association rule mining is then applied to mine the associations from the combinations of overrepresented repetitive elements and known regulatory sites. We prune the association rules statistically by chi-square testing to remove the insignificant ones. The repetitive sequences in the significant rules are candidates for putative regulatory sites.

2. SYSTEM ARCHITECTURE

RSDB is built on a personal computer with two Intel Pentium III 800 processors and 1 GB of physical memory. It runs on Red Hat Linux 6.2 with an Oracle DBMS.

FIG. 1. The relationship between repeats data.

Table 1. The Amount of Repetitive Sequences in RSDB
The contents of RSDB (24 organisms), length ≥ 7
  Identical repeats    Copies of repeats    (Repeat) Sequences
  71 million           218 million          59 million

FIG. 2. System architecture of RSDB.

The relationship between the REPEAT, REP_DETAIL, and SEQ tables in RSDB is illustrated in Fig. 1, which shows what kinds of data are stored in RSDB and how they relate to one another. The amount of repetitive sequences in RSDB is given in Table 1. The details of the organisms are given in Appendix A. The system architecture is shown in Fig. 2, and the main components of RSDB are as follows.

1. Thin client (web browser).
2. Oracle DBMS.
3. Apache web server + PHP scripting engine.
4. Suffix array daemon with the organisms' suffix array data structures.
5. Delta daemon with related hash tables.
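The role of the suffix array component listed above can be illustrated with a small sketch. The Python code below is not the RSDB implementation; it is a minimal example, written under our own assumptions, of how a suffix array lets every position of a pattern be found by binary search. The function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive construction is suitable only for short demonstration sequences.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction (sorts the suffixes directly); demonstration only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted 0-based positions at which pattern occurs in text."""
    m = len(pattern)

    def prefix(rank):
        # First m characters of the suffix with the given rank.
        start = suffix_array[rank]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    genome = "AAATCGAAGGCATCGAAGAG"  # made-up toy sequence
    sa = build_suffix_array(genome)
    print(find_occurrences(genome, sa, "ATCGAA"))  # [2, 11]
```

Because all suffixes that begin with the pattern are adjacent in the sorted order, two binary searches delimit the whole set of occurrences without scanning the sequence.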

3. TOOLS, FUNCTIONS, AND STATISTICS ON RSDB

Tools

RSDB currently provides the following tools:

• Search By Feature
• Search By Range
• Search By Repeats Pattern
• Search By Nonrepeats Pattern
• Search By Accession Number or Sequence ID
• Search By Tandem Repeats
• Personal Query History

The query Search By Feature allows users to view repeats chosen by their features. These features are length, copy, AT ratio, way, type, distribution, and scope.

With the query Search By Range, users can find the repeats in a certain range of a chromosome by specifying the beginning and end positions. Upon receiving this query, RSDB provides the number of identical repeats and the total number of copies located in the specified range. RSDB also shows the statistics for each 4,000-bp interval in the specified range. For example, when users want to know the number of repeats between the first bp and the 50,000th bp of the first chromosome of C. elegans, RSDB also provides the statistics for the ranges 1st–4,000th, 4,001st–8,000th, 8,001st–12,000th, and so on.

The query Search By Repeats Pattern searches for patterns that are identical to the sequences referred to by the repeats. The query Search By Accession Number or Sequence ID is used to view or download the flat files of repeats or sequences. That is, users can input either accession numbers or sequence IDs; display format customization is also provided. The query Search By Tandem Repeats allows users to choose the features of tandem repeats and find the matching ones. The functionality of this tool is similar to that of Search By Feature.

Queries on repetitive elements containing a given sequence

In RSDB, all the tools provided on the web are based on a variety of queries. But not all the queries are suitable for execution in the DBMS.
For example, if users submit a query to the Search By Repeats Pattern tool that describes the search condition with a regular expression, such as ATT{3,10}T, RSDB cannot handle it. In addition, the repeat AAA is removed if another repeat AAAT is found (that is, the former repeat is a substring of the latter) and AAAT occurs wherever AAA occurs. Because of this definition, users cannot find shorter repeats, such as AAA, in this case. Of course, we could search all the repeat sequences in RSDB for this short pattern, but that would be time consuming.

To solve these problems, RSDB immediately returns to users all the positions on the organism's chromosome where the pattern occurs. Users can then use these positions to query RSDB again to find the relationship between this pattern and other repeats. The processing is shown in Fig. 3. In this example, the user would like to find all repetitive elements in RSDB containing the sequence ATCGAA. RSDB locates the repetitive elements AAATCGAA, ATCGAAGGC, and ATCGAAGAG, which contain the user's query pattern ATCGAA, in a very short time. Users get the positions immediately after submitting the query and can then see how these positions relate to repeats by clicking Show Related Repeats. As shown in Fig. 4 and Fig. 5, the 25 positions where the pattern occurs are included in 176 repeat copies, which belong to 172 identical repeats.

Querying repetitive elements in a given sequence

In addition to searching for a nonrepeat pattern, users may want to input a long sequence and find out how many repeats exist within it. Intuitively, this can be done by searching the whole RSDB for every subsequence of the given long sequence, but that approach takes a lot of time.

FIG. 3. Processing flow of Search By Nonrepeat Pattern.

Thus, we have designed a mechanism to solve this problem more efficiently. With this mechanism, RSDB provides a powerful tool similar to the well-known BLAST (Altschul, 1990), which lets the user find how many repetitive elements in RSDB occur in a given sequence within a much shorter time than searching for every subsequence.

A user may submit two types of questions. First, a user may want to know whether any repetitive elements exist in a given sequence. Our algorithm for finding repetitive elements can answer this type of question. Second, a user may ask how many repetitive elements in RSDB are found in a given sequence. This question amounts to searching for the patterns that are identical to any subsequence of the given sequence. As shown in Fig. 6, a biologist asks how many repetitive elements in RSDB can be found in the sequence ATCGGA...ATAAC. The biologist only needs to enter the sequence, and RSDB uses the mechanism to return the repetitive elements, such as the accession numbers CE and CE in this example, much faster.

FIG. 4. Positions returned in Search By Nonrepeat Pattern.
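The second step of Search By Nonrepeat Pattern, relating the returned positions to repeat copies, can be sketched as an interval lookup. The example below is an assumption-laden illustration rather than the RSDB code: the function name `related_repeat_copies`, the `(sequence, start)` tuple layout, and the toy coordinates are all hypothetical. A repeat copy is considered related when it fully contains at least one occurrence of the query pattern.

```python
from bisect import bisect_left


def related_repeat_copies(pattern_positions, pattern_length, repeat_copies):
    """Return the repeat copies that fully contain a pattern occurrence.

    pattern_positions: 0-based start positions of the query pattern.
    repeat_copies: iterable of (repeat_sequence, start) tuples (hypothetical
    layout), where start is the 0-based position of the copy.
    """
    positions = sorted(pattern_positions)
    related = []
    for sequence, start in repeat_copies:
        # An occurrence at p fits inside the copy when
        # start <= p and p + pattern_length <= start + len(sequence).
        last_ok = start + len(sequence) - pattern_length
        if last_ok < start:
            continue  # the copy is shorter than the pattern
        i = bisect_left(positions, start)
        if i < len(positions) and positions[i] <= last_ok:
            related.append((sequence, start))
    return related


if __name__ == "__main__":
    # Toy sequence AAATCGAAGGCATCGAAGAG: ATCGAA occurs at positions 2 and 11.
    copies = [
        ("AAATCGAA", 0),
        ("ATCGAAGGC", 2),
        ("GGCATC", 8),      # does not contain the whole pattern
        ("ATCGAAGAG", 11),
    ]
    for copy in related_repeat_copies([2, 11], 6, copies):
        print(copy)
```

Sorting the positions once and using a binary search per copy keeps the lookup cheap even when a pattern occurs many times.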

FIG. 5. Relationship between positions and repeats.

FIG. 6. Querying repetitive elements from a given sequence ATCGGA...ATAAC.

Relationship between repeats and genes

Depending on user requirements, the relationship between repeats and genes (or proteins) can be defined as 10 types, Form 1 to Form 10, which are illustrated in Fig. 7. With this tool, RSDB shows how many genes and proteins in an organism each repeat relates to, and with which relationship (Form 1–10). One example is given in Fig. 8. The interface that shows the relationship between repetitive elements and genes is given in Figs. 9–11.

FIG. 7. All forms between genes and repeats.

As shown in Figs. 9–11, RSDB shows the number of genes and proteins related to a certain repeat and how they correlate with each other. For performance reasons, however, RSDB can find the relationship only between genes (or proteins) and repeats with fewer than 100 copies. More importantly, in addition to flat files, RSDB can show users the relationship between genes (or proteins) and repeats, as the figures illustrate.

FIG. 8. Explanation of relationship between genes, proteins, and repeats.

FIG. 9. Web pages for related genes and proteins for repeats (1/2).

FIG. 10. Web pages for related genes and proteins for repeats (2/2).

FIG. 11. Information about related genes and proteins in flat files.

Statistics on RSDB

Table 2 shows partial statistics for the number of identical repetitive elements, the number of copies, and the ratio for different length ranges. The lengths are divided into five ranges: 7–10, 11–30, 31–100, 101–1,000, and >1,000. The definitions of identical repetitive elements and of repetitive element copies are given in Fig. 1. The ratio in the third column is similar to the definition of percent of genome in Li et al. (2001), that is, the percentage of an organism's genome occupied by repetitive elements. The complete statistical data are available online.

Table 3 shows partial results of the cross-references of repetitive elements in different organisms. For example, there is a surprisingly large number of repetitive elements, 1,287,275, shared by the organisms Caenorhabditis elegans and Homo sapiens. Table 4 gives the ID and species of each organism. We believe these statistics reveal an interesting finding. Table 5 shows the longest length and the maximum number of copies of repetitive elements in each organism.

Table 2. Repetitive Elements in the Organisms Listed by Different Length Regions
Columns: Name, followed by Identical, Copies, and % for each length range.
  Bacillus subtilis
  Haemophilus influenzae Rd
  Helicobacter pylori J99
  Helicobacter pylori
  Mycoplasma genitalium
  Mycobacterium tuberculosis H37Rv
  Escherichia coli
  Aquifex aeolicus

Table 3. Cross-References of Repetitive Elements in Different Organisms

Table 4. The Abbreviation of the Organisms
  ID   Name
  1    Caenorhabditis elegans
  2    Homo sapiens
  3    Saccharomyces cerevisiae
  4    Bacillus subtilis
  5    Haemophilus influenzae Rd
  6    Helicobacter pylori J99
  7    Helicobacter pylori
  8    Mycoplasma genitalium

Table 5. The Longest Length and the Maximum Copies of Repetitive Elements
Columns: Name, Longest length, Max copies.
  Caenorhabditis elegans
  Homo sapiens
  Saccharomyces cerevisiae
  Bacillus subtilis
  Haemophilus influenzae Rd
  Helicobacter pylori J99
  Helicobacter pylori
  Mycoplasma genitalium
  Mycobacterium tuberculosis H37Rv
  Escherichia coli
  Aquifex aeolicus
  Aeropyrum pernix K

4. APPLICATIONS OF REPETITIVE ELEMENTS IN RSDB

Li et al. (2001) analyzed the draft human genome sequence for data related to evolutionary genomics and revealed new information about repetitive elements (i.e., SINEs and LINEs), domain sharing and conservation, and gene duplication in the human genome. Their analysis has provided some evolutionary

insights into the human genome. There are many repetitive elements in the complete genomes stored in RSDB, and they are likely to be very important features in evolutionary genomics.

One application (Horng, 2001a) attempts to discover putative regulatory elements by investigating how combinations of the known regulatory sites and overrepresented repetitive elements are distributed in the promoter regions of groups of functionally related genes. We present this application in detail in the next section.

We also experimented on the C. elegans sequences from Sanger and NCBI to find the longest repetitive element. In theory, the longest repetitive elements found in the two sources should be identical if the sequences in Sanger and NCBI are assembled correctly. However, we found that they were not identical and that the difference between them was large. The sequence in Sanger had been corrected when we accessed it again. Similarly, we observed that the C. elegans sequence differs among copies from different years, e.g., 1998, 1999, and 2000. The longer repetitive elements disappeared year by year. This suggests that finding many long repetitive elements in a sequence may indicate errors in fragment assembly.

Another application of the repetitive elements in RSDB is the design of primers and markers for extracting fragment sequences from sequenced genomes. The basic idea is that primer and marker sequences must be nonrepetitive, in contrast to the repetitive sequences in genomes. To design primers and markers in a genome, all repetitive elements are first marked in the genome; the regions not covered by any repetitive element are nonrepetitive sequences and are candidates for primers or markers. In addition, a pair of candidate primer or marker sequences extracted from those nonrepetitive regions can be used to delimit target sequences of different lengths, such as genes, STRs, SINEs, and LINEs.

5. MINING PUTATIVE BINDING SITES USING REPETITIVE ELEMENTS IN RSDB

Approach

The proposed approach is as follows. We first preprocess the data to find the combinations of known regulatory sites and overrepresented repetitive oligomers located in the promoter regions of groups of functionally related genes. Next, Apriori and AprioriTid (Agrawal, 1993, 1994) are applied to mine association rules that combine the known regulatory sites and overrepresented repetitive oligomers. A chi-square test is then used to select the significant rules. Finally, the overrepresented repetitive oligomers in the selected association rules are taken as putative regulatory elements.

Materials

Before the associations of known regulatory sites and overrepresented repetitive sequences in promoter regions are analyzed, the whole sequence of the target genome and the gene annotations are obtained from NCBI. The experimentally identified transcription factor binding sites are obtained from TRANSFAC. The TRANSFAC database (Heinemeyer, 1998, release 4.0) contains 4,965 site sequences and 2,837 factor entries. Most sites are also consensus patterns. The data in TRANSFAC have the following features. A transcription factor binding site accession number may have different consensus sequences. Different binding site accession numbers may have the same consensus sequence. Wildcard characters such as M or W used in TRANSFAC make some sequences cover other sequences. Small consensus sequences may appear in larger ones. The repetitive sequences of the target genome, i.e., yeast, are obtained from the repetitive sequence database (RSDB) (Horng, 2001b). In MIPS, 6,350 yeast genes and ORFs are documented (Mewes, 1999), and 3,529 genes are classified into at least one functional catalogue.

Statistical analysis of overly-represented repetitive oligonucleotides

Nucleotide succession is not random, and some oligonucleotides are clearly overrepresented, notably the poly(A), poly(T), and poly(AT) chains.
An additional bias results from the fact that oligonucleotides are represented differently in coding and noncoding sequences (van Helden et al., 1998). Thus, a specific expected frequency has to be used for each oligonucleotide sequence. Van Helden et al. (1998) proposed a statistical method that uses the binomial formula to estimate the probability of observing exactly n occurrences of the oligonucleotide b within the promoter regions of a gene family. A significance coefficient is derived from this probability, and the oligomers with the highest coefficients are the most overrepresented. The advantage of the significance coefficient is that its threshold can be selected, and its values interpreted, independently of oligonucleotide size, upstream sequence size, and the number of genes within the family. The overrepresented repetitive sequences of yeast are obtained by applying the statistical method of van Helden et al. (1998). We set the threshold to 0. Repetitive sequences are selected as significantly overrepresented repetitive oligomers if their coefficients are larger than the threshold.

Preprocessing and mapping

The transcription factor binding sites in TRANSFAC and the repetitive elements in RSDB are prepared first. The known regulatory sites and repetitive sequences are then located in the promoter regions of groups of functionally related genes, and the combinations of known regulatory sites and overrepresented repetitive sequences within the promoter regions are found. For more details about the preprocessing, the reader may refer to Horng (2001a).

In the following, we describe how to mine associations from the combinations of transcription factor binding sites and overrepresented repetitive sequences. Consider a large database of transactions, where each transaction consists of a set of items. An association rule is an expression of the form $A \Rightarrow B$, where $A$ and $B$ are sets of items. The meaning of an association rule is that a transaction in the database that contains $A$ also tends to contain $B$. For example, 90% of the people who purchase beer also purchase diapers. Here, 90% is called the confidence of the rule. The support of the rule $A \Rightarrow B$ is the percentage of transactions that contain both $A$ and $B$. The formal statement of the problem is given below. Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of sites, called the item set.
Let $D$ be a set of promoter regions, where each promoter region $P$ corresponds to a transaction and contains a set of items such that $P \subseteq I$. Let $S = \{s_1, s_2, \ldots, s_m\}$ be a set of transcription factor binding sites in TRANSFAC and $R = \{r_1, r_2, \ldots, r_n\}$ be a set of overrepresented repetitive sequences from RSDB. The union of the sets $S$ and $R$ is the item set $I$. Let $G = \{g_1, g_2, \ldots, g_m\}$ be a group of functionally related genes. The promoter region of each gene corresponds to a transaction and contains a set of transcription factor binding sites and overrepresented repeats, also called items. A promoter region $P$ contains $A$, a set of items of $I$, if $A \subseteq P$. An association rule is an implication of the form $A \Rightarrow B$, where $A \subset I$, $B \subset I$, and $A \cap B = \emptyset$. The rule $A \Rightarrow B$ holds in the set of promoter regions $D$ with confidence $c$ if $c\%$ of the promoter regions in $D$ that contain $A$ also contain $B$. The rule $A \Rightarrow B$ has support $s$ in $D$ if $s\%$ of the promoter regions in $D$ contain $A \cup B$. In our experiments, the minimum support is set to 10%. A rule is generated if its support and confidence are at least the minimum values specified by the user. Apriori and AprioriTid (Agrawal, 1993, 1994) are then applied to mine the association rules.

Results

We apply the proposed method to Saccharomyces cerevisiae. The genome sequence and gene information are obtained from GenBank. The functional catalogues of yeast ORFs are collected in MIPS (Mewes, 1999). The known regulatory sites are obtained from TRANSFAC (Heinemeyer, 1998), and the repetitive sequences of the yeast genome are obtained from RSDB (Horng, 2001b). Table 6 shows the numbers of known regulatory sites and overrepresented repetitive elements in the promoter regions of the genes in each functional catalogue. Table 7 shows the associations mined by our proposed approach. The description of the functional catalogues and the ORFs in these catalogues is given in Appendixes B and C.
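The definitions of support and confidence given above, together with chi-square pruning, can be made concrete with a short sketch. The code below is a simplified, level-wise search in the spirit of Apriori; it is not AprioriTid and not the authors' implementation. All function names are hypothetical, each promoter region is assumed to be a Python set of item labels, rules are restricted to a single-item consequent, and the chi-square statistic is the standard Pearson statistic on the 2×2 table of "contains A" versus "contains B", which is our assumption about how such a test could be set up.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets whose support >= min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while current:
        level = {}
        for candidate in current:
            count = sum(1 for t in transactions if candidate <= t)
            if count / n >= min_support:
                level[candidate] = count
        frequent.update(level)
        # Join frequent itemsets of the current size into candidates one item
        # larger, keeping only those whose subsets are all frequent.
        size += 1
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == size and all(
                frozenset(sub) in level for sub in combinations(union, size - 1)
            ):
                candidates.add(union)
        current = list(candidates)
    return frequent


def chi_square(n, count_a, count_b, count_ab):
    """Pearson chi-square for the 2x2 table: contains A versus contains B."""
    observed = [count_ab, count_a - count_ab, count_b - count_ab,
                n - count_a - count_b + count_ab]
    expected = [count_a * count_b / n, count_a * (n - count_b) / n,
                (n - count_a) * count_b / n, (n - count_a) * (n - count_b) / n]
    if any(e == 0 for e in expected):
        return 0.0  # degenerate table: A or B occurs in all or no transactions
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def association_rules(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence, chi2) tuples."""
    n = len(transactions)
    frequent = frequent_itemsets(transactions, min_support)
    rules = []
    for itemset, count_ab in frequent.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            antecedent = itemset - {item}
            consequent = frozenset([item])
            count_a = frequent[antecedent]  # subsets of frequent sets are frequent
            confidence = count_ab / count_a
            if confidence >= min_confidence:
                chi2 = chi_square(n, count_a, frequent[consequent], count_ab)
                rules.append((antecedent, consequent, count_ab / n, confidence, chi2))
    return rules
```

A rule would then be kept only if its chi-square value exceeds a chosen critical value (for example, 3.84 for one degree of freedom at the 5% level); which threshold was used in the study is not stated here.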
The first column in Table 7 is the functional catalogue from MIPS; the second is the numerical format of the functional category; the third is the number of genes in the corresponding family; the fourth is the number of associations discovered by data mining; the fifth is the maximum number of items in a transaction; the sixth is the average number of items in a transaction; and the last is the number of significant associations after applying the chi-square test. The minimum support and confidence are set to 60%. The notation of the associations is given in Fig. 12.

Some interesting and significant association rules in each functional catalogue are shown in Table 8. The first column in Table 8 is the functional catalogue from MIPS; the second is the association, with known regulatory sites in uppercase and overrepresented repetitive oligomers in lowercase; the third is the confidence of the association; the fourth is the support value; the fifth is the chi-square value; and the last is the same as the second except that transcription factor names are used. The putative regulatory elements of the functional catalogue Glycolysis and gluconeogenesis mined by our proposed method are listed in Appendix D.

Table 9 shows an example of the occurrences of the association [NF-E ⇒ atgcaa] in the functional catalogue Glycolysis and gluconeogenesis. The first column in Table 9 is the name of the yeast ORF; the second is the positions of known regulatory sites or putative regulatory sites in the promoter region.

Table 6. The Amounts of Known Regulatory Sites and Over-Represented Repetitive Elements Related to Groups of Functionally Related Genes
Columns: Functional catalogues in MIPS; Functional catalogues in numeric; No. of related over-represented repetitive elements; No. of related known regulatory sites in TRANSFAC.
  Glycolysis and gluconeogenesis
  DNA synthesis and replication
  Ribosomal RNA synthesis
  tRNA synthesis
  Ion transporters
  Transport ATPases
  ABC transporters
  Drug transporters

Table 7. The Mined Association Rules After Pruning by Chi-Square
Columns: Functional catalogues in MIPS; Functional catalogues in numeric; No. of genes; No. of associations before pruning; Max no. of items; Average no. of items; No. of significant association rules.
  Glycolysis and gluconeogenesis
  DNA synthesis and replication
  Ribosomal RNA synthesis
  tRNA synthesis
  Ion transporters
  Transport ATPases
  ABC transporters
  Drug transporters

FIG. 12. An illustrative example of prediction of putative regulatory sites.

Table 8. Partial Significant Associations Mined in Each Functional Catalogue
Columns: Functional catalogues in MIPS; Associations; Conf; Sup; χ2; Associations (transcription factor names). Each row below gives the association by site, then by factor name.

Glycolysis and gluconeogenesis
  AAGGAA ⇒ acatat | c-ets-2 ⇒ acatat
  CTATC ⇒ atgcaa | NF-E ⇒ atgcaa
  atggaa, CTAAT, TCTCC, TTCAAA ⇒ aaaaac | atggaa, unknown, ADR1, TFIID ⇒ aaaaac
  AAGGAA, atggaa, TGGCA, TTCAAA ⇒ taagaa | c-ets-2, atggaa, NF-1/L, TFIID ⇒ taagaa

DNA synthesis and replication
  GCAAT ⇒ attgca | unknown ⇒ attgca
  GCAAT ⇒ attatg | unknown ⇒ attatg
  CTAAT ⇒ attcac | unknown ⇒ attcac
  CTAAT ⇒ tgtgaa | unknown ⇒ tgtgaa

tRNA synthesis
  AGAGG, cttcta ⇒ CTGTC | unknown, cttcta ⇒ NF-E
  CATCC, TATAT, TGGCA ⇒ aaatta | GCR1, unknown, NF-1/L ⇒ aaatta
  AAGGAA, AGAGG, ATTGG, CCAAT, cttcta, GATAA, NGGRGK, TATAT, TTATC ⇒ TTCCTT | c-ets-2, unknown, unknown, SRF, cttcta, GAL4, NGGRGK, unknown, DBF-A ⇒ c-ets-2
  aaaatt, AAGGAA, ATTGG, CATCC, CCAAT, CTATC, NGGRGK, TATAT, ttaaaa ⇒ TTCCTT | aaaatt, c-ets-2, unknown, GCR1, SRF, NF-E, NGGRGK, unknown, ttaaaa ⇒ c-ets-2

Table 9. The Occurrences of the Association (NF-E ⇒ atgcaa) in the Functional Catalogue Glycolysis and Gluconeogenesis
  ORFs      Associations (NF-E ⇒ atgcaa)
  YBR218c   [-896]-CTATC-[347]-CTATC-[122]-atgcaa-[337]-atgcaa
  YBR221c   [-720]-atgcaa-[100]-CTATC-[44]-CTATC-[281]-atgcaa-[29]-CTATC
  YCR012w   [-678]-CTATC-[535]-atgcaa-[16]-atgcaa
  YDR050c   [-789]-atgcaa-[598]-CTATC
  YER178w   [-487]-CTATC-[15]-atgcaa
  YGL253w   [-772]-atgcaa-[586]-CTATC-[97]-CTATC
  YGR192c   [-751]-CTATC-[53]-atgcaa-[603]-atgcaa-[24]-atgcaa
  YGR240c   [-492]-atgcaa-[84]-atgcaa-[208]-CTATC
  YGR254w   [-474]-CTATC-[137]-atgcaa
  YHR174w   [-777]-CTATC-[166]-CTATC-[226]-atgcaa-[132]-CTATC
  YJL052w   [-670]-CTATC-[40]-CTATC-[536]-atgcaa
  YJR009c   [-895]-CTATC-[75]-CTATC-[391]-atgcaa
  YKL060c   [-239]-atgcaa-[98]-CTATC
  YKL152c   [-910]-CTATC-[385]-CTATC-[100]-atgcaa-[94]-CTATC
  YKR043c   [-970]-CTATC-[191]-atgcaa-[10]-CTATC-[15]-CTATC
  YLR345w   [-663]-CTATC-[545]-CTATC-[68]-atgcaa
  YMR280c   [-498]-CTATC-[320]-CTATC-[93]-atgcaa-[38]-CTATC
  YMR323w   [-759]-CTATC-[52]-atgcaa-[291]-atgcaa-[282]-atgcaa
  YOL056w   [-768]-atgcaa-[16]-CTATC-[149]-atgcaa-[54]-CTATC-[168]-CTATC
  YOR283w   [-868]-CTATC-[56]-atgcaa
  YOR393w   [-767]-CTATC-[52]-atgcaa
  YPL281c   [-292]-atgcaa-[53]-CTATC
For example, in the first row of Table 9, YBR218c is the ORF name, and [-896]-CTATC-[347]-CTATC-[122]-atgcaa-[337]-atgcaa is the composition of the association. The first number, [-896], denotes the offset of the site CTATC from the start position of the coding region, and [347] denotes the distance in nucleotides between the first and second occurrences. Table 10 shows an example of the occurrences of the association [unknown, cttcta ⇒ NF-E] in the functional catalogue tRNA synthesis.

Observations

We further observe the repetitive property of the transcription factor binding sites in TRANSFAC (Heinemeyer, 1998) for the yeast genome. Table 11 shows the occurrences of some of the transcription factor binding sites in TRANSFAC (Heinemeyer, 1998) in the yeast genome. For example, the first site, TATATAAT, appears 1,691 times in the yeast genome. The site appears 418 times in the upstream regions of genes and 656 times in genes. We further find that 679 genes contain the site; that is, the site may appear in the upstream region, in the gene, or in both. Table 11 shows that the repetitive elements in RSDB are candidates for putative regulatory elements, and the data in Table 11 seem to indicate that our approach is promising.

6. CONCLUSIONS

Around 43% of the human genome is occupied by repetitive elements. Moreover, around 51% of the rice genome is occupied by repetitive elements. Analysis of repetitive elements suggests that they may have been very important in evolutionary genomics.

The contributions of the first part of this study are briefly stated below. We develop a database for repetitive elements and provide users with not only query tools but also useful and possibly significant statistics. Some of the statistics relate to the entire data in RSDB, and some relate specifically to the data that match users' queries. We hope that these statistics can help biologists discover a whole new world in biology some day. RSDB not only stores data on repeats but also integrates some gene data into the system, which makes RSDB more comprehensive. RSDB also provides useful tools such as BLAST to query repetitive elements in RSDB from a given sequence or to find repetitive elements containing a given sequence.

Table 10. The Occurrences of the Association (Unknown^a, cttcta, and NF-E) in Functional Catalogue tRNA Synthesis

ORFs      Associations (unknown, cttcta, and NF-E)
YBR123c   [-906]-CTGTC-[61]-AGAGG-[740]-cttcta-[21]-AGAGG-[68]-CTGTC-
YDL150w   [-581]-AGAGG-[6]-AGAGG-[81]-cttcta-[32]-CTGTC-[43]-cttcta-[58]-AGAGG-[257]-AGAGG-[4]-CTGTC-
YDR362c   [-766]-cttcta-[19]-CTGTC-[37]-CTGTC-[7]-AGAGG-[49]-CTGTC-[422]-cttcta-[47]-AGAGG-[142]-CTGTC-
YER148w   [-841]-CTGTC-[55]-cttcta-[374]-AGAGG-
YGL019w   [-970]-CTGTC-[20]-AGAGG-[200]-AGAGG-[507]-CTGTC-[207]-cttcta-
YHR143w   [-985]-AGAGG-[15]-AGAGG-[3]-cttcta-[54]-cttcta-[78]-CTGTC-[175]-CTGTC-[148]-AGAGG-
YKL144c   [-613]-AGAGG-[21]-CTGTC-[308]-AGAGG-[57]-AGAGG-[35]-AGAGG-[17]-cttcta-
YNL039w   [-952]-CTGTC-[269]-AGAGG-[113]-cttcta-[388]-AGAGG-[95]-AGAGG-[47]-cttcta-
YNR003c   [-493]-cttcta-[123]-AGAGG-[43]-cttcta-[166]-CTGTC-
YOR110w   [-890]-CTGTC-[59]-CTGTC-[279]-cttcta-[148]-AGAGG-
YOR116c   [-856]-CTGTC-[124]-AGAGG-[151]-cttcta-[423]-AGAGG-
YOR210w   [-799]-AGAGG-[136]-CTGTC-[117]-cttcta-
YPR186c   [-757]-AGAGG-[26]-CTGTC-[81]-AGAGG-[296]-cttcta-[106]-AGAGG-[15]-CTGTC

^a The patterns are found in TRANSFAC with unknown name.

Table 11. The Occurrences of Partial Transcription Factor Binding Sites in TRANSFAC (Heinemeyer, 1998) in Yeast Genome

                                                      Occurrences (times)          No. of genes
Sites in    Accession  Site ID in                                                 containing
TRANSFAC    number     TRANSFAC         Genome   Upstreams   Genes      the site
TATATAAT    R01732     AD$E1B_          1,691    418         656        679
AGAACA      R00973     CHICK$LYS_
TATAAAA     R00047     DROME$AC5C_
AAGGAA      R04338     HS$CDC2_
AAGGAA      R04338     HS$CDC2_
AAGTGA      R00917     HS$IFNB_
GGCGCG      R01772     HSV1$TK_
TAAAAAA     R04010     MOUSE$ADA_
TTCCTC      R04413     MOUSE$FCGR3A_
GCCAAT      R03384     MOUSE$JUND_
TATAAGA     R03388     MOUSE$JUND_
TTCAAA      R02749     MOUSE$MBP_
TTTAAA      R01598     MOUSE$WAP_
ATAAATA     R01371     SV$SV40_
TTTTTTG     R01566     VIV$VISNA_
TTTTTTG     R01566     VIV$VISNA_
AAGTACAT    R01361     Y$STE2_
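The four counts reported for each site in Table 11 (genome, upstreams, genes, and number of genes containing the site) could be obtained along the following lines. This is only a sketch under stated assumptions: the function names, the `(start, end)` gene representation, the forward-strand-only scan, and the `upstream_len` default of 1,000 nucleotides are all hypothetical choices for illustration and are not taken from the paper.

```python
from typing import Dict, Tuple


def count_overlapping(sequence: str, site: str) -> int:
    """Count occurrences of `site` in `sequence`, allowing overlaps."""
    count = 0
    start = sequence.find(site)
    while start != -1:
        count += 1
        start = sequence.find(site, start + 1)
    return count


def site_statistics(
    genome: str,
    genes: Dict[str, Tuple[int, int]],
    site: str,
    upstream_len: int = 1000,  # assumed promoter length, not from the paper
) -> Dict[str, int]:
    """Tally a site over the genome, upstream regions, and gene bodies.

    `genes` maps a gene name to 0-based (start, end) coordinates on the
    forward strand; reverse-strand genes are ignored in this sketch.
    """
    genome = genome.upper()
    site = site.upper()
    in_upstreams = in_genes = genes_containing = 0
    for start, end in genes.values():
        upstream = genome[max(0, start - upstream_len):start]
        body = genome[start:end]
        hits_up = count_overlapping(upstream, site)
        hits_body = count_overlapping(body, site)
        in_upstreams += hits_up
        in_genes += hits_body
        # A gene "contains" the site if it occurs upstream, inside, or both.
        if hits_up or hits_body:
            genes_containing += 1
    return {
        "genome": count_overlapping(genome, site),
        "upstreams": in_upstreams,
        "genes": in_genes,
        "genes_containing": genes_containing,
    }


# Toy data: two genes, each preceded by a 12-nucleotide upstream region.
toy_genome = "GGTATATAATGG" + "ATGCCCTAA" + "CCTATATAATCC" + "ATGTATATAATTAA"
toy_genes = {"geneA": (12, 21), "geneB": (33, 47)}
print(site_statistics(toy_genome, toy_genes, "TATATAAT", upstream_len=12))
# → {'genome': 3, 'upstreams': 2, 'genes': 1, 'genes_containing': 2}
```

Because occurrences in intergenic sequence outside the chosen upstream window are counted only in the genome total, the upstream and gene counts need not add up to the genome count, which is also the case for TATATAAT in Table 11.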

The contribution of the second part of this study is to find combinations of known regulatory sites and overrepresented repetitive oligomers located within the promoter regions of groups of functionally related genes. Data mining techniques are then applied to mine the associations. Our proposed approach can mine putative regulatory elements in any complete genome, such as the yeast genome used in this study. The parameters used to identify overrepresented repetitive sequences within the promoter regions of genes can be specified by users according to their needs. The discovered associations of known and putative regulatory elements can also provide useful information to researchers studying the mechanisms of gene transcriptional regulation.

APPENDIX A: THE AMOUNT OF REPETITIVE ELEMENTS IN THE ORGANISMS

Organism Id   Organism name   Identical repeat   Total copies
1   Caenorhabditis elegans
    Homo sapiens
    Saccharomyces cerevisiae
    Bacillus subtilis
    Haemophilus influenzae Rd
    Helicobacter pylori J
    Helicobacter pylori
    Mycoplasma genitalium
    Mycobacterium tuberculosis H37Rv
    Escherichia coli
    Aquifex aeolicus
    Aeropyrum pernix K
    Archaeoglobus fulgidus
    Chlamydia pneumoniae AR
    Chlamydia trachomatis
    Mycoplasma pneumoniae M
    Pyrococcus horikoshii OT
    Rickettsia prowazekii strain Madrid E
    Synechocystis PCC
    Thermotoga maritima
    Treponema pallidum subsp. pallidum
    Ureaplasma urealyticum
    Pyrococcus abyssi
    Arabidopsis thaliana

APPENDIX B: THE FUNCTIONAL CATALOGUES AND THEIR ORFS STUDIED IN THIS WORK (FIRST GROUP)

Functional catalogues   Numerical in MIPS format   ORFs

Glycolysis and gluconeogenesis
YAL038w, YBR196c, YBR218c, YBR221c, YCL040w, YCR012w, YDL021w, YDR050c, YER065c, YER178w, YGL062w, YGL253w, YGR192c, YGR193c, YGR240c, YGR254w, YHR174w, YJL052w, YJR009c, YKL060c, YKL152c, YKR043c, YKR097w, YLR345w, YLR377c, YMR205c, YMR280c, YMR323w, YNL071w, YOL056w, YOR283w, YOR347c, YOR393w, YPL281c

DNA synthesis and replication
YAL040c, YAR007c, YBL023c, YBL035c, YBR060c, YBR087w, YBR088c, YBR160w, YBR195c, YBR202w, YBR278w, YCR028c-a, YCR077c, YDL017w, YDL102w, YDL164c, YDR052c, YDR054c, YDR068w, YDR110w, YDR206w, YDR364c, YDR499w, YEL032w, YER070w, YER176w, YFL036w, YFR027w, YFR028c, YGL058w, YGL201c, YGR109c, YGR132c, YGR180c, YGR231c, YHR118c, YHR164c, YIL036w, YIL066c, YIL139c, YIL150c, YIR008c, YJL026w, YJL065c, YJL090c, YJL173c, YJL194w, YJR006w, YJR043c, YJR046w, YJR068w, YKL045w, YKL108w, YKL112w, YKL113c, YLL004w, YLR103c, YLR182w, YLR233c, YLR274w, YLR369w, YML058w, YML065w, YML102w, YMR001c, YMR072w, YMR241w, YMR284w, YNL088w, YNL102w, YNL213c, YNL218w, YNL261w, YNL262w, YNL290w, YNL312w, YOL006c, YOL094c, YOL095c, YOL115w, YOR217w, YOR330c, YPL167c, YPL256c, YPR018w, YPR019w, YPR120c, YPR135w, YPR162c, YPR175w

rRNA synthesis
YAL001c, YBL014c, YBL025w, YBR049c, YBR123c, YBR154c, YDL150w, YDR156w, YDR362c, YER148w, YGR047c, YGR246c, YHR143wa, YJL025w, YJL148w, YJR063w, YKL125w, YKL144c, YLR039c, YLR141w, YML043c, YMR270c, YNL039w, YNL113w, YNL151c, YNL248c, YNR003c, YOR110w, YOR116c, YOR207c, YOR210w, YOR224c, YOR340c, YOR341w, YPR010c, YPR110c, YPR186c, YPR187w, YPR190c

tRNA synthesis
YAL001c, YBR123c, YBR154c, YDL150w, YDR362c, YER148w, YGL019w, YGR047c, YGR246c, YHR143wa, YKL144c, YNL039w, YNL113w, YNL151c, YNR003c, YOR110w, YOR116c, YOR207c, YOR210w, YOR224c, YPR110c, YPR186c, YPR187w, YPR190c

APPENDIX C: THE FUNCTIONAL CATALOGUES AND THEIR ORFS STUDIED IN THIS WORK (SECOND GROUP)

Functional catalogues   Numerical in MIPS format   ORFs

Ion transporters
YAL026c, YBR127c, YBR207w, YBR235w, YBR291c, YBR294w, YBR295w, YBR296c, YCR024c-a, YCR037c, YDL128w, YDL185w, YDR011w, YDR270w, YDR456w, YEL017c-a, YEL027w, YEL031w, YEL051w, YEL065w, YER053c, YER145c, YFL041w, YFL050c, YGL006w, YGL008c, YGL167c, YGL255w, YGR020c, YGR065c, YGR121c, YGR191w, YHL016c, YHR026w, YHR039c-a, YHR175w, YJL094c, YJL117w, YJL129c, YJL198w, YJR040w, YJR077c, YKL080w, YKL120w, YKR050w, YLR092w, YLR130c, YLR138w, YLR348c, YLR411w, YLR447c, YML123c, YMR054w, YMR058w, YMR243c, YMR319c, YNL142w, YNL259c, YNR013c, YNR039c, YOL122c, YOL130w, YOR153w, YOR270c, YOR316c, YOR332w, YPL036w, YPL176c, YPL224c, YPL234c, YPR003c, YPR036w, YPR124w, YPR138c, YPR201w

Transport ATPases
Q0080, Q0085, Q0130, YAL026c, YBL099w, YBR039w, YBR127c, YBR295w, YCR024c-a, YDL004w, YDL185w, YDR038c, YDR039c, YDR040c, YDR093w, YDR270w, YDR298c, YEL017c-a, YEL027w, YEL031w, YEL051w, YER166w, YGL006w, YGL008c, YGL167c, YGR020c, YHR026w, YHR039c-a, YIL048w, YJR121w, YKL016c, YKL080w, YLR295c, YLR447c, YMR054w, YMR162c, YOR270c, YOR332w, YPL036w, YPL078c, YPL234c, YPL271w, YPR036w

ABC transporters
YCR011c, YDR011w, YDR091c, YDR135c, YDR406w, YER036c, YFR009w, YGR281w, YHL035c, YIL013c, YKL188c, YKL209c, YKR103w, YKR104w, YLL015w, YLL048c, YLR188w, YMR301c, YNL014w, YNR070w, YOL075c, YOR011w, YOR153w, YOR328w, YPL058c, YPL147w, YPL226w, YPL270w

Drug transporters
YBR008c, YBR043c, YBR052c, YBR180w, YBR293w, YCL069w, YCL073c, YDR119w, YEL065w, YGR138c, YGR197c, YGR224w, YGR281w, YHL040c, YHL047c, YHR032w, YHR048w, YIL013c, YIL048w, YIL120w, YIL121w, YKR105c, YKR106w, YLL028w, YML116w, YMR088c, YMR123w, YMR279c, YNL065w, YNR055c, YOL158c, YOR273c, YOR378w, YPR156c, YPR198w

APPENDIX D: THE ASSOCIATION OF REGULATORY SITES MINED IN THE CATALOGUE OF GLYCOLYSIS AND GLUCONEOGENESIS

Functional classification catalogue: Glycolysis and gluconeogenesis
Column headings: ORFs, Associations, Conf., Sup., χ², Associations (transcription factor names)

Associations                                 Associations (transcription factor names)
AAGGAA ⇒ acatat                              c-ets-2 ⇒ acatat
CTATC ⇒ atgcaa                               NF-E ⇒ atgcaa
AAGGAA, CCAAT ⇒ acaaaa                       c-ets-2, SRF ⇒ acaaaa
AGAGG, AGGAAA ⇒ aaaatc                       unknown, PEA3 ⇒ aaaatc
AGAGG, GTCAC, TGACG ⇒ CTATC                  unknown, unknown, CREB ⇒ NF-E
AGAGG, TGACG ⇒ CTATC                         unknown, CREB ⇒ NF-E
AGAGG, ggaaaa ⇒ aaaatc                       unknown, ggaaaa ⇒ aaaatc
ATGAAAA ⇒ aaaaac                             Pit-1a ⇒ aaaaac
atgcaa ⇒ CTATC                               atgcaa ⇒ NF-E
atgcaa ⇒ GCAAT                               atgcaa ⇒ unknown
atggaa, CTAAT, TCTCC, TTCAAA ⇒ aaaaac        atggaa, unknown, ADR1, TFIID ⇒ aaaaac
TGGCA ⇒ atggaa                               NF-1/L ⇒ atggaa
CTAAT ⇒ TAAAAT                               unknown ⇒ Pit-1a/F2F
CTAAT ⇒ TCTCC                                unknown ⇒ ADR1
AAGGAA, TGGCA, TTCAAA ⇒ taagaa               c-ets-2, NF-1/L, TFIID ⇒ taagaa
AAGGAA, acaaaa ⇒ CCAAT                       c-ets-2, acaaaa ⇒ SRF
AAGGAA, atggaa, TGGCA, TTCAAA ⇒ taagaa       c-ets-2, atggaa, NF-1/L, TFIID ⇒ taagaa
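The confidence, support, and Chi-square columns of Appendix D can be illustrated with a small calculation. The sketch below uses the standard association-rule definitions (support as the fraction of promoters containing every site in the rule, confidence as the fraction of promoters with the left-hand sites that also contain the right-hand site) and the Pearson Chi-square statistic of the 2×2 presence/absence table. These are assumptions about how the values were derived; the function name `rule_statistics` and the toy promoter sequences are hypothetical.

```python
from typing import Dict, Iterable


def rule_statistics(
    promoters: Dict[str, str],
    antecedent: Iterable[str],
    consequent: str,
) -> Dict[str, float]:
    """Support, confidence, and Chi-square of `antecedent => consequent`.

    `promoters` maps an ORF name to its promoter sequence. A site is
    considered present if it occurs anywhere in the sequence.
    """
    if not promoters:
        raise ValueError("at least one promoter sequence is required")
    lhs = [site.upper() for site in antecedent]
    rhs = consequent.upper()
    # 2x2 contingency counts: a = both, b = lhs only, c = rhs only, d = neither.
    a = b = c = d = 0
    for sequence in promoters.values():
        sequence = sequence.upper()
        has_lhs = all(site in sequence for site in lhs)
        has_rhs = rhs in sequence
        if has_lhs and has_rhs:
            a += 1
        elif has_lhs:
            b += 1
        elif has_rhs:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    support = a / n
    confidence = a / (a + b) if (a + b) else 0.0
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    # Pearson Chi-square for a 2x2 table; 0.0 when a margin is empty.
    chi_square = n * (a * d - b * c) ** 2 / denominator if denominator else 0.0
    return {"support": support, "confidence": confidence, "chi_square": chi_square}


# Toy promoters (hypothetical sequences, not taken from the yeast genome).
toy_promoters = {
    "orf1": "GGCTATCAAATGCAAGG",  # contains CTATC and atgcaa
    "orf2": "CTATCTTTATGCAATT",   # contains CTATC and atgcaa
    "orf3": "CTATCGGGGGG",        # contains CTATC only
    "orf4": "GGGGTTTTCCCC",       # contains neither
}
stats = rule_statistics(toy_promoters, ["CTATC"], "atgcaa")
print(stats["support"], round(stats["confidence"], 3), round(stats["chi_square"], 3))
# → 0.5 0.667 1.333
```

In this toy set, two of the four promoters contain both sites (support 0.5), and two of the three promoters containing CTATC also contain atgcaa (confidence 2/3).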

ACKNOWLEDGMENTS

The authors would like to thank the National Science Council of the Republic of China and the Asia Bioinnovation Corporation for financial support of this research. Prof. Ueng-Cheng Yang and Dr. Yu-Chung Chang are appreciated for their valuable discussion regarding molecular biology. Prof. Cheng-Yan Kao is also commended for his suggestions regarding our database.

REFERENCES

Agrawal, R., Imielinski, T., and Swami, A. Mining associations between sets of items in large databases. Proc. ACM SIGMOD Int. Conf. Management of Data, Washington D.C.
Agrawal, R., and Srikant, R. Fast algorithms for mining association rules. Proc. 20th Int. Conf. Very Large Databases, Santiago, Chile, Sept. Expanded version available as IBM Research Report RJ9839.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 215.
Benson, D.A., Boguski, M.S., Lipman, D.J., Ostell, J., Ouellette, B.F., Rapp, B.A., and Wheeler, D.L. GenBank. Nucl. Acids Res. 27.
Brazma, A., Jonassen, I., Vilo, J., and Ukkonen, E. Predicting gene regulatory elements in silico on a genomic scale. Genome Res. 8.
Brazma, A., Vilo, J., Ukkonen, E., and Valtonen, K. Data mining for regulatory elements in yeast genome. Proc. 5th Int. Conf. Intelligent Systems for Molecular Biology. AAAI Press.
DeRisi, J.L., Iyer, V.R., and Brown, P.O. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278.
Heinemeyer, T., Chen, X., Karas, H., Kel, A.E., Kel, O.V., Liebich, I., Meinhardt, T., Reuter, I., Schacherer, F., and Wingender, E. Expanding the TRANSFAC database towards an expert system of regulatory molecular mechanisms. Nucl. Acids Res. 27.
Heinemeyer, T., Wingender, E., Reuter, I., Hermjakob, H., Kel, A.E., Kel, O.V., Ignatieva, E.V., Ananko, E.A., Podkolodnaya, O.A., Kolpakov, F.A., Podkolodny, N.L., and Kolchanov, N.A. Databases on transcriptional regulation: TRANSFAC, TRRD and COMPEL. Nucl. Acids Res. 26.
Horng, J.T., and Cho, W.F. Predicting regulatory elements in repetitive sequence using transcription factor binding sites. Electronic J. Biotechnology 3, 3, Dec. 15.
Horng, J.T., Huang, H.D., Huang, C.C., and Kao, C.Y. 2001a. Mining putative regulatory elements in gene promoter regions. Proc. German Conference on Bioinformatics.
Horng, J.T., Lin, J.H., and Kao, C.Y. 2001b. RSDB-A database of repetitive elements in complete genomes. Proc. Atlantic Symposium on Computational Biology and Genome Information Systems and Technology, Durham, NC.
Jurka, J. Repeats in genomic DNA: Mining and meaning. Curr. Opin. Struct. Biol. 8.
Li, W.H., Gu, Z., Wang, H., and Nekrutenko, A. Evolutionary analyses of the human genome. Nature 409.
Mewes, H.W., Heumann, K., Kaps, A., Mayer, K., Pfeiffer, F., Stocker, S., and Frishman, D. MIPS: A database for protein sequences and complete genomes. Nucl. Acids Res. 27.
Sargent, R., Fuhrman, D., Critchlow, T., Sera, T.D., Mecklenburg, R., Lindstrom, G., Schuler, G.D., Epstein, J.A., Ohkawa, H., and Kans, J.A. Entrez: Molecular biology database and retrieval system. Methods Enzymol. 266.
van Helden, J., Rios, A.F., and Collado-Vides, J. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol. 281.

Address correspondence to:
Jorng-Tzong Horng
Department of Computer Science and Information Engineering
National Central University
No. 38, Wu-chuan Li
Chung-li, 320, Taiwan
horng@db.csie.ncu.edu.tw


Giri Narasimhan. CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Evaluation. Course Homepage. CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools Giri Narasimhan ECS 389; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs06.html 1/12/06 CAP5510/CGS5166 1 Evaluation

More information

UDC Comparative Analysis of Methods Recognizing Potential Transcription Factor Binding Sites

UDC Comparative Analysis of Methods Recognizing Potential Transcription Factor Binding Sites Molecular Biology, Vol. 35, No. 6, 2001, pp. 818 825. Translated from Molekulyarnaya Biologiya, Vol. 35, No. 6, 2001, pp. 961 969. Original Russian Text Copyright 2001 by Pozdnyakov, Vityaev, Ananko, Ignatieva,

More information

Fitness constraints on horizontal gene transfer

Fitness constraints on horizontal gene transfer Fitness constraints on horizontal gene transfer Dan I Andersson University of Uppsala, Department of Medical Biochemistry and Microbiology, Uppsala, Sweden GMM 3, 30 Aug--2 Sep, Oslo, Norway Acknowledgements:

More information

Comparative genomics: Overview & Tools + MUMmer algorithm

Comparative genomics: Overview & Tools + MUMmer algorithm Comparative genomics: Overview & Tools + MUMmer algorithm Urmila Kulkarni-Kale Bioinformatics Centre University of Pune, Pune 411 007. urmila@bioinfo.ernet.in Genome sequence: Fact file 1995: The first

More information

(Fold Change) acetate biosynthesis ALD6 acetaldehyde dehydrogenase 3.0 acetyl-coa biosynthesis ACS2 acetyl-coa synthetase 2.5

(Fold Change) acetate biosynthesis ALD6 acetaldehyde dehydrogenase 3.0 acetyl-coa biosynthesis ACS2 acetyl-coa synthetase 2.5 Supplementary Table 1 Overexpression strains that showed resistance or hypersensitivity to rapamycin. Strains that were more than two-fold enriched on average (and at least 1.8-fold enriched in both experiment

More information

Identifying Correlations between Chromosomal Proximity of Genes and Distance of Their Products in Protein-Protein Interaction Networks of Yeast

Identifying Correlations between Chromosomal Proximity of Genes and Distance of Their Products in Protein-Protein Interaction Networks of Yeast Identifying Correlations between Chromosomal Proximity of Genes and Distance of Their Products in Protein-Protein Interaction Networks of Yeast Daniele Santoni 1 *, Filippo Castiglione 2, Paola Paci 1

More information

EXTRACTING GLOBAL STRUCTURE FROM GENE EXPRESSION PROFILES

EXTRACTING GLOBAL STRUCTURE FROM GENE EXPRESSION PROFILES EXTRACTING GLOBAL STRUCTURE FROM GENE EXPRESSION PROFILES Charless Fowlkes 1, Qun Shan 2, Serge Belongie 3, and Jitendra Malik 1 Departments of Computer Science 1 and Molecular Cell Biology 2, University

More information

Network by Weighted Graph Mining

Network by Weighted Graph Mining 2012 4th International Conference on Bioinformatics and Biomedical Technology IPCBEE vol.29 (2012) (2012) IACSIT Press, Singapore + Prediction of Protein Function from Protein-Protein Interaction Network

More information

INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA

INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA XIUFENG WAN xw6@cs.msstate.edu Department of Computer Science Box 9637 JOHN A. BOYLE jab@ra.msstate.edu Department of Biochemistry and Molecular Biology

More information

Computational methods for the analysis of bacterial gene regulation Brouwer, Rutger Wubbe Willem

Computational methods for the analysis of bacterial gene regulation Brouwer, Rutger Wubbe Willem University of Groningen Computational methods for the analysis of bacterial gene regulation Brouwer, Rutger Wubbe Willem IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's

More information

Received 17 January 2005/Accepted 22 February 2005

Received 17 January 2005/Accepted 22 February 2005 EUKARYOTIC CELL, May 2005, p. 849 860 Vol. 4, No. 5 1535-9778/05/$08.00 0 doi:10.1128/ec.4.5.849 860.2005 Copyright 2005, American Society for Microbiology. All Rights Reserved. A Two-Hybrid Screen of

More information

BMD645. Integration of Omics

BMD645. Integration of Omics BMD645 Integration of Omics Shu-Jen Chen, Chang Gung University Dec. 11, 2009 1 Traditional Biology vs. Systems Biology Traditional biology : Single genes or proteins Systems biology: Simultaneously study

More information

ppendix E - Growth Rates of Individual Genes on Various Non-Fementable Carbon So ORF Gene Lactate Glycerol EtOH Description

ppendix E - Growth Rates of Individual Genes on Various Non-Fementable Carbon So ORF Gene Lactate Glycerol EtOH Description ppendix E - Growth Rates of Individual Genes on Various Non-Fementable Carbon So ORF Gene Lactate Glycerol EtOH Description YAL1C MDM1 2 2 Mitochondrial outer membrane protein involved in mitochondrial

More information

Procedure to Create NCBI KOGS

Procedure to Create NCBI KOGS Procedure to Create NCBI KOGS full details in: Tatusov et al (2003) BMC Bioinformatics 4:41. 1. Detect and mask typical repetitive domains Reason: masking prevents spurious lumping of non-orthologs based

More information

Genomes and Their Evolution

Genomes and Their Evolution Chapter 21 Genomes and Their Evolution PowerPoint Lecture Presentations for Biology Eighth Edition Neil Campbell and Jane Reece Lectures by Chris Romero, updated by Erin Barley with contributions from

More information

BLAST. Varieties of BLAST

BLAST. Varieties of BLAST BLAST Basic Local Alignment Search Tool (1990) Altschul, Gish, Miller, Myers, & Lipman Uses short-cuts or heuristics to improve search speed Like speed-reading, does not examine every nucleotide of database

More information

Measure representation and multifractal analysis of complete genomes

Measure representation and multifractal analysis of complete genomes PHYSICAL REVIEW E, VOLUME 64, 031903 Measure representation and multifractal analysis of complete genomes Zu-Guo Yu, 1,2, * Vo Anh, 1 and Ka-Sing Lau 3 1 Centre in Statistical Science and Industrial Mathematics,

More information

Compositional Correlation for Detecting Real Associations. Among Time Series

Compositional Correlation for Detecting Real Associations. Among Time Series Compositional Correlation for Detecting Real Associations Among Time Series Fatih DIKBAS Civil Engineering Department, Pamukkale University, Turkey Correlation remains to be one of the most widely used

More information

Spatio-temporal dynamics of yeast mitochondrial biogenesis: transcriptional and post-transcriptional. mrna oscillatory modules.

Spatio-temporal dynamics of yeast mitochondrial biogenesis: transcriptional and post-transcriptional. mrna oscillatory modules. Spatio-temporal dynamics of yeast mitochondrial biogenesis: transcriptional and post-transcriptional mrna oscillatory modules. Gaëlle Lelandais, Yann Saint-Georges, Colette Geneix, Liza Al-Shikhley, Geneviève

More information

The yeast interactome (unit: g303204)

The yeast interactome (unit: g303204) The yeast interactome (unit: g303204) Peter Uetz 1 & Andrei Grigoriev 2 1 Institute of Genetics (ITG), Forschungszentrum Karlsruhe, Karlsruhe, Germany 2 GPC Biotech, Martinsried, Germany Addresses 1 Institute

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

Frequently Asked Questions (FAQs)

Frequently Asked Questions (FAQs) Frequently Asked Questions (FAQs) Q1. What is meant by Satellite and Repetitive DNA? Ans: Satellite and repetitive DNA generally refers to DNA whose base sequence is repeated many times throughout the

More information

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson Grundlagen der Bioinformatik, SS 10, D. Huson, April 12, 2010 1 1 Introduction Grundlagen der Bioinformatik Summer semester 2010 Lecturer: Prof. Daniel Huson Office hours: Thursdays 17-18h (Sand 14, C310a)

More information

Computational Biology: Basics & Interesting Problems

Computational Biology: Basics & Interesting Problems Computational Biology: Basics & Interesting Problems Summary Sources of information Biological concepts: structure & terminology Sequencing Gene finding Protein structure prediction Sources of information

More information

Related Courses He who asks is a fool for five minutes, but he who does not ask remains a fool forever.

Related Courses He who asks is a fool for five minutes, but he who does not ask remains a fool forever. CSE 527 Computational Biology http://www.cs.washington.edu/527 Lecture 1: Overview & Bio Review Autumn 2004 Larry Ruzzo Related Courses He who asks is a fool for five minutes, but he who does not ask remains

More information

Bayesian inference for stochastic kinetic models. intracellular reaction networks

Bayesian inference for stochastic kinetic models. intracellular reaction networks for stochastic models of intracellular reaction networks Darren Wilkinson School of Mathematics & Statistics and Centre for Integrated Systems Biology of Ageing and Nutrition Newcastle University, UK IMA

More information

Hub Gene Selection Methods for the Reconstruction of Transcription Networks

Hub Gene Selection Methods for the Reconstruction of Transcription Networks for the Reconstruction of Transcription Networks José Miguel Hernández-Lobato (1) and Tjeerd. M. H. Dijkstra (2) (1) Computer Science Department, Universidad Autónoma de Madrid, Spain (2) Institute for

More information

HOBACGEN: Database System for Comparative Genomics in Bacteria

HOBACGEN: Database System for Comparative Genomics in Bacteria Resource HOBACGEN: Database System for Comparative Genomics in Bacteria Guy Perrière, 1 Laurent Duret, and Manolo Gouy Laboratoire de Biométrie et Biologie Évolutive, Unité Mixte de Recherche Centre National

More information

Analysis of Yeast s ORF Upstream Regions by Parallel Processing, Microarrays, and Computational Methods

Analysis of Yeast s ORF Upstream Regions by Parallel Processing, Microarrays, and Computational Methods Analysis of Yeast s ORF Upstream Regions by Parallel Processing, Microarrays, and Computational Methods Steven Hampson Dept. of Information and Computer Science University of California, Irvine Irvine,

More information

Transcription Regulation and Gene Expression in Eukaryotes FS08 Pharmacenter/Biocenter Auditorium 1 Wednesdays 16h15-18h00.

Transcription Regulation and Gene Expression in Eukaryotes FS08 Pharmacenter/Biocenter Auditorium 1 Wednesdays 16h15-18h00. Transcription Regulation and Gene Expression in Eukaryotes FS08 Pharmacenter/Biocenter Auditorium 1 Wednesdays 16h15-18h00. Promoters and Enhancers Systematic discovery of transcriptional regulatory motifs

More information

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Homology Modeling. Roberto Lins EPFL - summer semester 2005 Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,

More information

Bacterial genome chimaerism and the origin of mitochondria

Bacterial genome chimaerism and the origin of mitochondria 49 Bacterial genome chimaerism and the origin of mitochondria Ankur Abhishek, Anish Bavishi, Ashay Bavishi, and Madhusudan Choudhary Abstract: Many studies have sought to determine the origin and evolution

More information

Large-scale phenotypic analysis reveals identical contributions to cell functions of known and unknown yeast genes

Large-scale phenotypic analysis reveals identical contributions to cell functions of known and unknown yeast genes Yeast Yeast 2001; 18: 1397 1412. DOI: 10.1002 /yea.784 Yeast Functional Analysis Report Large-scale phenotypic analysis reveals identical contributions to cell functions of known and unknown yeast genes

More information

6.096 Algorithms for Computational Biology. Prof. Manolis Kellis

6.096 Algorithms for Computational Biology. Prof. Manolis Kellis 6.096 Algorithms for Computational Biology Prof. Manolis Kellis Today s Goals Introduction Class introduction Challenges in Computational Biology Gene Regulation: Regulatory Motif Discovery Exhaustive

More information

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

Defining the Budding Yeast Chromatin Associated Interactome

Defining the Budding Yeast Chromatin Associated Interactome Supplementary Data Defining the Budding Yeast Chromatin Associated Interactome Jean-Philippe Lambert 1, Jeffrey Fillingham 2,3, Mojgan Siahbazi 1, Jack Greenblatt 3, Kristin Baetz 1, Daniel Figeys 1 1

More information

Growth of Yeast, Saccharomyces cerevisiae, under Hypergravity Conditions

Growth of Yeast, Saccharomyces cerevisiae, under Hypergravity Conditions Syracuse University SURFACE Syracuse University Honors Program Capstone Projects Syracuse University Honors Program Capstone Projects Spring 5-1-2011 Growth of Yeast, Saccharomyces cerevisiae, under Hypergravity

More information

AN ALGORITHM FOR IDENTIFYING CLUSTERS OF FUNCTIONALLY RELATED GENES IN GENOMES. A Thesis GANG MAN YI

AN ALGORITHM FOR IDENTIFYING CLUSTERS OF FUNCTIONALLY RELATED GENES IN GENOMES. A Thesis GANG MAN YI AN ALGORITHM FOR IDENTIFYING CLUSTERS OF FUNCTIONALLY RELATED GENES IN GENOMES A Thesis by GANG MAN YI Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the

More information

Hierarchical modelling of automated imaging data

Hierarchical modelling of automated imaging data Darren Wilkinson http://tinyurl.com/darrenjw School of Mathematics & Statistics, Newcastle University, UK Theory of Big Data 2 UCL, London 6th 8th January, 2016 Overview Background: Budding yeast as a

More information

Regulatory Sequence Analysis. Sequence models (Bernoulli and Markov models)

Regulatory Sequence Analysis. Sequence models (Bernoulli and Markov models) Regulatory Sequence Analysis Sequence models (Bernoulli and Markov models) 1 Why do we need random models? Any pattern discovery relies on an underlying model to estimate the random expectation. This model

More information

The Eukaryotic Genome and Its Expression. The Eukaryotic Genome and Its Expression. A. The Eukaryotic Genome. Lecture Series 11

The Eukaryotic Genome and Its Expression. The Eukaryotic Genome and Its Expression. A. The Eukaryotic Genome. Lecture Series 11 The Eukaryotic Genome and Its Expression Lecture Series 11 The Eukaryotic Genome and Its Expression A. The Eukaryotic Genome B. Repetitive Sequences (rem: teleomeres) C. The Structures of Protein-Coding

More information

Correlations between Shine-Dalgarno Sequences and Gene Features Such as Predicted Expression Levels and Operon Structures

Correlations between Shine-Dalgarno Sequences and Gene Features Such as Predicted Expression Levels and Operon Structures JOURNAL OF BACTERIOLOGY, Oct. 2002, p. 5733 5745 Vol. 184, No. 20 0021-9193/02/$04.00 0 DOI: 10.1128/JB.184.20.5733 5745.2002 Copyright 2002, American Society for Microbiology. All Rights Reserved. Correlations

More information

Introduction to Bioinformatics. Shifra Ben-Dor Irit Orr

Introduction to Bioinformatics. Shifra Ben-Dor Irit Orr Introduction to Bioinformatics Shifra Ben-Dor Irit Orr Lecture Outline: Technical Course Items Introduction to Bioinformatics Introduction to Databases This week and next week What is bioinformatics? A

More information

Percentage of genomic features annotated in the S. cerevisiae reference genome covered by different S. cerevisiae assemblies.

Percentage of genomic features annotated in the S. cerevisiae reference genome covered by different S. cerevisiae assemblies. Supplementary Figure 1 Percentage of genomic features annotated in the S. cerevisiae reference genome covered by different S. cerevisiae assemblies. Our PacBio assembly (strain S288C, underscored) shows

More information