DOOR: a prokaryotic operon database for genome analyses and functional inference

Size: px

Start display at page:

Download "DOOR: a prokaryotic operon database for genome analyses and functional inference"

Lucinda Golden
6 years ago
Views:

1 Briefings in Bioinformatics, 2017, 1 8 doi: /bib/bbx088 Software Review DOOR: a prokaryotic operon database for genome analyses and functional inference Huansheng Cao*, Qin Ma*, Xin Chen and Ying Xu Corresponding author: Ying Xu, Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA. Tel xyn@uga.edu *These authors contributed equally to this work. Abstract The rapid accumulation of fully sequenced prokaryotic genomes provides unprecedented information for biological studies of bacterial and archaeal organisms in a systematic manner. Operons are the basic functional units for conducting such studies. Here, we review an operon database DOOR (the Database of prokaryotic OpeRons) that we have previously developed and continue to update. Currently, the database contains computationally predicted operons in 2072 complete genomes. In addition, the database also contains the following information: (i) transcriptional units for 24 genomes derived using publicly available transcriptomic data; (ii) orthologous gene mapping across genomes; (iii) 6408 cisregulatory motifs for transcriptional factors of some operons for 203 genomes; (iv) Rho-independent terminators for 2072 genomes; as well as (v) a suite of tools in support of applications of the predicted operons. In this review, we will explain how such data are computationally derived and demonstrate how they can be used to derive a wide range of higherlevel information needed for systems biology studies to tackle complex and fundamental biology questions. Key words: DOOR; operon database; transcription unit; genomic organization; gene regulation Introduction The rapid advancement of genome-scale sequencing techniques in the past decade has made it possible to have a microbial genome sequenced in a highly efficient and affordable manner. It has now become routine to have such genomes sequenced by individual labs. To date, 6917 complete genomes and draft genomes have been sequenced and organized into the NCBI Genome database. In addition, 5415 complete prokaryotic genomes and draft genomes are stored at the JGI Integrated Microbial Genome and Microbiome Samples (JGI IMG/M) database ( [1]. Over 175 million genes have been computationally predicted in the sequenced prokaryotic genomes, with 171 million (97.7%) being protein-encoding genes and 4 million (2.3%) RNA genes. These genomic data have enabled studies of a wide range of complex and previously unsolvable questions in a systematic manner, such as inferring and modeling the dynamic behaviors of regulatory networks in prokaryotic organisms. These studies become possible because of the unique property of prokaryotes, functionally closely related genes tend to be organized into operons, i.e. clusters of genes arranged in tandem on the same strand of a genome, which share a common promoter and terminator [2]. It is noteworthy to point out that the concept of operon has evolved since the discovery and characterization of the lac operon in Escherichia coli by French scientists Jacob and Monod in 1960 [2]. Based on the original definition, an extra technical criterion has been added to operon: operons encoded in a genome do not overlap with each other [3, 4]. This simplified definition Huansheng Cao is a Postdoctoral Researcher in the Department of Biochemistry and Molecular Biology and Institute of Bioinformatics at the University of Georgia. His research interests include microbial systems biology. Qin Ma is an Assistant Professor in the Department of Agronomy, Horticulture and Plant Science at South Dakota State University. He is interested in bioinformatic algorithms/tools development in transcriptional regulation. Xin Chen is an Assistant Professor in the Center for Applied Mathematics at Tianjin University, China; his research interests include microbial and plant genome analysis. Ying Xu is the Regents-GRA Eminent Scholar Chair Professor in the Department of Biochemistry and Molecular Biology and Institute of Bioinformatics of the University of Georgia. His interests include microbial and cancer systems biology. Submitted: 2 May 2017; Received (in revised form): 13 June 2017 VC The Author Published by Oxford University Press. All rights reserved. For Permissions, please journals.permissions@oup.com 1

2 2 Cao et al. has enabled computational scientists to predict operons based on genomic sequence(s) alone [5]. Therefore, all the complete prokaryotic genomes have their operons predicted. However, the downside is: these computationally predicted operons only represent a small subset of all the operons encoded in a genome. With the transcriptomic data becoming increasingly available for prokaryotes, researchers have realized that an operon may have different variations in terms of its component genes expressed under different conditions, termed transcriptional units (TUs) [6]. Clearly, full elucidation of all the TUs of an operon is considerably more challenging than sequence-based prediction of operons, as they are condition-dependent. Currently, there are no published algorithms for TU prediction based on genomic sequences alone, as such capability needs to be able to accurately identify promoter terminator pairs that are used under varying conditions. TUs are now being identified through analyses of transcriptomic data collected under a few conditions only for a small number of genomes [7]. The current version of DOOR (the Database of prokaryotic OpeRons) consists of both types of operons: operons predicted based on sequences and limited TUs identified using transcriptomic data [8]. Because of serving as the basic functional and TUs as well as the backbones of TUs, operons are important elements for the study of gene regulation and functional adjustment to different environments. For example, some important pathways tend to be organized into a few transcriptionally co-regulated operons [9, 10]. Through these tightly co-expressed operons, the regulatory motifs can be identified, which in turn will reveal global regulatory circuits. Besides regulation of operons on a genome scale, comparative studies of these functional units can reveal pathway conservation and operon evolution among prokaryotes [11, 12], and provide more accurate prediction of orthologous genes across different prokaryotic genomes [11], compared with gene-centric methods for ortholog mapping. Furthermore, the positions of the operons in a genome and their functional association provide a unique and practical angle to demonstrate the global organizing principle in these organisms [13]. In this article, we review the status of the DOOR database in terms of its basic content, as well as a suite of tools that enable the user to fully use the predicted operons and identified TUs for functional studies of a prokaryotic organism. Some major applications through some of the third-party work that has used DOOR are summarized. On this basis of DOOR and DOORsupporting tools, we will introduce operon regulation, evolution and organization on local and global scale in prokaryotes. The entire contents covered are illustrated in Figure 1. The DOOR database Operon databases and DOOR Owing to this fundamental importance, several operon databases have been developed and deployed for public service, including OperonDB [14], ProOpDB [15], ODB [16], rrndb [17] and DOOR [8, 18]. The first DOOR database was developed in Its operon prediction algorithm looks for a maximal sequence of adjacent genes on the same DNA strand without disruption on the opposite strand such that each pair of the adjacent genes has conserved neighboring relationship in some other genomes; their intergenic distances are within some threshold; and are functionally related [8, 18]. This simple working definition of operon has made the prediction both reliable and stable. Specifically, the core of the algorithm is a classification method. The program classifies each pair of adjacent genes into two classes: in or not in the same operon, using five features: the intergenic distance, conservation level of the two genes in the same neighborhood across other genomes, functional relatedness measured using phylogenetic distances between the two genes, the ratio between the lengths of the two genes and frequencies of certain predefined DNA motifs in their intergenic region [5]. Among these features, the intergenic distance is the highest discerning feature in predicting if a pair of adjacent genes is in the same operon. The training of the classifier was done on experimentally validated operons in a few organisms, including E. coli and Bacillus subtilis. The operon prediction pipeline is given in Figure 2. Two separate classifiers are trained and used depending on whether the target genome has a substantial number of experimentally identified operons. For those with such identified operons, a nonlinear decision tree-based classifier is used to make the prediction. When tested on B. subtilis and E. coli genomes, the prediction program achieved accuracy levels at 90.2 and 93.7%, respectively [5]. For genomes without such operons, a linear logistic classifier is found to be most reliable, with the minimum false-positive rates [5]. When tested on E. coli and B. subtilis without applying their operon information, this predictor has prediction accuracies at 84.6 and 83.3%, respectively [7]. The prediction program of DOOR was ranked as the best operon predictor by an independent review in 2013 [19] in terms of prediction accuracy [18]. The first version of DOOR had operons for 675 complete prokaryotic genomes, totaling operons, predicted using the above algorithms. The DOOR database supports the following tools to enable a user to use its operons for functional studies: (i) a search tool allowing a user to find operons by gene names and operon size in desired genomes; (ii) a prediction capability for cis-regulatory binding sites using a combination of two predictors MEME [20] and CUBIC [21], which enables a user to search for potentially co-regulated operons in a genome; (iii) general statistics about operons encoded in each genome along with the relevant literature; and (iv) an operonwiki to facilitate interactions between a user and the developer [18]. DOOR2: more genomes and transcriptomic data With the rapid increase in the complete prokaryotic genomes and the growing popularity of DOOR, we developed an updated version of the database, DOOR2, in 2013 [8]. Two major changes are made in this version. First, the number of genomes is increased to 2072, three times of the original size, which consists of 1939 complete bacterial genomes and 133 complete archaeal genomes, covering 2205 chromosomes and 1645 plasmids [8]. For these genomes, multigene operons is predicted, averaging 583 such operons per chromosome and 24 operons per plasmid, along with single-gene operons. In addition, the database has 6408 verified cis-regulatory binding motifs for transcription factors spanning 203 genomes, and Rho-independent terminators for 2072 genomes. Furthermore, the database has pairs of conserved operons across different genomes. Functionally, housekeeping biological processes such as cell division, translation, electron transport chain for aerobic/anaerobic respiration, ATP production and thiamine biosynthesis have the highest percentage of conserved operons, as exemplified in Figure 3. Given the increasing availability of RNA-seq data for complete genomes, DOOR2 also includes such data from the public domain, which can be used to elucidate TUs encoded in a

(C) TUs for DOOR operons, derived based on available transcriptomic data. (D) Co-expressed genes and operons predicted through bi-clustering analyses.

3 DOOR 3 Figure 1. The DOOR operon database and information derivable from the database. (A) A schematic for consecutive operons in two genomes (blue and magenta), which are further grouped into uber-operons. (B) Predicted cis-regulatory motifs for operons. (C) TUs for DOOR operons, derived based on available transcriptomic data. (D) Co-expressed genes and operons predicted through bi-clustering analyses. (E) New knowledge regarding an organizing principle that governs the global arrangement of operons in a genome. Figure 2. The computational pipeline for operon prediction in DOOR. genome. Currently, DOOR2 contains predicted TUs for 24 genomes, which we expect to increase significantly in the next release of the database. Like the previous version, DOOR2 provides the following tools in support of basic and advanced applications of its operon information: (i) a predictor for TUs based on given operons and RNA-seq data of a specified genome, in support of inference of the dynamic variants of the operons under specified conditions; (ii) a computer program for mapping orthologous genes across genomes, in support of analyses of operon evolution and biological pathway mapping across genomes; (iii) a search capability, developed using MySQL, to enable a user to retrieve information stored in the database, including operons, TUs, cis-regulatory motifs and terminators for individual operons and conserved operons across multiple genomes using queries containing terms such as species name, operon ID and/or gene name; (iv) operon predictors as described in the previous section, which allows a user to predict operons in a genome that is not currently in the DOOR2 database; and (v) a genome browser is also developed to support visualization of selected operons along with TU structures under multiple conditions, if such data are available [8]. For operon prediction in a specified genome, DOOR2 requires three input files: a file containing gene locations in the target genome (*.gff or *.gtf file), protein sequences (*.faa file) and the nucleotide sequence (*.fna file) of the entire genome, where the three files are used to derive the genomic features needed to train our classification model (Figure 2). For output, DOOR2 automatically sends an to the user once the computing job is done, which contains a link to the results page, containing various information as shown in Figure 4. DOOR2 has been widely applied by researchers in a variety of areas. Overall, three types of technical applications have been frequently used by users of the database and tools: operon data retrieval, operon prediction for newly sequenced genomes and utility tools for new information discovery (Figure 5). These applications also span a wide range of research areas. For example, a specific application area is in biomedical research. In this regard, DOOR is usually used for prediction or retrieval for operons involved in virulence [22] or regulation [23]. Besides these individual third-party applications, we introduce the general applications of our DOOR-based tools in further deriving biology knowledge from available prokaryotic genomes in the following sections. From genomic structures to functional inference With the rapid accumulation of large quantities of gene expression data generated for various prokaryotes, researchers can study the functionalities of the predicted operons in a systematic manner. Here, we highlight a few examples to demonstrate how functional analyses can be done through integrated

4 4 Cao et al. Figure 3. The main biological functions of conserved operons in E. coli. The scatterplot shows the cluster representatives based on the GO terms functional similarities, where each bubble represents a GO Biological Function and overlapping bubbles are for overlapping GO functions (e.g. the electron transport chain for aerobic/anaerobic respiration). The size of a bubble represents the level of a term in the GO s functional hierarchy, specifically the higher a term in the hierarchy, the larger the bubble, and the color of a bubble indicates the P-value. analyses of operons and gene expression data, including TU prediction, cis-regulatory motif prediction and analysis and co-expression analyses of operons (Figure 1B E). TU prediction based on gene expression data We have developed a program, SeqTU, for predicting TUs based on provided RNA-seq data (Figure 1B and C) [6]. SeqTU predicts the genomic boundaries of TUs based on two features measuring the RNA-seq expression patterns across a target genome: expression-level continuity and variance. Intuitively, two consecutive genes in the same TU should not (i) have substantial gaps in the expression profile across the intergenic region between the two genes, based on the definition of a TU; and (ii) have large variations in their gene expression levels across the two coding regions. SeqTU predicts a pair of consecutive genes on the same genomic strand to be in the same TU if both conditions are met [6]. A total of 2590 TUs have been predicted based on available RNA-seq using SeqTU [6, 7] and stored in DOOR2. In total of 44% of these TUs have multiple genes. Figure 4B shows one example of multiple TUs of the same E. coli operon. SeqTU has a training program that allows users to train an organism-specific TU predictor based on user-provided gene expression data. The software is now part of the DOOR2 package [8]. Prediction and application of cis-regulatory motifs We developed a computational algorithm BOBRO for the prediction of cis-regulatory binding sites in prokaryotic genomes [24] and implemented it as a Web server DMINDA [25, 26]. This Webbased tool has been integrated into the DOOR2 package in support of cis-regulatory motif prediction for identified operons and TUs (Figure 1B). Such a capability has proved to be essential for elucidation of transcription regulatory networks encoded in a prokaryotic genome. The cis-motif prediction program runs as follows: it first assesses the possibility of each position in a promoter region to be the start of a cis-regulatory motif, and it then distinguishes the actual conserved motifs from the accidental ones based on the concept of motif closure [24]. These two steps make BOBRO substantially sensitive and selective in conserved motif identification against a noisy background.

(C) A display of validated or predicted transcription factor binding sites (the left bottom) and Rho-independent terminators (on the right). (D) conserved operons. Figure 5.

5 DOOR 5 Figure 4. Displays of computational results by DOOR. (A) A screenshot of a display window. (B) A display of TUs, with the red bars representing genes, the first row of the blue bars representing multigene operons and the following rows of blue bars being TUs under different conditions. (C) A display of validated or predicted transcription factor binding sites (the left bottom) and Rho-independent terminators (on the right). (D) conserved operons. Figure 5. The usage of DOOR and DOOR2, measured by the number of citations, and the types of applications of the DOOR information by third-party publications. A key application of such a prediction capability is to identify operons or TUs that share conserved cis-regulatory binding sites in their promoter regions, for prediction of transcriptionally coregulated and possibly functionally related operons. These operons or TUs constitute a regulon regulated by a transcription factor that binds to such conserved motifs, which may be involved in functionally associated pathways in response to specific conditions. To further support such applications, we have integrated a new feature: identification of co-occurring cisregulatory motifs in the same promoter region, into the motif prediction package. This feature enables elucidation of coregulated operons/tus by multiple transcription factors [27, 28].

6 Cao et al. Figure 6. (A) Co-expressed genes/operons in two bi-clusters identified in E. coli under a stationary growth condition. (B) Co-expression networks among bi-clusters.

6 6 Cao et al. Figure 6. (A) Co-expressed genes/operons in two bi-clusters identified in E. coli under a stationary growth condition. (B) Co-expression networks among bi-clusters. Green nodes represent genes in bi-cluster #1 and red nodes for genes in bi-cluster #2. The larger a node, the higher its degree of presence in the co-expression network, and the thicker an edge, the higher the co-expression level. Co-expression analyses of operons A complementary approach to inference of transcriptionally coregulated operons and TUs to the above motif-based method is through identification of co-expressions of such operons. Knowing that transcriptional co-regulation of operons/tus are generally condition-dependent, we only aim to find operons/ TUs that are co-expressed under some but not necessarily all conditions. This requires a class of more powerful clustering techniques than the traditional clustering methods, namely, biclustering algorithms, also referred to as two-dimensional clustering [29, 30]. These techniques should be able to detect groups of operons/tus, whose correlation (co-expression) exists under (to be identified) subsets of all conditions. We have previously developed a bi-clustering algorithm QUBIC using a graph-theoretical approach [31 33] (Figure 1D). Compared with similar bi-clustering programs, QUBIC solves the general bi-clustering problems more efficiently, and is capable to solving problems with tens of thousands of genes and up to thousands of conditions in a few minutes of the CPU time on a desktop computer. This program will be integrated into the next version of the DOOR database (under development). Figure 6 shows an example of two large bi-clusters (Figure 6A) and their levels of co-expressions (Figure 6B). Operon evolution and global genomic organization Here, we show four examples to illustrate how operon data can be used to derive higher-level and more complex information encoded in prokaryotic genomes. Uber-operons: the footprint of operon evolution By studying conservation among groups of operons that are functionally related, we found that some unions of operons have conserved gene list across genomes, with each such union of operons called a uber-operon [34]. We have previously developed an algorithm for uber-operon prediction through identifying two groups of functionally related operons in a target genome versus a set of reference genomes, whose gene sets are conserved between the target and the reference genomes [12]. Using this algorithm, we have predicted 158 uber-operons containing 1830 genes (40% of all genes) in E. coli K12 against 90 reference genomes [12]. Interestingly, the level of functional relatedness among genes in the same uber-operons is higher than that among genes in the same regulons but lower than that among genes in the same operons in E. coli K12. The functional relatedness is measured based on closeness of KEGG pathways and path distance of gene ontology (GO) terms of the genes [35]. These different levels of functional relatedness between gene scopes suggest that genes in the same uber-operons are functionally related. As the component operons in the same uberoperons may not necessarily be transcriptionally co-regulated, uber-operons provide a novel way to derive functionally associated operons, such as pathways, independent of regulons [12], each of which represents a collection of transcriptionally coregulated operons. Orthologous gene identification based on operons Accurate identification of orthologous genes across genomes is the most basic step in comparative genomics but remains to be solved. We developed a computer tool GOST [11] for orthologous gene mapping across prokaryotic genomes. Compared with all the earlier methods that solely rely on gene sequence similarity, GOST identifies orthologous genes among homologous genes with an additional criterion: orthologous genes between two genomes should have homologous working partners in their respective genomes. Two genes in a genome are considered as working partners if they belong to a common operon or uberoperon [34, 35]. Figure 7 shows a few examples of accurately identified orthologous genes across two genomes [12]. Pathway mapping based on conservation of operons across genomes Accurate mapping of known biological pathways from one prokaryotic organism to another is an important issue when annotating newly sequenced prokaryotic genomes. Existing methods based on gene mapping through sequence homology often leads to incorrect results because sequence similarities alone do not contain sufficient information for correct identification of orthologs. We have developed an algorithm, P-MAP, for pathway mapping across genomes of different prokaryotic species, which integrates both sequence similarity and genomic synteny information defined by operons [36, 37]. Specifically, the algorithm identifies orthologous genes across genomes among

7 DOOR 7 Summary and outlook Operons and TUs are the basic functional units in prokaryotes. Their accurate identification serves as the basis of a wide range of biological studies of prokaryotes. We have developed and maintained an operon database DOOR in the past decade as our service to the research community of bacterial and archaeal organisms. Here, we reviewed the information stored in the DOOR database along with a suite of tools associated with the database in support of operon-based analyses and information discovery. In addition, we have showed a few examples of how operons can be used to not only derive higher-level functional information encoded in a genome but also enable computational studies of fundamental principles in terms of the global organization of operons in a genome. A major challenge to us, the developer of the DOOR database, is to keep up with the rate of prokaryotic genome sequencing. We are currently working toward the next version of the DOOR database to include 7000 complete prokaryotic genomes, along with deployment of additional tools needed to mine such large collections of operons and TUs. Key Points Figure 7. Two examples of orthologous gene prediction. (A) The two red block arrows represent two orthologous genes, which are identified based on the information that they share homologous working partners, represented by the yellow block arrows, for which the popular program RBH failed to identify. (B) Two red block arrows represent two orthologous genes, identified based on the information that they share homologous working partners (the yellow block arrows). homologous genes, which also share homologous working partners. Our analysis on a number of known homologous pathways shows that using genomic synteny information as constraints can greatly improve the pathway-mapping accuracy over methods that use sequence similarity information alone [36]. DOOR is an operon database consisting of multigene operons and single-gene operons in 2072 complete genomes. To facilitate using and mining the data in DOOR, a suite of utility tools has been developed and integrated into DOOR. With the increasing availability of genome-scale transcriptomic data for prokaryotes, increasingly more TUs have been integrated into DOOR, to facilitate systems-level studies of the biology of prokaryotes. The availability of the genome-scale operons and TUs has enabled a new class of functional analyses of prokaryotes as well as fundamental research into the general principles of prokaryotic genomes. Elucidation of the global organizing principle of operons in prokaryotic genomes The availability of genome-scale operons has also made it possible to study whether the global arrangement of operons in a prokaryotic genome is arbitrarily determined or follows some organizing rules, beyond functional studies of prokaryotic genomes. We have previously conducted a series of studies on potential determinants in the global arrangement of operons in a genome [9, 13, 38]. One observation is that operons for pathways with higher frequencies of transcriptional activation tend to be more closely clustered together in the host genome [13]. Further analyses revealed that such pathways with higher frequencies of transcriptional activation tend to have their operons arranged in a fewer supercoils, the basic folding units in a prokaryotic DNA, which are typically kbps long [10]. Our postulation is that prokaryotes have evolved to minimize the total energy needed for folding and unfolding supercoils to enable transcription of operons that need to be activated throughout the life cycle of the organism. Our computational simulation analyses through randomly reshuffling a genome strongly confirmed our hypothesis [9, 13, 38]. This represents the first determinant for the global arrangements of operons in a prokaryotic genome. Funding This work was supported by a grant from the U.S. Department of Energy (grant number #DE-PS02-06ER64304). The BioEnergy Science Center (BESC) is supported by the Office of Biological and Environmental Research in the DOE Office of Science. This work was also supported by National Science Foundation/ EPSCoR Award No. IIA , the State of South Dakota Research Innovation Center, and the Agriculture Experiment Station of South Dakota State University. References 1. Chen IMA, Markowitz VM, Chu K, et al. IMG/M: integrated genome and metagenome comparative data analysis system. Nucleic Acids Res 2016;45:D Jacob F, Perrin D, Sanchez C, et al. The operon: a group of genes with expression coordinated by an operator. C R Acad Sci Paris 1960;250: Salgado H, Moreno-Hagelsieb G, Smith TF, et al. Operons in Escherichia coli: genomic analyses and predictions. Proc Natl Acad Sci USA 2000;97: Craven M, Page D, Shavlik JW, et al. A probabilistic learning approach to whole-genome operon prediction. In: Proceedings of International Conference on Intelligent Systems for Molecular

8 8 Cao et al. Biology, 2000, p USA, AAAI Press, San Diego, California. 5. Dam P, Olman V, Harris K, et al. Operon prediction using both genome-specific and general genomic information. Nucleic Acids Res 2007;35: Chou W-C, Ma Q, Yang S, et al. Analysis of strand-specific RNA-seq data using machine learning reveals the structures of transcription units in Clostridium thermocellum. Nucleic Acids Res 2015;43:e Chen X, Chou W-C, Ma Q, et al. SeqTU: a web server for identification of bacterial transcription units. Sci Rep 2017;7: Mao X, Ma Q, Zhou C, et al. DOOR 2.0: presenting operons and their functions through dynamic and integrated views. Nucleic Acids Res 2014;42:D Ma Q, Xu Y. Global genomic arrangement of bacterial genes is closely tied with the total transcriptional efficiency. Genomics Proteomics Bioinformatics 2013;11: Ma Q, Yin Y, Schell MA, et al. Computational analyses of transcriptomic data reveal the dynamic organization of the Escherichia coli chromosome under different conditions. Nucleic Acids Res 2013;41: Li G, Ma Q, Mao X, et al. Integration of sequence-similarity and functional association information can overcome intrinsic problems in orthology mapping across bacterial genomes. Nucleic Acids Res 2011;39:e Che D, Li G, Mao F, et al. Detecting uber-operons in prokaryotic genomes. Nucleic Acids Res 2006;34: Yin Y, Zhang H, Olman V, et al. Genomic arrangement of bacterial operons is constrained by biological pathways encoded in the genome. Proc Natl Acad Sci USA 2010;107: Pertea M, Ayanbule K, Smedinghoff M, et al. OperonDB: a comprehensive database of predicted operons in microbial genomes. Nucleic Acids Res 2009;37:D Taboada B, Ciria R, Martinez-Guerrero CE, et al. ProOpDB: prokaryotic operon database. Nucleic Acids Res 2012;40:D Okuda S, Katayama T, Kawashima S, et al. ODB: a database of operons accumulating known operons across multiple genomes. Nucleic Acids Res 2006;34:D Stoddard SF, Smith BJ, Hein R, et al. rrndb: improved tools for interpreting rrna gene abundance in bacteria and archaea and a new foundation for future development. Nucleic Acids Res 2015;43:D Mao F, Dam P, Chou J, et al. DOOR: a database for prokaryotic operons. Nucleic Acids Res 2009;37:D Brouwer RWW, Kuipers OP, van Hijum SAFT. The relative value of operon predictions. Brief Bioinform 2008;9: Bailey TL, Williams N, Misleh C, et al. MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res 2006;34:W369 W Olman V, Xu D, Xu Y. CUBIC: identification of regulatory binding sites through data clustering. J Bioinform Comput Biol 2003; 1: Arthur JC, Gharaibeh RZ, Mühlbauer M, et al. Microbial genomic analysis reveals the essential role of inflammation in bacteria-induced colorectal cancer. Nat Commun 2014;5: Sorek R, Cossart P. Prokaryotic transcriptomics: a new view on regulation, physiology and pathogenicity. Nat Rev Genet 2010;11: Li G, Liu B, Ma Q, et al. A new framework for identifying cisregulatory motifs in prokaryotes. Nucleic Acids Res 2011;39:e Ma Q, Zhang H, Mao X, et al. DMINDA: an integrated web server for DNA motif identification and analyses. Nucleic Acids Res 2014;42:W12 W Yang J, Chen X, McDermaid A, et al. DMINDA 2.0: integrated and systematic views of regulatory DNA motif identification and analyses. Bioinformatics 2017, in pres. 27. Ma Q, Liu B, Zhou C, et al. An integrated toolkit for accurate prediction and analysis of cis-regulatory motifs at a genome scale. Bioinformatics 2013;29: Liu B, Zhang H, Zhou C, et al. An integrative and applicable phylogenetic footprinting framework for cis-regulatory motifs identification in prokaryotic genomes. BMC Genomics 2016;17: Prelic A, Bleuler S, Zimmermann P, et al. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 2006;22: Tanay A, Sharan R, Shamir R. Discovering statistically significant biclusters in gene expression data. Bioinformatics 2002; 18:S136 S Li G, Ma Q, Tang H, et al. QUBIC: a qualitative biclustering algorithm for analyses of gene expression data. Nucleic Acids Res 2009;37:e Zhou F, Ma Q, Li G, et al. QServer: a biclustering server for prediction and assessment of co-expressed gene clusters. PLoS One 2012;7:e Zhang Y, Xie J, Yang J, et al. QUBIC: a bioconductor package for qualitative biclustering analysis of gene co-expression data. Bioinformatics 2017;33: Lathe WC, III, Snel B, Bork P. Gene context conservation of a higher order than operons. Trends Biochem Sci 2000;25: Wu H, Su Z, Mao F, et al. Prediction of functional modules based on comparative genome analysis and Gene Ontology application. Nucleic Acids Res 2005;33: Mao F, Su Z, Olman V, et al. Mapping of orthologous genes in the context of biological pathways: an application of integer programming. Proc Natl Acad Sci USA 2006;103: Liu B, Zhou C, Li G, et al. Bacterial regulon modeling and prediction based on systematic cis regulatory motif analyses. Sci Rep 2016;6: Ma Q, Chen X, Liu C, et al. Understanding the commonalities and differences in genomic organizations across closely related bacteria from an energy perspective. Sci China Life Sci 2014;57:

DOOR: A microbial operon database for gene and genome organization and function discovery

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 DOOR: A microbial operon database for gene and genome organization and function discovery Huansheng Cao, Qin Ma, Xin