Causes for the Large Genome Size in a Cyanobacterium


Genome Informatics 15(1): (2004) 229

Causes for the Large Genome Size in a Cyanobacterium Anabaena sp. PCC7120

Nobuyoshi Sugaya 1 (sugaya@ims.u-tokyo.ac.jp), Makihiko Sato 1,2 (makihiko@ims.u-tokyo.ac.jp), Hiroo Murakami 1 (hiroo@ims.u-tokyo.ac.jp), Akira Imaizumi 1,3 (akima@ims.u-tokyo.ac.jp), Sachiyo Aburatani 1 (sachiyo@ims.u-tokyo.ac.jp), Katsuhisa Horimoto 1 (khorimot@ims.u-tokyo.ac.jp)

1 Laboratory of Biostatistics, Human Genome Center, Institute of Medical Science, University of Tokyo, Shirokane-dai, Minato-ku, Tokyo, Japan
2 Computer Science and Engineering Centre, Fujitsu Ltd., Nakase, Mihama-ku, Chiba City, Chiba, Japan
3 Advanced Technology Department, Fermentation and Biotechnology Laboratories, AJINOMOTO CO., INC., 1-1 Suzuki-cho, Kawasaki-ku, Kawasaki-shi, Japan

Abstract

Three possible causes of the large genome size of the cyanobacterium Anabaena sp. PCC7120 are investigated: 1) sequential tandem duplications of gene segments, genes or genomic segments, 2) horizontal gene transfers from other organisms, and 3) whole-genome duplication. We evaluated the frequency distribution of angles between paralog locations for possibility 1), the fraction of genes deviating in GC content, GC skew, AT skew and codon adaptation index for possibility 2), and the gene-configuration comparison of paralogs for possibility 3). As a result, possibility 3), whole-genome duplication, was more reasonable than the other two as a molecular cause of the large genome size in Anabaena sp. PCC7120. In addition, whole-genome duplication was supported by an analysis of the distribution pattern of protein genes with respect to functional categories.

Keywords: genome-size increase, tandem gene duplication, horizontal gene transfer, whole-genome duplication, gene-location distance, Anabaena sp. PCC7120

1 Introduction

In the phylum Cyanobacteria, complete genomic sequences have been determined for eight organisms: Synechocystis sp. PCC6803 [12], Anabaena sp. PCC7120 [11], Thermosynechococcus elongatus BP-1 [20], Synechococcus sp. WH8102 [24], Prochlorococcus marinus SS120 [4], P. marinus MED4, P. marinus MIT9313 [26] and Gloeobacter violaceus PCC7421 [21]. Interestingly, only one of these cyanobacteria, Anabaena sp. PCC7120 (hereafter Anabaena), has a considerably large genome of approximately 6.4 megabases (Mb), while the remaining species have medium or small genomes ranging from approximately 1.7 Mb to 4.7 Mb. Size differences in bacterial genomes are well known between parasitic bacteria such as Mycoplasma and Buchnera and free-living bacteria [2, 16, 17]. The genomes of the former are almost all about 1 Mb or less, while those of the latter are on average larger. This marked difference has been ascribed to genome-size reduction in parasitic species, because a large number of genes involved in biological processes such as biosynthesis of nutrients, cell motility and DNA repair could have been lost from their genomes during the course of evolution [2, 16, 17].

In contrast, what causes a remarkable genome-size increase in free-living bacteria is a difficult question to answer. Since all complete genomic sequences of cyanobacterial species have been derived from free-living species and the genomes show a remarkable size variation, they provide a good opportunity for investigating the molecular cause of genome-size differences in free-living bacteria. In previous studies, three causes have been considered for the genome-size difference between free-living bacteria (Fig. 1): 1) sequential tandem duplications of gene segments, single genes or genomic segments composed of strings of several genes (e.g. operons) [7, 28], 2) horizontal gene transfers from other organisms [10, 22] and 3) whole-genome duplication [13, 25, 29]. In particular, there have been few studies of whole-genome duplication since the first determination of a complete genomic sequence, while the two former causes have been intensively investigated on a genomic scale. In this study, we investigate these three possible molecular causes of the large genome size in Anabaena.

Figure 1: Schematic demonstration of the three mechanisms for bacterial genome enlargement. Genes encoded on an original genome and those newly arisen by each mechanism are shown by closed and open boxes, respectively.

2 Materials and Methods

2.1 Genomic Data

The information about all proteins encoded on the genome of the cyanobacterium Anabaena [11] is obtained from the ftp site of the National Center for Biotechnology Information [31]. Although Anabaena has several plasmids in its cell, we analyze only the proteins encoded on the chromosome in this study. The size of the Anabaena chromosome is 6,413,771 base pairs, and the chromosome encodes 5,366 proteins.

2.2 Paralogs in the Anabaena Genome

To investigate the possibility of tandem gene duplications, we calculate the difference between the locations of the paralogs in each gene family. The difference is expected to be small if tandem duplications caused the large genome size in Anabaena. Paralog families are searched for with the program BLASTCLUST [1]. A protein is added to a paralog family by a single-linkage clustering algorithm when the e-value between the protein and a member of the paralog family meets the threshold of $10^{-10}$, and the pairwise alignment between them covers 60% of both amino acid sequences. Paralog pairs used for calculating the gene-location distance between the two half genomes of Anabaena (see below) are detected with the program BLASTP [1] as pairs satisfying the criteria of reciprocal best hit and an e-value cutoff.

2.3 Calculation of GC Content, GC Skew, AT Skew and Codon Adaptation Index (CAI)

To estimate whether a gene has been horizontally transferred or vertically inherited, the degree of deviation in nucleotide composition is investigated. For this purpose, four measurements frequently used for this estimation are calculated: GC content, GC skew, AT skew, and a bias in codon usage [3, 14, 15, 18]. The values of GC content, GC skew and AT skew are calculated for each gene as $(f_G + f_C)/(f_A + f_T + f_G + f_C)$, $(f_C - f_G)/(f_C + f_G)$ and $(f_A - f_T)/(f_A + f_T)$, respectively, where $f_N$ denotes the frequency of base N in the gene. The bias in codon usage is estimated by the CAI, which measures the degree of bias toward the subset of codons used by highly expressed genes in an organism [27]. Following the previous study in Synechocystis sp. PCC6803 (Table 1 in Mrázek et al. [19]), ribosomal proteins and proteins orthologous to the predicted highly expressed ones are selected as the reference set. The set in Anabaena is composed of 55 ribosomal proteins, 13 photosynthesis/respiration related proteins, 6 chaperones, 5 translation/transcription processing factors and 11 proteins involved in other functions.

2.4 Calculation of Gene-Location Distance (GLD)

To investigate the possibility of whole-genome duplication, we examine the gene configuration of paralogs between two half regions of the Anabaena genome. The procedure is shown schematically in Fig. 2. To estimate the similarity of the paralog configuration in two hypothetical half genomes, we calculate for each paralog pair the gene-location distance (GLD) [8, 9], which is derived from the correlation coefficient for circular data in directional statistics [6]. One of the remarkable features of the GLD is that fixing the gene pair with the shortest GLD among all GLDs for gene pairs between two compared genomes in the same direction realizes the most similar configuration of all related genes in the compared genomes. Using this feature, we find the shortest GLD between the two half genomes generated at each cutting angle.

Figure 2: Procedure for comparing gene locations between two half genomes of Anabaena. The actual Anabaena genome (solid circle A) is divided into two half regions, 1 and 2, by cutting it at angles $\theta^A$ and $\theta^A + 180^\circ$ (broken line through circle A). Then, two hypothetical genomes (broken circles A1 and A2) are constructed from these two half regions by setting the angle $\theta^A$ or $\theta^A + 180^\circ$ on the original genome A as the angle 0 on the A1 or A2 genome, respectively. Paralog pairs are searched for between these two hypothetical genomes, and gene-location distances (GLDs) are calculated for all of these paralog pairs by rotating the two hypothetical genomes. Among the GLDs, the shortest GLD is plotted against the cutting angle on the Anabaena genome. This operation is repeated for the location of every protein gene.

The GLD for a pair of gene i between circular genomes A and B is defined by the following equation:

$$ D_i^{g(A,B)} = 0.5 \left( 1 - \frac{\sum_{j,\, j \neq i}^{n} \sin(\theta_i^A - \theta_j^A)\,\sin(\theta_i^B - \theta_j^B)}{\sqrt{\sum_{j,\, j \neq i}^{n} \sin^2(\theta_i^A - \theta_j^A)\; \sum_{j,\, j \neq i}^{n} \sin^2(\theta_i^B - \theta_j^B)}} \right) \qquad (1) $$

where n is the total number of paralog pairs between the two compared genomes, and $\theta_i^A$ (or $\theta_j^A$) and $\theta_i^B$ (or $\theta_j^B$) denote the angles of the ith (or jth) paralog pair on A and B, respectively. The distances range from 0.0 to 1.0 depending on the degree of dissimilarity between the locations of the gene pair and the locations of the other gene pairs on the two genomes. Because the GLD is invariant to the choice of the position from which the angles are measured [9], in this study the angles of genes on a genome are defined by the 5' positions of the coding regions of genes in the GenBank-format file, regardless of the strand on which the genes are encoded.
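The GLD calculation can be illustrated with a short Python sketch. It assumes that Eq. (1) has the form 0.5 × (1 − r), where r is the sine-based circular correlation between the angle differences on the two genomes; this reading is consistent with the stated range of 0.0 to 1.0 but is an assumption, as are the function name `gene_location_distances`, the use of radians, and the handling of a zero denominator. It is not the authors' implementation.

```python
import math


def gene_location_distances(angles_a, angles_b):
    """Return one GLD per paralog pair, assuming Eq. (1) is 0.5 * (1 - r).

    angles_a[i] and angles_b[i] are the angles (in radians) of the two
    members of paralog pair i on genomes A and B (hypothetical inputs).
    """
    if len(angles_a) != len(angles_b):
        raise ValueError("each paralog pair needs one angle on A and one on B")
    n = len(angles_a)
    distances = []
    for i in range(n):
        numerator = 0.0
        sum_sq_a = 0.0
        sum_sq_b = 0.0
        for j in range(n):
            if j == i:
                continue  # the sums in Eq. (1) run over j != i
            sin_a = math.sin(angles_a[i] - angles_a[j])
            sin_b = math.sin(angles_b[i] - angles_b[j])
            numerator += sin_a * sin_b
            sum_sq_a += sin_a * sin_a
            sum_sq_b += sin_b * sin_b
        denominator = math.sqrt(sum_sq_a * sum_sq_b)
        # Assumption: treat an undefined correlation as 0 (GLD = 0.5).
        correlation = numerator / denominator if denominator > 0.0 else 0.0
        distances.append(0.5 * (1.0 - correlation))
    return distances
```

Under this reading, two genomes whose paralogs sit in the same relative configuration (one being a rotated copy of the other) give GLDs of 0, while a mirror-image configuration gives GLDs of 1.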

3 Results

3.1 Assessment of Tandem Gene Duplication

In this section, we investigate the possibility of tandem gene duplication by focusing on the angles between paralog pairs on the Anabaena genome. The numbers of paralog families and paralog pairs detected with BLASTCLUST [1] are 581 and 17,955, respectively. The angles are calculated for all paralog pairs in each paralog family. The frequency distribution of the angles between paralogs is shown in Fig. 3. As seen in the figure, the distribution is almost uniform. This result indicates that many paralogs are not clustered within a region but are distributed at various intervals on the Anabaena genome. If most Anabaena paralogs had been created by tandem gene duplications, the distribution in Fig. 3 would be expected to be skewed toward small angles. Although the number of paralog pairs in the range of 0°-10° is somewhat larger than in the other ranges, the fraction amounts to only 8.1%. Translocation of one of the two genes after a tandem duplication is a possible explanation for the near uniformity of the distribution. The rate of translocations, however, is not known in bacterial genomes, and thus we have no information on how frequently translocations occur in the Anabaena genome. In the present study, we therefore judge that it is difficult to explain the large size of the Anabaena genome by tandem gene duplications.

Figure 3: Frequency distribution of angles between the locations of paralog pairs on the Anabaena genome. The angles between paralog pairs are calculated for all combinations of paralogs in each paralog family.
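The angle distribution described in this section can be sketched as follows. The sketch converts gene positions on the 6,413,771-bp chromosome (Section 2.1) to angles, takes every pairwise combination within a paralog family, and bins the separations. Folding each separation into the range 0°-180° and using 18 bins of 10° are assumptions made here for illustration, as are all function names; under those assumptions a perfectly uniform distribution would place about 1/18 ≈ 5.6% of pairs in each bin.

```python
from itertools import combinations

GENOME_LENGTH = 6_413_771  # chromosome size in base pairs (Section 2.1)


def position_to_angle(position, genome_length=GENOME_LENGTH):
    """Map a base-pair position on the circular chromosome to degrees."""
    return 360.0 * position / genome_length


def angular_separation(angle_1, angle_2):
    """Smallest angle between two locations on a circle, in [0, 180]."""
    diff = abs(angle_1 - angle_2) % 360.0
    return min(diff, 360.0 - diff)


def separation_histogram(families, bin_width=10.0):
    """Count paralog-pair separations per bin over all families.

    families is a list of lists; each inner list holds the gene start
    positions (bp) of the members of one paralog family (hypothetical input).
    """
    n_bins = int(180.0 / bin_width)
    counts = [0] * n_bins
    for positions in families:
        angles = [position_to_angle(p) for p in positions]
        # All combinations of paralogs within the family, as in Fig. 3.
        for a1, a2 in combinations(angles, 2):
            sep = angular_separation(a1, a2)
            index = min(int(sep // bin_width), n_bins - 1)  # 180 goes in the last bin
            counts[index] += 1
    return counts
```

Dividing each count by the total number of pairs gives the fraction per bin, which is the quantity compared with the 8.1% reported for the smallest-angle range.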

3.2 Assessment of Horizontal Gene Transfer

In this section, the possibility of horizontal gene transfer is investigated on the basis of the values of GC content, GC skew, AT skew and CAI for each Anabaena gene. The values are calculated for all 5,366 proteins encoded on the Anabaena chromosome. The four measurements are plotted against the locations on the genome in Fig. 4. As can be seen in the figure, only small fractions of genes display atypical base compositions or a biased codon usage. Indeed, the numbers of genes with less than 5% chance probability in each distribution of Figs. 4(a)-(d) are 300 (5.6%), 303 (5.7%), 357 (6.7%) and 272 (5.1%), respectively. Furthermore, among the deviating genes, we list as candidates for recently transferred genes those whose homologs are detected by BLASTP, under an e-value criterion, only in bacteria other than cyanobacteria. The numbers of genes thus listed are only 37, 31, 18 and 28 in each distribution, respectively. The results indicate that most protein-encoding genes on the contemporary Anabaena genome may be native genes vertically descended from a direct ancestor of the Anabaena lineage. Therefore, it is difficult to explain the large size of the Anabaena genome by horizontal gene transfers.

Figure 4: Plots of values of (a) GC content, (b) GC skew, (c) AT skew and (d) codon adaptation index (CAI) for each protein gene against the angle on the Anabaena genome.

3.3 Assessment of Whole-Genome Duplication

In this section, the possibility of whole-genome duplication is investigated using the GLD. As described in Fig. 2, the Anabaena genome is divided into two half genomes by cutting the actual genome at the angle of each paralog, and the locations of paralogs are compared between the two halves with the GLD.

A plot of GLDs is shown in Fig. 5. The GLD plot shows a periodic pattern at 180° intervals, because the pair of hypothetical genomes yielded at a cutting angle $\theta^A$ is the same as that yielded at $\theta^A + 180^\circ$. In the figure, we found the shortest GLDs at the two cutting angles of 85° and 265°. In other words, the paralog locations in the two hypothetical genomes show the most similar configuration when the Anabaena genome is cut at angles around 85° and 265°. Furthermore, the shortest GLD is statistically significant (P < 0.05) in an extreme-value distribution of GLDs obtained by a simulation that randomizes paralog pairs. This indicates that the Anabaena genome may be composed of the two regions delimited by these cutting angles. In summary, the gene-configuration comparison indicates that whole-genome duplication is reasonable as a molecular cause of the large size of the Anabaena genome.

Figure 5: Plot of the shortest GLDs against the cutting angle on the Anabaena genome. The shortest GLDs were found at the cutting angles of all1273 (84.9°) and all3916 (265.0°). Each value is obtained when the two hypothetical genomes are rotated so that the angle of one paralog on one hypothetical genome agrees with that of its partner on the other hypothetical genome; the paralog pairs are all3016 (205.7°) - all0012 (0.6°) when cut at 84.9° and alr5317 (356.5°) - asr3019 (205.8°) when cut at 265.0°. The location (µ), scale (σ) and shape (ξ) parameters of the extreme-value distribution obtained by a simulation of randomizing paralog pairs are (µ, σ, ξ) = (0.457, 0.014, ).

4 Discussion and Conclusions

The results of our present analyses indicate that only a small fraction of the Anabaena proteins may have originated from tandem gene duplication or horizontal gene transfer. This implies that it is difficult to explain the large size of the Anabaena genome by causes operating on the local structure of gene segments, single genes or strings of several genes. In contrast, the comprehensive comparison of gene configuration with the GLD reveals that the contemporary Anabaena genome may consist of two half regions. This feature of the genome may be ascribed to whole-genome duplication, a cause operating globally on all genes encoded on the genome. These results convince us that in this genomic era it is necessary to study bacterial genomes from a more macroscopic view of the global structure of the genomes, as well as from the current view that deals with each gene at the level of amino acid and nucleotide sequences.

Based on the result of the gene-configuration comparison, we further examine the distribution pattern of genes on the two regions of the Anabaena genome delimited by the cutting angles, with respect to functional categories. The result is summarized in Table 1. The gene distributions on the two half genomes are significantly biased ($P < 10^{-6}$). A subsequent residual analysis reveals a remarkable feature of the functional categories. Indeed, 73% of the proteins in the functional category Photosynthesis and respiration and 63% of those in Translation are encoded on one of the two regions. On the other hand, the functional categories whose distributions are biased toward the other region are Central intermediary metabolism and Transport and binding proteins. Interestingly, most proteins in the former categories have housekeeping functions that are essential to cyanobacteria, while many proteins in the latter categories have functions that are needed in particular environments, such as proteins involved in nitrogen fixation and metabolism under nitrogen deprivation [5]. The distribution patterns of functional categories may reflect a process through which the Anabaena genome has acquired protein genes with novel functions.
Table 1: Distribution pattern of protein genes on the Anabaena genome with respect to functional categories.

Columns: No.; Functional category (a); Number of protein genes; P value (b)

Amino acid biosynthesis
Biosynthesis of cofactors, prosthetic groups, and carriers
Cell envelope
Cellular processes
Central intermediary metabolism (b)
Energy metabolism
Fatty acid, phospholipid and sterol metabolism
Photosynthesis and respiration (b)
Purines, pyrimidines, nucleosides, and nucleotides
Regulatory functions
DNA replication, recombination, and repair
Transcription
Translation (b)
Transport and binding proteins (b)
Other categories

The number of protein genes included in each functional category is counted in the two regions of the Anabaena genome. The distribution of functional categories is significantly dependent on the region by the chi-squared test of independence ($\chi^2 = 58.4$, $P < 10^{-6}$).
a Classification of functional categories and assignment of each protein to the categories follow those in Kaneko et al. [11].
b Functional categories showing a biased distribution with statistical significance in the residual analysis are marked with b. P values are based on $N(0, 1^2)$.
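The test behind Table 1 can be illustrated with a small Python sketch of a chi-squared test of independence followed by a residual analysis. The source does not say which residual the authors used; the sketch assumes adjusted standardized residuals, which are approximately N(0, 1) under independence and so match the note that P values are based on a standard normal distribution. The function names are hypothetical, and any counts passed in would be illustrative, not the values of Table 1.

```python
import math


def chi_squared_independence(table):
    """Chi-squared statistic, degrees of freedom and adjusted residuals.

    table is a list of rows, one per functional category, each holding
    the gene counts in the two genome regions (hypothetical input).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    residuals = []
    for r, row in enumerate(table):
        residual_row = []
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / total
            chi2 += (observed - expected) ** 2 / expected
            # Assumption: adjusted standardized residual, ~N(0, 1) under independence.
            variance = (expected
                        * (1.0 - row_totals[r] / total)
                        * (1.0 - col_totals[c] / total))
            residual_row.append((observed - expected) / math.sqrt(variance))
        residuals.append(residual_row)
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, dof, residuals


def two_sided_normal_p(z):
    """Two-sided P value of a residual under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))
```

A category would then be flagged as biased when `two_sided_normal_p` of its residual falls below the chosen significance level; with 15 categories and 2 regions the overall test has 14 degrees of freedom.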

As proposed by Ohno [23], whole-genome duplication can be a quick and easy way for an organism to amplify its gene repertoire with respect to protein functions. In this sense, the gene-configuration similarity and the biased gene distribution between the two hypothetical half genomes support the possibility of whole-genome duplication in the Anabaena genome. Ancient whole-genome duplications (paleopolyploidy) have already been proposed in some lineages of eukaryotes [30]. Such an event might also have occurred in bacterial genomes and helped bacterial species expand their capacity to adapt to various environments on the earth through amplification of the gene repertoire.

Acknowledgments

One of the authors (K. H.) was partly supported by a Grant-in-Aid for Scientific Research on Priority Areas "Genome Information Science" (grant ) and for Scientific Research (B) (grant ), from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

References

[1] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, Z., Miller, W., and Lipman, D.J., Gapped BLAST and PSI-BLAST: A new generation of protein database search programs, Nucleic Acids Res., 25.
[2] Andersson, S.G.E. and Kurland, C.G., Reductive evolution of resident genomes, Trends Microbiol., 6.
[3] Carbone, A., Zinovyev, A., and Képès, F., Codon adaptation index as a measure of dominating codon bias, Bioinformatics, 19.
[4] Dufresne, A., Salanoubat, M., Partensky, F., et al., Genome sequence of the cyanobacterium Prochlorococcus marinus SS120, a nearly minimal oxyphototrophic genome, Proc. Natl. Acad. Sci. USA, 100.
[5] Ehira, S., Ohmori, M., and Sato, N., Genome-wide expression analysis of the responses to nitrogen deprivation in the heterocyst-forming cyanobacterium Anabaena sp. strain PCC7120, DNA Res., 10:97-113.
[6] Fisher, N.I. and Lee, A.J., A correlation coefficient for circular data, Biometrika, 70.
[7] Gu, Z., Cavalcanti, A., Chen, F.-C., Bouman, P., and Li, W.-H., Extent of gene duplication in the genomes of Drosophila, nematode, and yeast, Mol. Biol. Evol., 19.
[8] Horimoto, K., Fukuchi, S., and Mori, K., Comprehensive comparison between locations of orthologous genes on archaeal and bacterial genomes, Bioinformatics, 17.
[9] Horimoto, K., Suyama, M., Toh, H., Mori, K., and Otsuka, J., A method for comparing circular genomes from gene locations: application to mitochondrial genomes, Bioinformatics, 14.
[10] Jain, R., Rivera, M.C., Moore, J.E., and Lake, J.A., Horizontal gene transfer accelerates genome innovation and evolution, Mol. Biol. Evol., 20.
[11] Kaneko, T., Nakamura, Y., Wolk, C.P., et al., Complete genomic sequence of the filamentous nitrogen-fixing cyanobacterium Anabaena sp. strain PCC7120, DNA Res., 8, 2001.
[12] Kaneko, T., Sato, S., Kotani, H., et al., Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions, DNA Res., 3.
[13] Kunisawa, T. and Otsuka, J., Periodic distribution of homologous genes or gene segments on the Escherichia coli K12 genome, Protein Seq. Data Anal., 1.
[14] Lawrence, J.G. and Ochman, H., Amelioration of bacterial genomes: Rates of change and exchange, J. Mol. Evol., 44.
[15] Lawrence, J.G. and Ochman, H., Molecular archaeology of the Escherichia coli genome, Proc. Natl. Acad. Sci. USA, 95.
[16] Maniloff, J., The minimal cell genome: On being the right size, Proc. Natl. Acad. Sci. USA, 93.
[17] Mira, A., Ochman, H., and Moran, N.A., Deletional bias and the evolution of bacterial genomes, Trends Genet., 17.
[18] Moszer, I., Rocha, E.P.C., and Danchin, A., Codon usage and lateral gene transfer in Bacillus subtilis, Curr. Opin. Microbiol., 2.
[19] Mrázek, J., Bhaya, D., Grossman, A.R., and Karlin, S., Highly expressed and alien genes of the Synechocystis genome, Nucleic Acids Res., 29.
[20] Nakamura, Y., Kaneko, T., Sato, S., et al., Complete genome structure of the thermophilic cyanobacterium Thermosynechococcus elongatus BP-1, DNA Res., 9.
[21] Nakamura, Y., Kaneko, T., Sato, S., et al., Complete genome structure of Gloeobacter violaceus PCC7421, a cyanobacterium that lacks thylakoids, DNA Res., 10.
[22] Ochman, H., Lawrence, J.G., and Groisman, E.A., Lateral gene transfer and the nature of bacterial innovation, Nature, 18.
[23] Ohno, S., Evolution by gene duplication, Springer-Verlag, New York.
[24] Palenik, B., Brahamsha, B., Larimer, F.W., et al., The genome of a motile marine Synechococcus, Nature, 424.
[25] Riley, M. and Anilionis, A., Evolution of the bacterial genome, Annu. Rev. Microbiol., 32.
[26] Rocap, G., Larimer, F.W., Lamerdin, J., et al., Genome divergence in two Prochlorococcus ecotypes reflects oceanic niche differentiation, Nature, 424.
[27] Sharp, P.M. and Li, W.-H., The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications, Nucleic Acids Res., 15.
[28] Snel, B., Bork, P., and Huynen, M.A., Genomes in flux: The evolution of archaeal and proteobacterial gene content, Genome Res., 12:17-25.
[29] Wallace, D.C. and Morowitz, H.J., Genome size and evolution, Chromosoma, 40.
[30] Wolfe, K.H., Yesterday's polyploids and the mystery of diploidization, Nature Rev. Genet., 2.
[31] ftp://ftp.ncbi.nih.gov/genbank/genomes/bacteria/nostoc_sp/
