Mass Identification of Chloroplast Proteins of Endosymbiont Origin by Phylogenetic Profiling Based on Organism-Optimized Homologous Protein Groups

Genome Informatics 16(2): 56-68 (2005)

Naoki Sato (1), naokisat@bio.c.u-tokyo.ac.jp
Masayuki Ishikawa (1), Ishimasa@bio.c.u-tokyo.ac.jp
Makoto Fujiwara (1), mtf1@bio.c.u-tokyo.ac.jp
Kintake Sonoike (2), sonoike@k.u-tokyo.ac.jp

(1) Department of Life Sciences, Graduate School of Arts and Sciences, University of Tokyo, 3-8-1 Komaba, Meguro-ku, Tokyo 153-8902, Japan
(2) Department of Integrated Biosciences, Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-shi, Chiba, 277-8562, Japan

Abstract

Chloroplasts originated from an ancient cyanobacterium-like endosymbiont. Several dozen chloroplast proteins are encoded by the chloroplast genome, while hundreds more are encoded by the nuclear genome in plants and algae, but the exact number and identity of nuclear-encoded chloroplast proteins are still unknown. We describe here attempts to identify a large number of unidentified chloroplast proteins of endosymbiont origin (CPRENDOs). Our strategy consists of whole-genome protein clustering by the homolog group method, which is optimized for organism number, and phylogenetic profiling that extracts groups conserved in cyanobacteria and photosynthetic eukaryotes. An initial minimal set of CPRENDOs was predicted without targeting prediction and experimentally validated.

Keywords: genomic clustering, homolog group, CPRENDO, Gclust, endosymbiosis

1 Introduction

The chloroplast is a photosynthetic organelle within plant and algal cells. In flowering plants it also occurs as a chromoplast, amyloplast, elaioplast, or leucoplast, depending on the cell type. The general term for all these chloroplast-related organelles is plastid. The plastid is also involved in various metabolic pathways, such as the biosynthesis of fatty acids, isoprenoids, tetrapyrroles, amino acids, and some plant hormones.
It is also the sole site of nitrogen and sulfur assimilation in plant cells. Plants (and algae) acquired chloroplasts by endosymbiosis, which occurred 1.6 Ga (billion years ago) [22]. The endosymbiont was closely related to present-day cyanobacteria [4], but it is still not clear which cyanobacterium is most closely related to the chloroplast ancestor. This endosymbiosis theory is supported by the fact that the genes encoded in chloroplast genomes are phylogenetically most closely related to their orthologs in cyanobacteria. Indeed, the endosymbiosis involved a massive transfer of genes from cyanobacteria to photosynthetic eukaryotes, and is a good target for comparative genomic studies. In algae and plants, many chloroplast proteins are encoded by the nuclear genome, and many of them are thought to have been transferred from the ancient endosymbiont. Chloroplasts also use proteins of eukaryotic origin. Therefore, the chloroplast proteome is a chimera of proteins originating from both the endosymbiont and the eukaryotic host [1, 13, 17]. However, photosynthesis-related proteins and the enzymes involved in chloroplast biogenesis (transcription and translation) are mostly of endosymbiont origin. Based on this consideration, we tried to estimate the list of chloroplast proteins that were acquired by the

endosymbiotic event. This is a good (perhaps the best among similar examples) challenge for comparative genomics [3, 5]. We present here a generally applicable method of phylogenetic profiling, which focuses on unidentified proteins that are conserved in a group of organisms sharing a common physiological property or pathway. After the initial presentation at GIW three years ago [18], we made both computational and experimental efforts [19, 20]. On the computational side, the Gclust software was revised as described below, and a clique mode was implemented. In addition, the use of an intermediate file made rapid analysis with different parameters possible. On the experimental side, a minimal set of CPRENDOs estimated using the old version of Gclust was analyzed. In the present communication, we present a revised prediction of CPRENDOs based on the current version of Gclust together with the results of experimental verification of the minimal set of CPRENDOs, and discuss the effectiveness of phylogenetic profiling in comparative genomics.

2 Method

2.1 Major Features of the Methodology

The following points are emphasized in the present study:

(1) Use of homolog groups but NOT ortholog groups (based on bidirectional best hit)

A commonly used method for phylogenetic clustering relies on ortholog groups. Two genes (or proteins) are defined as orthologs if they originate from an identical ancestral gene. However, in computational biology, orthologs are operationally defined by a bidirectional best-hit relationship inferred by BLAST or SSEARCH analysis. In practice, every genome contains several paralogs or highly related genes, such as those encoding protein families, and it is not always easy or practical to identify the correct orthologs (as originally defined) without phylogenetic analysis.
We have been using the homolog group method [7], in which all homologous proteins are included in each cluster. This method allows detailed phylogenetic analysis of the homolog group to identify true orthologs.

(2) Use of both E-value and homologous regions of BLASTP output for clustering

Many clustering practices use BLASTP or SSEARCH data for hierarchical clustering with a single criterion such as E-value. Using such a simple criterion in the BLASTP-based homolog group method produces large aggregates of various proteins bridged by multidomain proteins [18, 19]. To avoid this, we use homologous region information to infer both an overlap score and the domain structure. The overlap score for two proteins is defined as the sum of the total overlap regions in both proteins divided by the total length of both proteins, namely, (a1 + a2 + b1 + b2)/(length1 + length2) in the example shown in Figure 1. Constraints on E-value, overlap score, and domain structure are used in clustering to infer truly homologous proteins, excluding functionally different proteins that share one or several domains.

Figure 1: An example showing the calculation of the overlap score. Homologous regions are indicated.
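The overlap score defined in point (2) can be illustrated with a short Python sketch. This is not Gclust code: the function names, the representation of homologous regions as 1-based inclusive (start, end) intervals, and the merging of overlapping intervals so that no residue is counted twice are all assumptions made for illustration.

```python
def merged_length(regions):
    """Total number of residues covered by (start, end) intervals.

    Intervals are 1-based and inclusive. Overlapping intervals are merged
    first (an assumption of this sketch) so no residue is counted twice.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(regions):
        if current_end is None or start > current_end:
            # Close the previous interval before opening a new one.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlaps the open interval: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


def overlap_score(regions1, regions2, length1, length2):
    """(covered residues in protein 1 + protein 2) / (length1 + length2)."""
    covered = merged_length(regions1) + merged_length(regions2)
    return covered / (length1 + length2)


# Hypothetical pair: two 50-residue regions in each protein.
score = overlap_score([(1, 50), (101, 150)], [(11, 60), (201, 250)], 200, 300)
```

With these invented coordinates, 100 residues are covered in each protein, so the score is 200/500. A multidomain protein that shares only one short domain with a much longer partner would score far lower, which is the situation the overlap constraint is meant to filter out.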

(3) Organism-optimized clustering

If homolog groups are constructed solely on the basis of sequence similarity, the clusters are not always suitable for phylogenetic profiling. A cluster may contain many proteins of the same family, or a single protein family may be split into several clusters according to phylogenetic position. This problem is partially solved during the initial cluster formation using the 2D table (see below) and at the last stage of clustering.

(4) Experimental validation of computational estimation

We believe that any bioinformatics inference should be experimentally validated. In many informatics studies, logical consistency is the sole criterion for evaluating a computational estimate, but biologically meaningful results are what matter most in bioinformatics. Inference of chloroplast proteins of endosymbiont origin (CPRENDOs) may be one of the best applications of phylogenetic profiling that can be experimentally verified.

2.2 Preparation for Clustering

All proteins in the selected genomes were clustered by the homolog group method. To this end, one of the authors (NS) developed a program called Gclust, which reads all-against-all BLASTP results and outputs a list of homologous protein groups (homolog groups). The software is written in C and runs on any common UNIX machine with enough memory. The overall flow of data processing is shown in Figure 2. A typical source of genomic protein data is a GenBank flat file. Each gbk file was processed to produce a FASTA file and an annotation file. These data from the various genomes were assembled into a single FASTA file and an annotation table. For eukaryotic organisms, nuclear as well as organellar (mitochondrial and chloroplast, if present) genomes were used. The two files were processed to give another FASTA file (**.gfa) and an annotation table (**.g.table). In the **.gfa file, all protein names were converted to numbers to save disk space during the BLASTP search.
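The renaming of proteins to numbers can be sketched in a few lines of Python. The actual layouts of the **.gfa and **.g.table files are not given in the text, so the function below is only an illustration under assumed formats: it takes FASTA text, replaces each header with a running integer, and returns a dictionary that plays the role of the lookup table. The protein names in the example are hypothetical.

```python
def number_fasta(fasta_text):
    """Replace FASTA headers with running integers.

    Returns (numbered_fasta_text, table), where table maps each integer
    back to the original header line (without the leading '>').
    """
    table = {}
    out_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            number = len(table) + 1
            table[number] = line[1:].strip()
            out_lines.append(f">{number}")
        elif line.strip():
            # Sequence lines are passed through unchanged.
            out_lines.append(line.strip())
    return "\n".join(out_lines) + "\n", table


# Hypothetical input with two proteins.
example = ">Ath_At1g01010 NAC protein\nMEDQ\nVGFG\n>Cme_CMA001C\nMKT\n"
numbered, table = number_fasta(example)
original_name = table[2]  # look a number up again after the search
```

Short integer identifiers keep the all-against-all BLASTP output compact, and the table makes the step reversible.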
The numbers can be converted back to the original protein names by referencing the **.g.table. Next, an all-against-all BLASTP search (versions 2.1.2-2.2.12) [2] was done using the FASTA file (**.gfa) as input. The output was piped directly into bl2ls3.pl to produce a list of homology regions and E-values, using an E-value threshold of 1e-3. The resulting file was then used as input to the Gclust software. The BLASTP step is the most time-consuming and was run as multiple jobs with split files on several servers. All sequence file manipulation, such as format conversion and file splitting, was done with the SISEQ software (version 1.30) [16].

2.3 Organism-Optimized Clustering with Gclust Software

Gclust software [18] version 3.5.2 [23] was run in the clique mode. The BLASTP results were processed in two steps (Figure 2). In the first step, the data were read, partially transformed into an intermediate data format, and saved in a large file, data.out, for further analysis with various parameter settings. Low-homology data were removed using a length-dependent E-value threshold that is relaxed for short sequences (from 1e-6 for >100 aa to 1e-3 for <40 aa). All single-path relations were picked up from the homology data. The domain composition of each protein was also estimated using the homology regions with different subject proteins. At this stage, multi-domain proteins as well as very large proteins (>2,000 aa, for example) were marked with a flag. In the second step, Gclust reads the data.out file and performs clustering using the -clique option, which produces a good clustering result in a relatively short time (within one day for a dataset containing 141 organisms). In the clique mode, the homology data were converted to a structure called match, which held the data of binary (i.e., protein-to-protein) similarity, namely, E-value, overlap score, and domain composition estimated as above.
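The length-dependent E-value filter applied in the first step can be sketched as follows. The text gives only the two end points (1e-6 for sequences longer than 100 aa and 1e-3 for sequences shorter than 40 aa); how Gclust treats lengths in between is not stated, so the log-linear interpolation used here is purely an assumption, as are the function names.

```python
import math


def evalue_cutoff(length, short_len=40, long_len=100,
                  short_cut=1e-3, long_cut=1e-6):
    """E-value cutoff as a function of sequence length (aa).

    End points follow the text; the log-linear interpolation between
    them is an assumption of this sketch, not documented behaviour.
    """
    if length <= short_len:
        return short_cut
    if length >= long_len:
        return long_cut
    fraction = (length - short_len) / (long_len - short_len)
    log_short = math.log10(short_cut)
    log_long = math.log10(long_cut)
    return 10 ** (log_short + fraction * (log_long - log_short))


def passes_filter(evalue, length):
    """Keep a hit only if its E-value is at or below the cutoff."""
    return evalue <= evalue_cutoff(length)


# A hit with E-value 1e-5 is kept for a 30-aa protein
# but rejected for a 150-aa protein.
kept_short = passes_filter(1e-5, 30)
kept_long = passes_filter(1e-5, 150)
```

The point of such a scheme is that short proteins cannot reach very low E-values even when they are genuinely homologous, so a single strict cutoff would discard them systematically.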
Normally, clique mode uses a list of organisms provided by the org list file. For each protein, all match data were tabulated in 2D, using E-value and overlap score (Figure 4A). The 2D table lists distribution of match array data using a pre-defined

category scheme, which can be customized using a configuration file called var list. Match data were selected one by one, starting from the initially selected best local maximum. The search scanned in a circular or diamond pattern around the starting point. Scanning toward lower overlap score and higher E-value stopped if the number of members increased again, which is a sign of another group of homologs with lower similarity. This operation was done on a shadow table (Figure 4B), in which non-negative values indicate the selected area and the increasing numbers indicate the path of the search. By applying these criteria, among others, a clearly defined cluster of match data with respect to E-value and overlap score was selected (boxed area). In addition, match data were selected to include as many organisms as possible without picking up very low similarity data (the output below Figure 4A and B). After this purification of match data, a list of homologs was made for each protein. The threshold E-value and overlap score were also stored. Then, homolog clusters were formed by merging individual lists. At this stage, clusters with very different threshold E-values were not merged. After repeated merging and removal, orphan entries generated by the removal step were incorporated again into the most appropriate cluster. Clusters were again optimized for the number of organisms. Homolog groups were sorted according to the number of entries. Finally, homolog groups were printed out to a large file as a concatenated similarity matrix (Figure 5A). The matrix may be expressed as 1/0 (similar/dissimilar), E-value, and/or overlap score, depending on the output options 1, r, and/or s, respectively.

3 Results

3.1 Prediction of CPRENDOs by Phylogenetic Profiling

Using a Perl script, homologtableg3b.pl, the homology matrix was transformed into a table showing the members of each homolog group (Figure 3 and 5B).
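The kind of table produced at this step, with the number of members per organism for each homolog group, can be sketched in Python. This is not a port of homologtableg3b.pl. It assumes, for illustration only, that each protein name is an organism prefix followed by an underscore and a gene identifier; the group contents and the prefixes other than Ath and Cme are hypothetical.

```python
from collections import Counter


def member_counts(groups, organisms):
    """Count members of each homolog group per organism.

    groups: dict mapping group id -> list of protein names of the
            assumed form '<organism>_<gene id>'.
    organisms: organism prefixes to report, in column order.
    """
    table = {}
    for group_id, members in groups.items():
        counts = Counter(name.split("_", 1)[0] for name in members)
        table[group_id] = {org: counts.get(org, 0) for org in organisms}
    return table


# Hypothetical groups; 'Syn' and 'Eco' are made-up prefixes.
groups = {
    1: ["Ath_a", "Ath_b", "Cme_x", "Syn_1"],
    2: ["Eco_1"],
}
counts = member_counts(groups, ["Ath", "Cme", "Syn", "Eco"])
```

Because nuclear and organellar proteins of one species share a prefix in this sketch, counting is done per organism rather than per genome.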
This table was used to extract homolog groups that are shared by various combinations of organisms (phylogenetic profiling). Note that proteins encoded by both organellar and nuclear genomes were included in the data set of eukaryotic organisms. Therefore, we selected organisms rather than genomes in the phylogenetic profiling. For the prediction of CPRENDOs, we used a data set, CZ16Y, containing all predicted proteins in nine species of cyanobacteria, Arabidopsis thaliana (plant) [21], Cyanidioschyzon merolae (red alga) [14], three species of photosynthetic bacteria, two species of bacteria, and two eukaryotes. All data were taken from the GenBank data repository, except for those of Cyanidioschyzon, which were obtained from the Cyanidioschyzon Genome Project [24]. Cyanidioschyzon is a representative of the red lineage of photosynthetic eukaryotes, and we expected that using both a plant (green lineage) and a red alga would increase the accuracy of phylogenetic profiling. The homolog groups shared by cyanobacteria, Arabidopsis and Cyanidioschyzon were selected (Table 1). At this step, various constraints were tested in the selection. Conservation in cyanobacteria was one constraint, and allowance for presence in other organisms was another. The first constraint could be complete conservation in all cyanobacteria (nine species), but many homologs of chloroplast proteins are not completely conserved in all cyanobacteria. A phylogenetic analysis suggests that plastids are sister to the Anabaena-Synechocystis clade (Sato, unpublished results). Therefore, Anabaena [11] and Synechocystis [12] could be used as representatives of cyanobacteria. But we found that not all chloroplast proteins are conserved in both of these cyanobacteria. We finally adopted a strategy in which any protein conserved in a certain number of cyanobacteria was selected, irrespective of the combination of cyanobacteria.
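The two constraints described above (conservation in a minimum number of cyanobacteria, in any combination, and a limited allowance for presence in other organisms) can be expressed as a simple filter over a per-organism count table. The following Python sketch is illustrative only: the function name, parameter names, table layout, and the toy organism codes other than Ath and Cme are assumptions, not part of Gclust or its scripts.

```python
def select_groups(count_table, cyanobacteria, required, min_cyano,
                  others=(), max_others=0):
    """Return ids of groups matching a phylogenetic profile.

    count_table: dict group id -> dict organism -> number of members.
    cyanobacteria: organisms of which at least min_cyano must be present.
    required: organisms that must all be present (e.g. Ath and Cme).
    others: organisms whose presence is tolerated in at most max_others.
    """
    selected = []
    for group_id, counts in count_table.items():
        n_cyano = sum(1 for org in cyanobacteria if counts.get(org, 0) > 0)
        has_required = all(counts.get(org, 0) > 0 for org in required)
        n_others = sum(1 for org in others if counts.get(org, 0) > 0)
        if n_cyano >= min_cyano and has_required and n_others <= max_others:
            selected.append(group_id)
    return selected


# Toy table with three hypothetical cyanobacteria (C1-C3) and one
# non-photosynthetic organism (Eco).
toy = {
    10: {"C1": 1, "C2": 1, "C3": 0, "Ath": 2, "Cme": 1, "Eco": 0},
    11: {"C1": 1, "C2": 0, "C3": 0, "Ath": 1, "Cme": 1, "Eco": 0},
    12: {"C1": 1, "C2": 1, "C3": 1, "Ath": 1, "Cme": 1, "Eco": 1},
}
strict = select_groups(toy, ["C1", "C2", "C3"], ["Ath", "Cme"], 2,
                       others=["Eco"], max_others=0)
relaxed = select_groups(toy, ["C1", "C2", "C3"], ["Ath", "Cme"], 2,
                        others=["Eco"], max_others=1)
```

Varying max_others while holding the other arguments fixed mirrors the way the selections in Table 1 differ from one another.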
The number of cyanobacteria was also a variable, but five species (out of nine) gave satisfactory results (Figure 6 and Table 1). Allowance for presence in other organisms was also tested. Table 1 compares the effects of allowance in photosynthetic bacteria and non-photosynthetic organisms. Photosynthetic bacteria perform photosynthesis without oxygen evolution, with a single photosystem, using machinery that is distantly related to that of cyanobacteria and plants. Therefore, the inclusion of photosynthetic bacteria could affect phylogenetic profiling of CPRENDOs. In addition, paralogs of some photosynthesis-related proteins (ATP synthase, ribosomal proteins, and even a RuBisCO subunit) are present in

Figure 2: Flow chart of data processing until formation of homology matrix.

Figure 3: Flow chart of data processing for further analysis towards phylogenetic analysis.

Figure 4: Selection of match data in the clique mode. A. 2D table (rows, overlap score; lines, E-value) showing distribution of match data. A best local maximum is selected first (circle). Other local maxima with lower similarity are indicated by dotted circles. B. A shadow table for working. Zero is the start of search. Non-negative values show selected groups.

Figure 5: Example output of Gclust (A) and tabular summary of homologs as generated by further processing (B). In (A), protein name (combination of genome name and gene identifier), number of amino acid residues, similarity matrix, and annotation in the original database are listed from left to right. The similarity matrix is a square matrix, having an identical set of proteins in both vertical and horizontal directions. Each protein belongs to a single group. The similarity detected by BLASTP but not incorporated into the clustering is listed below the main matrix as Related groups. Each line in Related groups consists of group number (number of members in parentheses) and protein name. In (B), all homolog groups are listed with the number of members in each genome. The annotation is taken from the first member.

Figure 6: A Venn diagram showing homolog groups shared by the three organism categories, Arabidopsis thaliana (green plant), Cyanidioschyzon merolae (red alga) and 5 cyanobacteria. Here, 5 Cyanos indicates >=5 of 9 cyanobacteria analyzed. This result was obtained with the selection method G shown in Table 1. In this diagram, each area is drawn proportional to the number of groups using a Tcl/Tk program called TriGraph (Sato, unpublished).

Table 1: Number of homolog groups selected with different criteria. The numbers of homolog groups that are conserved in at least 5 of 9 cyanobacteria, Arabidopsis (Ath), and Cyanidioschyzon (Cme) are listed with varying additional conservation in photosynthetic bacteria (PhotoBact) and other organisms (Others). Others include C. elegans, S. cerevisiae, E. coli and B. subtilis. The numbers of homolog groups consisting of known chloroplast proteins or unknown proteins are listed. Each number in parentheses indicates the proportion of groups. Finally, the numbers of members in Ath and Cme belonging to the selected groups are listed.

Selection  PhotoBact  Others  # of Groups  Known cp proteins  Unknowns    # of Ath proteins  # of Cme proteins
A          0          0        84           37 (0.44)          44 (0.52)   122                103
D          0-3        0       112           51 (0.46)          55 (0.49)   185                142
E          0-3        0-1     150           66 (0.44)          72 (0.48)   293                196
F          0-3        0-2     308          148 (0.48)          97 (0.31)   706                438
G          0-3        0-3     443          218 (0.49)         127 (0.29)  1192                676

non-photosynthetic organisms. These facts may complicate the profiling. However, the results in Table 1 show that the allowance for other organisms has little effect on the proportion of clusters containing known chloroplast proteins, which was 0.44-0.49 across the different selections. In contrast, the proportion of clusters containing unknown proteins decreases with increasing allowance for other organisms. These results suggest that the allowance for other organisms may be as wide as possible. Conservation in cyanobacteria, Arabidopsis and Cyanidioschyzon may, therefore, be a reliable criterion for selecting CPRENDOs. A Venn diagram (Figure 6) shows that smaller numbers of homolog groups are shared only by cyanobacteria and Arabidopsis, or only by cyanobacteria and Cyanidioschyzon. These groups could include proteins conserved in only the green or the red lineage, and may be studied as CPRENDO-like proteins. The groups shared only by Arabidopsis and Cyanidioschyzon represent eukaryotic proteins and are not candidates for CPRENDOs.

3.2 Minimal Set of CPRENDOs

To verify the predicted CPRENDOs experimentally, we planned to analyze the plastid localization of the predicted CPRENDOs (see 3.3.1). The data used for the experimental study described below were predicted two years ago [18] using an older version of Gclust (version 2.1.2) and an older database (dataset CZ16). At that time, homolog groups were constructed simply using various E-values, and the homolog groups that were conserved in eight cyanobacteria, a red alga and a green plant but not in other non-photosynthetic organisms were selected at each cutoff E-value. The selected groups were combined and used as a minimal set of CPRENDOs (Table 3). In total, 51 homolog groups were selected. Among them, 19 were clusters of known chloroplast proteins, such as Psa and Psb proteins.
The remaining 32 groups were selected as targets of the initial experimental study. These homolog groups were generally included in selection A (Table 1) of the current CZ16Y dataset, with minor inconsistencies.

3.3 Experimental Verification of the Minimal Set of CPRENDOs

We performed experimental verification of the minimal set of CPRENDOs to test our idea that phylogenetic profiling is useful in predicting CPRENDOs, since we believe that all informatic predictions should be experimentally verified. The experimental verification consists of the following four analyses: localization of proteins, light-regulated expression, phenotypes of cyanobacterial disruptants, and phenotypes of plant tag-lines.

3.3.1 Localization of Predicted CPRENDOs in A. thaliana

Localization of the predicted CPRENDOs was analyzed using green fluorescent protein (GFP) fusion constructs. Each construct was prepared by successive PCR, and either linear DNA or plasmids were transiently introduced into onion epidermis by particle bombardment. The localization of each GFP-fusion protein was analyzed by fluorescence microscopy on the next day. The results (Table 2) showed that 49 out of 52 proteins were targeted to plastids. Interestingly, five of these proteins were also targeted to mitochondria. Such dual targeting is common in plant organellar proteins [10]. It should be noted that the localization predicted by TargetP [6] (not shown) was generally in good agreement, although six proteins were not correctly predicted to be targeted to chloroplasts.

3.3.2 Light-Dependent Expression of the Genes for CPRENDOs in A. thaliana

Expression of the predicted CPRENDOs was analyzed by RNA-blot analysis using 7-day-old seedlings, and the results are also shown in Table 2. As many as 36 genes encoding predicted CPRENDOs showed light-dependent expression, which is expected for proteins involved in photosynthesis or chloroplast biogenesis. Nine genes were constitutively expressed, while expression of seven genes was

below the detection limit of the method employed. Cross-examination of localization and expression indicates an abundance (31 proteins) of light-regulated chloroplast proteins.

Table 2: Summary of localization and expression of predicted minimal set of CPRENDOs. Cp, chloroplasts (plastids); Mt, mitochondria; Cyto, cytoplasm; nuc, nucleus. L > D, expression in the light was higher than that in the dark; L = D, expression was comparable in the light and the dark; No exp, no expression was detected by RNA-blot analysis.

                        Expression
Localization   Total   L > D   L = D   No exp.
Cp               44      31      8       5
Cp & Mt           5       3      0       2
Mt                1       1      0       0
Cyto & nuc        2       1      1       0
Total            52      36      9       7

3.3.3 Analysis of Synechocystis Disruptants

The genes for the cyanobacterial homologs of predicted CPRENDOs were disrupted in Synechocystis sp. PCC 6803. For this purpose, a rapid method for preparing disruption constructs was developed using repeated PCR. Among the 41 genes, 33 were disrupted completely, while five were not completely segregated and might represent essential genes. Three constructs were not successfully made, due to technical problems in PCR ("PCR problem" in Table 3). Fluorescence induction kinetics was measured as an indicator of photosynthetic performance (Table 3). Growth defects were also noted for some disruptants. In 22 disruptants, defects in growth or fluorescence kinetics were noted. These results suggest that the selected genes are important for normal growth in cyanobacteria.

3.3.4 Analysis of A. thaliana Mutant Lines

Mutants of the predicted CPRENDOs were analyzed using the SALK T-DNA tag-lines [25]. The analysis is still in progress, but we obtained homozygous lines for 25 CPRENDOs. During our experiments in the past two years, reports were published on four of the CPRENDOs, namely, Tab2 (in Chlamydomonas), Psb29/Thf1, APE1 and HY2. These are not components of photosynthetic machinery except Psb29, but are involved in its biogenesis.
This supports the validity of our strategy, and many of the remaining CPRENDOs are also likely to be important in the biogenesis of photosynthetic machinery. However, only two of the Arabidopsis mutant lines showed visible phenotypes, such as variegation. The CPRENDO gene in one of them has already been annotated as ycf65, a hypothetical chloroplast open reading frame, because it is encoded in the chloroplast genome in some algae such as Cyanidioschyzon. A mutant of ycf65 in Synechocystis also showed a growth defect. The Ycf65 protein is likely to be important in both chloroplasts and cyanobacteria.

4 Discussion

4.1 Evaluation of the Prediction Strategy of CPRENDOs

The present study shows that phylogenetic profiling is useful in predicting CPRENDOs. The essential methodology for predicting CPRENDOs consists of (1) constructing homolog groups from the total predicted proteins of both photosynthetic and non-photosynthetic organisms, and (2) selecting groups that are conserved in photosynthetic organisms under appropriate constraints. A probable estimate of

Mass Identification of Chloroplast Proteins of Endosymbiont Origin 65 Table 3: Summary of results of functional analysis. Phenotypes of Synechocystis disruptants, localization and light regulation of Arabidopsis genes, and the number of homozygous tag-lines are listed. For Synechocystis mutants, mutant ID is indicated with phenotypes (segregation state, growth properties, and fluorescence properties, in this order), if present. Localization is shown in abbreviated words (see Table 2). Blue underline indicates light-regulated expression. Confirmed CPRENDOs are marked by bold characters. For homozygous tag-lines, visible phenotype is marked by bold red characters. In annotation, Ycf stands for hypothetical chloroplast ORF. ID Function Synechocystis mutants Arabidopsis reported during the work Annotation Mutant ID: Phenotype (segregation, growth, fluorescence) # of localization (UL=L>D, Bold=CPRENDO) Ath, # of homozygous 5 Hypothetical 10; 11 2 cp, 1 mt 3 6 Hypothetical 12; (13: PCR problem) 1 cp 7 Ycf52 6: -, slow, - 2 cp 1 8 Hypothetical 14 2 cp 3 9 Yes Tab2 (Chlamydomonas) 10 Hypothetical tag-lines 15; (16: PCR problem) 1 cp Not available 17: incomplete, light sensitive, low peak 12 Hypothetical (18: PCR problem) 13 Membrane protease 14 Hypothetical 19: -, light sensitive, very low peak 20: -, slow, high second peak; 21: -, -, Low peak 1 cp 1 cp, 1 cp, 1 (cp), 1 nuc 1 cp, 1 cp, 1 (cp, mt) 15 Hypothetical 22: -, slow, low peak 1 cp 16 Probable ferredoxin (2Fe-2S) 23: -, -, low second peak 1 (cp) 1 cp 1 18 Ycf19 3 2 cp 2 32 Ycf19-like 4: incomplete, -, - 1 cp Not available 19 Hypothetical 24 1 (cp) 21 Ycf60 25: -, pale green and light sensitive, - 1 cp, 1 cp 2 22 Hypothetical 26: -, -, low peak 1 (cp, mt) 1 23 Ycf65 27: -, slow, - 2 cp 1, 1 27 Hypothetical 28 1 (cp) 33 Yes Psb29/Thf1/APG5 29 (sll1414): no phenotype 2g20890 (cp) (Not tried) 34 Rubredoxin 30: -, -, low peak 1 cp Not available 35 Hypothetical 31: incomplete, slow, high peak 1 cp 1 39 Yes APE1 
32(slr0575): -, -, low peak 5g38660 (cp) (Not tried) 40 Hypothetical 8, -, slow, - 1 cp 1 41 Hypothetical 33: -, -, very low peak 1 cp, 1 (cp) Not available 43 Hypothetical 34 1 cp Not available 44 Hypothetical 35: -, -, no decrease after peak 1 cp Not available 3

66 Sato et al. ID Table 4: Continuation of Table 3. Function Synechocystis mutants Arabidopsis reported during the work 46 Yes Annotation HY2 (phycobilin synthesis) Mutant ID: Phenotype (segregation, growth, fluorescence) 36(slr0116): incomplete, -, high peak # of localization (UL=L>D, Bold=CPRENDO) 3g09150 (1 cp) Ath, # of homozygous tag-lines (Not tried) 47 Ycf20 37: -, slow, - 3 (cp, mt) 2 49 Hypothetical 38: -, slow, - 1 cp 51 Hypothetical 38 3 cp 3 54 Hypothetical 40 1 cp 55 Hypothetical 59 ATP-dependent proteinase 62 Hypothetical 41: -, slow, no decrease after peak 1 (cp) 45 2 cp 47: incomplete, light sensitive and slow, - 1 (cp) the number of CPRENDOs is 1192 in Arabidopsis and 676 in Cyanidioscyzon. A previous study [13] estimated the upper limit of chloroplast proteins of endosymbiont origin as about 4,500 in Arabidopsis, and another study [1] suggested about 650-900 plant proteins originated from cyanobacterial endosymbiont. A more recent estimate was about 880 [15]. These estimates were done by calculation, but not by complete enumeration. These studies also showed that a significant proportion of proteins of cyanobacterial origin might be located in non-chloroplast compartment, which is not the case in our result. This could be partly due to the limitation of targeting prediction [15], but also to the inaccuracy in the prediction. In contrast, the results of present study on the minimal set of predicted CPRENDOs clearly indicate that almost all of them are chloroplast proteins, although no targeting prediction was used in the prediction process. A reasonable explanation of the discrepancy may be that we used conservation in 5 cyanobacteria, Arabidopsis, and Cyanidioschyzon as a criterion, while previous studies used conservation in only Arabidopsis and Synechocystis, or a similar simple criterion, which overestimates number of proteins conserved in plants and cyanobacteria. 
In addition, these previous studies used a simple one-plant versus one-cyanobacterium relationship with a single cutoff E-value for all proteins. Our approach, which uses phylogenetic profiling based on homolog groups, gives robust clusters, which could yield a more solid prediction.

4.2 General Usefulness of Phylogenetic Profiling

The general success of our comparative genomics approach prompted us to extend phylogenetic profiling to the prediction of various other proteins that are conserved in a certain group of organisms. Prediction of pathogenicity-related proteins has been done in various bacterial groups that include strains with or without pathogenicity [8, 9]. Such analysis might not need a sophisticated strategy of genomic comparison. However, identification of proteins that are conserved in a wide range of organisms that are not closely related phylogenetically requires solid clustering and phylogenetic profiling. Phylogenetic profiling with the Gclust database will be a powerful tool for identifying plant-specific proteins and proteins specific to flowering plants if more plant genomic sequences become available.
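The selection step of phylogenetic profiling described above can be illustrated with a minimal Python sketch. It is not the Gclust implementation. The function name `select_conserved_groups`, the organism labels, and the toy homolog groups are all hypothetical. The rule used here (a group must contain every required photosynthetic organism and at most a given number of non-photosynthetic ones) is only an assumed stand-in for the "appropriate constraints" mentioned in Section 4.1.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical organism labels: five cyanobacteria, Arabidopsis, and
# Cyanidioschyzon as the required set; the excluded set is an assumption.
REQUIRED: FrozenSet[str] = frozenset(
    {"cyano1", "cyano2", "cyano3", "cyano4", "cyano5", "ath", "cme"}
)
EXCLUDED: FrozenSet[str] = frozenset({"yeast", "ecoli"})

# Toy homolog groups: group ID -> organisms having at least one member.
GROUPS: Dict[str, Set[str]] = {
    "G1": set(REQUIRED),                       # photosynthetic organisms only
    "G2": set(REQUIRED) | {"yeast"},           # one non-photosynthetic hit
    "G3": {"cyano1", "cyano2", "ath"},         # not conserved in all required
    "G4": set(REQUIRED) | {"yeast", "ecoli"},  # widely distributed
}


def select_conserved_groups(
    groups: Dict[str, Set[str]],
    required: FrozenSet[str],
    excluded: FrozenSet[str],
    max_excluded_hits: int = 0,
) -> List[str]:
    """Return IDs of groups found in every required organism and in at
    most `max_excluded_hits` of the excluded organisms."""
    selected = []
    for group_id, organisms in groups.items():
        has_all_required = required <= organisms
        excluded_hits = len(organisms & excluded)
        if has_all_required and excluded_hits <= max_excluded_hits:
            selected.append(group_id)
    return sorted(selected)


strict = select_conserved_groups(GROUPS, REQUIRED, EXCLUDED)
relaxed = select_conserved_groups(GROUPS, REQUIRED, EXCLUDED, max_excluded_hits=1)
print(strict)   # ['G1']
print(relaxed)  # ['G1', 'G2']
```

Requiring presence in many photosynthetic organisms at once, rather than a single plant and a single cyanobacterium, is what makes the profile stricter than a pairwise comparison with one E-value cutoff.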

References

[1] Abdallah, F., Salamini, F., and Leister, D., A prediction of the size and evolutionary origin of the proteome of chloroplasts of Arabidopsis, Trends Plant Sci., 5:141-142, 2000.
[2] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J., Gapped BLAST and PSI-BLAST: A new generation of protein database search programs, Nucleic Acids Res., 25:3389-3402, 1997.
[3] Bansal, A.K. and Meyer, T.E., Evolutionary analysis by whole-genome comparisons, J. Bacteriol., 184:2261-2272, 2002.
[4] Cavalier-Smith, T., Genomic reduction and evolution of novel genetic membranes and protein-targeting machinery in eukaryote-eukaryote chimaeras (meta-algae), Phil. Trans. R. Soc. Lond., 358B:109-134, 2003.
[5] Eisen, J.A., Assessing evolutionary relationships among microbes from whole-genome analysis, Curr. Opinion Microbiol., 3:475-480, 2000.
[6] Emanuelsson, O., Nielsen, H., Brunak, S., and von Heijne, G., Predicting subcellular localization of proteins based on their N-terminal amino acid sequence, J. Mol. Biol., 300:1005-1016, 2000.
[7] House, C.H. and Fitz-Gibbon, S.T., Using homolog groups to create a whole-genomic tree of free-living organisms: An update, J. Mol. Evol., 54:539-547, 2002.
[8] Janssen, P.J., Audit, B., and Ouzounis, C.A., Strain-specific genes of Helicobacter pylori: Distribution, function and dynamics, Nucleic Acids Res., 29:4395-4404, 2001.
[9] Jin, Q., et al., Genome sequence of Shigella flexneri 2a: Insights into pathogenicity through comparison with genomes of Escherichia coli K12 and O157, Nucleic Acids Res., 30:4432-4441, 2002.
[10] Kabeya, Y. and Sato, N., Unique translation initiation at the second AUG codon determines mitochondrial localization of the phage-type RNA polymerases in the moss Physcomitrella patens, Plant Physiol., 138:369-382, 2005.
[11] Kaneko, T., et al., Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions, DNA Res., 3:109-136, 1996.
[12] Kaneko, T., et al., Complete genomic sequence of the filamentous nitrogen-fixing cyanobacterium Anabaena sp. strain PCC 7120, DNA Res., 8:205-213, 2001.
[13] Martin, W., Rujan, T., Richly, E., Hansen, A., Cornelsen, S., Lins, T., Leister, D., Stoebe, B., Hasegawa, M., and Penny, D., Evolutionary analysis of Arabidopsis, cyanobacterial, and chloroplast genomes reveals plastid phylogeny and thousands of cyanobacterial genes in the nucleus, Proc. Nat. Acad. Sci. USA, 99:12246-12251, 2002.
[14] Matsuzaki, M., et al., Genome sequence of the ultrasmall unicellular red alga Cyanidioschyzon merolae 10D, Nature, 428:653-657, 2004.
[15] Richly, E. and Leister, D., An improved prediction of chloroplast proteins reveals diversities and commonalities in the chloroplast proteomes of Arabidopsis and rice, Gene, 329:11-16, 2004.
[16] Sato, N., SISEQ: Manipulation of multiple sequence and large database files for common platforms, Bioinformatics, 16:180-181, 2000.

[17] Sato, N., Was the evolution of plastid genetic machinery discontinuous?, Trends Plant Sci., 6:151-155, 2001.
[18] Sato, N., Comparative analysis of the genomes of cyanobacteria and plants, Genome Inform., 13:173-182, 2002.
[19] Sato, N., Gclust: Genome-wide clustering of protein sequences for identification of photosynthesis-related genes resulting from massive horizontal gene transfer, Genome Inform., 14:585-586, 2003.
[20] Sato, N. and Ishikawa, M., Identification of novel chloroplast proteins of endosymbiotic origin by phylogenetic profiling using homolog groups, Abstract Book of GIW2004, P139, 2004.
[21] The Arabidopsis Genome Initiative, Analysis of the genome sequence of the flowering plant Arabidopsis thaliana, Nature, 408:796-815, 2000.
[22] Wang, D.Y.C., Kumar, S., and Hedges, S.B., Divergence time estimates for the early history of animal phyla and the origin of plants, animals and fungi, Proc. Biol. Sci., 266B:163-171, 1999.
[23] http://nsato4.c.u-tokyo.ac.jp/old/gclust/gclust.html/
[24] http://merolae.biol.s.u-tokyo.ac.jp/
[25] http://signal.salk.edu/tabout.html