The Chloroplast Genome of Nymphaea alba: Whole-Genome Analyses and the Problem of Identifying the Most Basal Angiosperm

Vadim V. Goremykin,* Karen I. Hirsch-Ernst,† Stefan Wölfl,‡ and Frank H. Hellwig*

*Institut für Spezielle Botanik, Universität Jena, Jena, Germany; †Zentrum Pharmakologie und Toxikologie, Universität Göttingen, Göttingen, Germany; and ‡Klinik für Innere Medizin, Universität Jena, Jena, Germany

Angiosperms (flowering plants) dominate contemporary terrestrial flora with roughly 250,000 species, but their origin and early evolution are still poorly understood. In recent years, molecular evidence has accumulated suggesting a dicotyledonous origin of monocots. Phylogenetic reconstructions have suggested that several dicotyledonous groups that include taxa such as Amborella, Austrobaileya, and Nymphaea branch off as the most basal among angiosperms. This has led to the concept of monocots, eudicots, basal dicots, and ANITA groupings. Here, we present the sequence and phylogenetic analyses of the chloroplast DNA of Nymphaea alba. Phylogenetic analyses of our 14-species data set, consisting of 29,991 aligned nucleotide positions per chloroplast genome, revealed consistent support for Nymphaea being a divergent member of a monophyletic dicot assemblage. Three distinct angiosperm lineages were supported in the majority of our phylogenetic analyses: eudicots, Magnoliopsida, and monocots. However, the monocot lineage leading to the grasses was the deepest branching. Although analysis of only one individual gene alignment (out of 61) is consistent with some recently proposed hypotheses for the paraphyly of dicots, we also report observations that nine genes do not support paraphyly of dicots. Instead, they support the basal monocot-dicot split. Consistent with this finding, we also report observations suggesting that the monocot lineage leading to the grasses has the strongest phylogenetic affinity to gymnosperms.
Our findings have general implications for studies of substitution model specification and analyses of concatenated genome data.

Key words: Nymphaea, chloroplast genomes, angiosperms, gymnosperms, molecular evolution, substitution rates.

E-mail: Vadim.Goremykin@uni-jena.de.

Mol. Biol. Evol. 21(7):1445-1454. 2004 doi:10.1093/molbev/msh147 Advance Access publication April 14, 2004

Introduction

A new consensus view of higher-level angiosperm systematics is currently emerging, based mostly on the analysis of three genes (reviewed in Savolainen and Chase [2003]). In this view, the dicot lineages, including Nymphaea and Amborella, are regarded as the deepest branching among angiosperms. Recent prominent studies (Adams et al. 2002; Bergthorsson et al. 2003) have taken for granted that this consensus is correct. In our previous studies involving chloroplast genome analyses (Goremykin et al. 2003a, 2003b), we noticed that although the optimal symmetrical model (GTR + I + Γ) identified by Modeltest (Posada and Crandall 1998) on concatenated chloroplast data was consistent with a basal position of the magnoliids Calycanthus and Amborella, tree building with many other symmetric substitution models did not support this hypothesis. In fact, an alternative hypothesis placing monocots as basal was in most analyses favored with 100% nonparametric bootstrap support. In those papers, we raised the concern of model specification and whether or not the best-fitting symmetric model for our concatenated data gave the most biologically realistic result.

In an effort to shed more light on these earlier findings, we have sequenced the chloroplast genome of Nymphaea alba. It was hoped that addition of this putatively basal species would help break up long branches and, thus, help stabilize the angiosperm tree topology. Nymphaea alba belongs to Nymphaeaceae (Nymphaeales). Like Amborella, Nymphaeales (excluding Nelumbonaceae) have no vessels (Cronquist 1981; Takhtajan 1966).
Their position has been viewed by classical taxonomists as somewhat intermediate between monocotyledonous and dicotyledonous flowering plants. Some botanists (Rohweder and Endress 1983) noted certain similarities, such as floral organs in trimerous whorls, uniting members of this order (Cabomba) and primitive monocotyledons (Alisma). Others considered the morphological similarities between Nymphaeales and the monocots (rhizodermis differentiated into short and long cells, dispersed vascular bundles, compound midrib, and operculate seeds) important enough to actually include the Nymphaeales in the monocotyledons (Schaffner 1904, 1934; Guttenberg and Müller-Schröder 1958; Haines and Lye 1975). A detailed investigation of embryonal development of Nymphaeales (Lodkina 1988) revealed asynchronous and asymmetric development of cotyledons, which was interpreted by the author as evidence of an ancestral position of Nymphaeales to monocots. Indeed, a number of scientists, for different reasons, believed that Nymphaeales belong to the stock of early lineages from which monocots arose (Arber 1920; Takhtajan 1973; Cronquist 1981; Dahlgren and Clifford 1982). Many character states of Nymphaeales considered to be primitive (e.g., uniaperturate pollen, apocarpous gynoecium, numerous stamens, and laminar placentation) point to the antiquity of the order. The complex polymerous flowers of water lilies with no clear differentiation between sepals and carpels fit very well into the euanthian school of flower origin theories. The fossil record supports the great age of Nymphaeales. Friis, Pedersen, and Crane (2001) found fossilized nymphaealean flowers dating back to the early Cretaceous period (125 to 115 Myr) and belonging to the oldest fossil assemblages that contain unequivocal angiosperm stamens and carpels. The representatives of this order tend to be among the first taxa to diverge from the other angiosperm lineages in many molecular studies.
They are either included in the ANITA (Amborella, Nymphaeales, and Illiciales-Trimeniales-Aristolochiales) grade (Qiu et al. 1999; Soltis, Soltis, and Chase 1999; Barkman et al. 2000; Graham and Olmstead 2000; Zanis et al. 2002) or form a clade with Amborella, which is a sister to all other angiosperms (Barkman et al. 2000; Graham and Olmstead 2000; Zanis et al. 2002). Recent publications have not been able to discriminate between those hypotheses. It seems that resolution of this issue will require larger amounts of sequence data. The data set of the 61 chloroplast-coding genes we present and investigate here is the largest available to date for addressing the issue of early-branching angiosperms.

Molecular Biology and Evolution vol. 21 no. 7 © Society for Molecular Biology and Evolution 2004; all rights reserved.

Materials and Methods

Genomic Sequencing

Nymphaea alba leaves were harvested from a plant growing in the botanical gardens of the University of Jena, Germany. Total DNA was extracted from the leaves employing the CTAB method (Murray and Thompson 1980) and further purified with Qiagen columns (Qiagen, Valencia, Calif.) according to the manufacturer's protocol. The Nymphaea alba plastome sequence was amplified by implementing a long-range PCR strategy with primers developed from the alignment of known chloroplast genome sequences, as described previously (Goremykin et al. 2003a, 2003b). We covered the entire Nymphaea alba chloroplast genome with PCR products, which exhibited lengths between approximately 4 and 20 kb. The inverted repeat regions of the cpDNA were amplified separately, each with two PCR products extending to the flanking sequences of the single-copy regions and overlapping in the middle of the respective repeat. PCR products were purified by electrophoresis through low-melting agarose followed by digestion with agarase and were subsequently sheared by nebulization, yielding fragments of 0.5 to 1.5 kb in length. The fragments were cloned into the 4Blunt-TOPO vector employing the TOPO Shotgun Subcloning kit (Invitrogen, Groningen, The Netherlands), according to the manufacturer's protocol. Recombinant plasmids containing individual fragments were isolated from transformed E. coli clones with the Montage Plasmid Miniprep kit (Millipore, Eschborn, Germany).
Sequencing reactions with plasmid DNA were prepared using the Big Dye Terminator sequencing kit (ABI, Foster City, Calif.). Automated sequencing was performed on ABI 3100, ABI 377 (ABI), and MegaBACE 1000 (Amersham/Pharmacia Biotech, Uppsala, Sweden) sequencers.

Sequence Assembly and Annotation

All automated sequencer traces were base-called with the PHRED program (Ewing et al. 1998), and sequence masking and assembly were performed with the STADEN package (Staden, Beal, and Bonfield 2000). Sequencing data were accumulated to 10× coverage for all PCR fragments; remaining gaps were closed by PCR. The Nymphaea alba chloroplast genome sequence has been deposited in the EMBL database under the accession number AJ627251. The primer sequences used for the amplification of plastome sequences by PCR and the alignments employed for phylogenetic analyses are available upon request. The genome was annotated as described previously (Goremykin et al. 2003b).

Results

General Genome Properties

The chloroplast DNA of Nymphaea alba is a 159,930-base-long circular molecule, which has a structure typical for many land plants: large and small single-copy regions separated by inverted repeat regions. Using PHRED confidence values, we determined the total plastome assembly to contain 0.03 incorrectly read bases, which suggests no mistakes in the genomic sequence. The G+C content of the plastome is 39.15%, which is close to that of the other angiosperms (Amborella has, for example, 38.3% G+C, Nicotiana has 37.8%, and Zea has 38.4%). The G+C content of the protein-coding genes of known function found on the Nymphaea alba cpDNA is close to the overall one (40.1%). However, G and C are not uniformly distributed across the different codon positions. The synonymous third codon positions have a G+C content of 31.5%, whereas the first and the second codon positions have a G+C content closer to uniform (i.e., 44.5%).
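The per-codon-position G+C figures quoted above can be obtained with a simple tally. The following is a minimal sketch, not the authors' script: the function name `gc_by_codon_position` and the toy sequence are hypothetical, and the input is assumed to be an in-frame coding sequence without gaps.

```python
def gc_by_codon_position(cds: str) -> dict[int, float]:
    """Return the G+C percentage at codon positions 1, 2 and 3.

    Assumes `cds` is an in-frame, ungapped coding sequence.
    """
    cds = cds.upper()
    if not cds or len(cds) % 3 != 0:
        raise ValueError("coding sequence must be non-empty and a multiple of 3 long")
    result = {}
    for pos in (1, 2, 3):
        bases = cds[pos - 1::3]  # every third base, starting at this codon position
        gc = sum(base in "GC" for base in bases)
        result[pos] = 100.0 * gc / len(bases)
    return result


# Hypothetical toy sequence (five codons), for illustration only.
toy_cds = "ATGGCTAAAGGTTTA"
print(gc_by_codon_position(toy_cds))
```

Applied to the concatenated protein-coding genes of a plastome, the same tally would expose the kind of contrast reported here between third positions and first plus second positions.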
Both the gene order and the gene content of the genome under study are identical to those of Amborella trichopoda cpDNA (Goremykin et al. 2003a) and are very similar to the gene order and the gene content of Calycanthus fertilis cpDNA (Goremykin et al. 2003b), with the exception of the hypothetical ACR-toxin sensitivity gene (ACRS) open reading frame (ORF) found in the latter genome and the shorter inverted repeat region of the Calycanthus plastome, which does not include the rpl2 gene. The gene map of the Nymphaea alba cpDNA is presented in figure 1.

Alignment and Data Properties

The chloroplast genome of Nymphaea alba contains all 61 genes common to completely sequenced chloroplast genomes of the land plants (Ohyama et al. 1986; Shinozaki et al. 1986; Hiratsuka et al. 1989; Wakasugi et al. 1994; Maier et al. 1995; Sato et al. 1999; Hupfer et al. 2000; Kato et al. 2000; Schmitz-Linneweber et al. 2001; Ogihara et al. 2002; Goremykin et al. 2003a, 2003b). Individual alignments of the first and the second codon positions of 61 genes and alignments of their translated sequences were produced with our ClustalW-embedded Perl script. They were manually concatenated and edited to produce a 29,991-position-long nucleotide alignment and a 14,811 amino acid alignment used in phylogenetic analyses. In the nucleotide alignment, we excluded the third codon positions because they could pose problems in phylogeny reconstruction for model fitting and tree building with symmetric substitution models. These sites tend to exhibit high and irregular AT-contents and were found to be very divergent in comparisons of Pinus vs. angiosperms (Goremykin et al. 2003a, 2003b). Even at the first + second codon positions, several species in the data set do not pass a 5% chi-square test of compositional homogeneity. This also raises some concern for phylogenetic analysis of angiosperm/outgroup 1+2 sequence data sets, which we address.
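A chi-square test of compositional homogeneity compares the base counts of one sequence with the counts expected from a reference composition. The sketch below is a generic illustration under stated assumptions, not the exact procedure used in the paper: the reference composition is taken to be the pooled base frequencies of all sequences, gaps and ambiguity codes are ignored, and all function names are hypothetical. The constant 7.815 is the standard 5% critical value of the chi-square distribution with 3 degrees of freedom.

```python
CHI2_CRITICAL_5PCT_DF3 = 7.815  # 5% critical value, chi-square with 3 d.f.
BASES = "ACGT"


def base_counts(seq: str) -> dict[str, int]:
    """Count A, C, G and T; gaps and ambiguity codes are ignored."""
    seq = seq.upper()
    return {b: seq.count(b) for b in BASES}


def pooled_frequencies(seqs: list[str]) -> dict[str, float]:
    """Base frequencies pooled over all sequences (assumed reference composition)."""
    totals = {b: 0 for b in BASES}
    for seq in seqs:
        for b, n in base_counts(seq).items():
            totals[b] += n
    grand_total = sum(totals.values())
    return {b: totals[b] / grand_total for b in BASES}


def composition_chi2(seq: str, expected: dict[str, float]) -> float:
    """Chi-square statistic of one sequence against the expected frequencies."""
    counts = base_counts(seq)
    n = sum(counts.values())
    return sum(
        (counts[b] - n * expected[b]) ** 2 / (n * expected[b])
        for b in BASES
        if expected[b] > 0
    )


def fails_homogeneity(seq: str, expected: dict[str, float]) -> bool:
    """True when the sequence deviates from the reference at the 5% level."""
    return composition_chi2(seq, expected) > CHI2_CRITICAL_5PCT_DF3
```

A sequence flagged by `fails_homogeneity` has a base composition that differs significantly from the pooled one, which is the situation the text describes for several species even at first and second codon positions.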
FIG. 1. Nymphaea alba cpDNA. The topmost part of the map corresponds to the start and the end of the EMBL sequence entry AJ627251. Genes shown inside the circle are transcribed clockwise, and genes outside the circle are transcribed counterclockwise. The genes of the genetic apparatus are shown in red, photosynthesis genes are indicated as green, and genes of NADH dehydrogenase are shown in violet. The ORFs, ycfs, and genes of unknown function are designated as gray. Intron-containing genes (names of which are indicated in blue) are represented by their exons. In the cases when two genes overlap, one of them is shifted off the map to show its position.

Analyses of Concatenated Nucleotide Alignment

The tree depicting the inferred phylogenetic relationship of the species under analysis is presented in figure 2. This topology was found in distance analysis of the 29,991-position alignment of the first and the second codon positions of 61 chloroplast genes employing the Tajima-Nei substitution model as implemented in the Treecon package (Van de Peer et al. 1994) and further confirmed in distance analyses with Jukes-Cantor, Kimura two-parameter, Felsenstein F81, Felsenstein F84, Kimura three-parameter, Hasegawa, Kishino and Yano, Tajima-Nei, Tamura-Nei, and General time-reversible (GTR) models as implemented in the PAUP* package (Swofford 2002), both with and without gamma correction (employing alpha shape parameter 0.27). The branches uniting dicotyledons, magnoliids, and Nymphaea with Amborella were recovered with 100/100 bootstrap proportion (BP) support in all above analyses.

FIG. 2. Neighbor-joining tree built from Tajima-Nei distances derived from analysis of the alignment of the first and the second codon positions from 61 protein-coding genes common to the plastomes of land plants.

Several species in the data set do not pass the 5% chi-square test of compositional homogeneity. The compositional biases in these sequences, however, would not be expected to affect the phylogeny reconstruction because the topology presented in figure 2 was also recovered in the LogDet analysis with 100/100 BP support for all aforementioned branches. This topology was further confirmed in maximum-parsimony analyses performed with heuristic and branch-and-bound searches. The bootstrap values supporting the monophyly of dicotyledonous plants, monophyly of Magnoliopsida, and sister group relationship between Amborella and Nymphaea remained at the maximum level in these analyses.

The same result was recovered in maximum-likelihood (ML) analyses performed with the Tree-Puzzle program (Strimmer and von Haeseler 1996). The ML tree built with the Hasegawa, Kishino, and Yano model of substitution was congruent with the topology shown in figure 2. The monophyletic status of dicots, magnoliids, and Nymphaea with Amborella received, respectively, 95, 98, and 99 quartet puzzling support (QPS). Applying the Tamura-Nei model of substitution resulted in no changes in topology and in a slight change of support for the above three branches: 95, 97, 99 QPS (respectively). The hypothetical branch uniting all angiosperms under analysis with the exception of Nymphaea and Amborella, a branch that would be in compliance with the ANITA grade hypothesis, received no support in these QP ML analyses.

The heuristic ML searches performed with PAUP* yielded different results. The quartet-puzzling algorithm implemented in this program found an alternative topology ((outgroups (Calycanthus (Nymphaea, Amborella)))(monocots, eudicots)) with exactly the same settings and models that we used with Tree-Puzzle. The branch separating outgroups with magnoliids from the rest of the species received low (<60%) QPS support in these analyses. As with our earlier findings (Goremykin et al. 2003a), the best fit model found by the Modeltest program (Posada and Crandall 1998) suggested a General Time-Reversible model with positional rate heterogeneity. Tree building under heuristic ML (PAUP*) with the optimal symmetric model yielded a topology with a clade bearing Nymphaea and Amborella branching first among the angiosperms, followed by Calycanthus, and then by the dichotomy monocots-eudicots. However, as with the findings reported previously, deviations in the choice of model and parameter values gave trees wherein branches united dicotyledons, magnoliids, and Nymphaea with Amborella. In these analyses monocots were basal.

Analyses of Concatenated Amino Acid Alignment

These results were further tested with analyses of the 14,811-position-long alignment of the translated sequences. Heuristic search employing the maximum-parsimony algorithm (PAUP*) yielded the topology congruent with monocots basal (fig. 2). The branch bearing Nymphaea and Amborella and the one uniting all dicotyledonous plants were found in 100/100 bootstrap trees, whereas the branch bearing the three magnoliids was recovered in 99/100 bootstrap replicas. Distance analyses of the protein alignment were performed with the Treecon package employing Kimura and Tajima-Nei models and with the PHYLIP package (Felsenstein 1989) with the Dayhoff model.
These analyses resulted in topologies identical to the one presented in figure 2, with 100/100 BP support values for the three aforementioned branches. The neighbor-net (Bryant and Moulton 2004) tree built with the Protein LogDet method (Tholesson 2004) with all amino acid sites included had a topology congruent with the one shown in figure 2. Maximum-likelihood analyses were performed with the Tree-Puzzle program with default settings and the root assigned to Marchantia. The branching order of the eudicot clades (Nicotiana/Spinacia), (Arabidopsis/Lotus), and the one leading to Oenothera could not be resolved in all ML analyses. The eudicot monophyly, though, as well as monophyly of magnoliids, dicots, and the sister group relationship between Amborella and Nymphaea, received strong support. With Müller-Vingron, BLOSUM, Adachi-Hasegawa, Dayhoff, and Jones and Jones substitution models, the lowest QPS value supporting the eudicot branch in the above ML analyses was 96 QPS. The single lowest quartet-puzzling support value obtained for the other three branches was 97.

Analyses of the Individual Alignments

Rearrangements in 11 plastomes of the spermatophytes affecting gene order involve large chunks of DNA. Therefore, on the level of at least spermatophytes, orthology of every gene under analysis as well as common evolutionary history of all 61 genes can easily be proved on the basis of gene order identity and general sequence similarity along the large stretches of cpDNA from different species. Yet the land plant tree topologies derived from the chloroplast genes are often different. To investigate possible phylogenetic biases of individual genes, we counted the bootstrap values supporting the branches bearing outgroups with Nymphaea, Amborella or with Nymphaea and Amborella taken together, Calycanthus, Magnoliopsida, Rosopsida, and grasses in NJ/GTR trees built from the alignments of the first and the second codon positions from 61 coding genes common to the genomes of the land plants. The results of these analyses are shown in table 1. Here, we label a certain branch to be supported by a protein alignment when the corresponding BP support value is no lower than the arbitrary value of 60%. One can see that one of six branches is supported by a much larger number of genes than the other five. This is the branch uniting outgroups with grasses that we recovered from the majority of analyses of the concatenated data set. This branch found some support in the analysis of atpF (67% BP), matK (60% BP), rpoB (99% BP), rpoC1 (100% BP), rpoC2 (99% BP), rps12 (61% BP), rps18 (76% BP), rps3 (77% BP), and rps8 alignments (90% BP). These alignments exhibit, respectively, 117, 479, 483, 357, 939, 24, 50, 146, and 81 informative positions. We found no clear cases of support on the level of individual proteins for the branches uniting outgroups with (1) Nymphaea, (2) Amborella, and (3) Magnoliopsida. The branches bearing outgroups with (1) Nymphaea + Amborella, (2) Calycanthus, and (3) Rosopsida were supported each by a single protein, by, respectively, 74% BP (psbF), 73% BP (psaJ), and 70% BP (ycf3). The number of informative positions in alignments of these proteins are, respectively, 7, 17, and 35.
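The 60% rule described above amounts to a per-branch count over the rows of table 1. The sketch below is illustrative only: it copies a subset of rows from table 1 (the genes named in the text plus two near-threshold genes, rpl2 and atpE), and the names `BRANCHES`, `BP_VALUES`, and `count_supporting_genes` are hypothetical.

```python
# Column order follows table 1.
BRANCHES = (
    "Grasses", "Amborella", "Nymphaea", "Nymphaea + Amborella",
    "Calycanthus", "Magnoliids", "Rosopsida",
)

# Subset of table 1: bootstrap proportion (%) per gene and branch.
BP_VALUES = {
    "atpF":  (67, 4, 3, 5, 0, 21, 1),
    "matK":  (60, 2, 0, 21, 0, 12, 15),
    "rpoB":  (99, 0, 0, 0, 0, 0, 1),
    "rpoC1": (100, 0, 0, 0, 0, 0, 0),
    "rpoC2": (99, 0, 0, 0, 0, 0, 0),
    "rps12": (61, 2, 11, 27, 0, 0, 0),
    "rps18": (76, 0, 0, 0, 0, 0, 14),
    "rps3":  (77, 4, 5, 18, 0, 15, 0),
    "rps8":  (90, 0, 0, 0, 0, 0, 2),
    "psbF":  (0, 18, 15, 74, 0, 0, 0),
    "psaJ":  (0, 11, 0, 0, 73, 37, 1),
    "ycf3":  (0, 0, 6, 0, 0, 0, 70),
    "rpl2":  (7, 0, 11, 0, 59, 4, 20),   # just below the threshold for Calycanthus
    "atpE":  (6, 55, 1, 5, 0, 28, 0),    # highest Amborella value in this subset
}


def count_supporting_genes(bp_values, threshold=60):
    """Count, per branch, the genes whose BP value is at least `threshold`."""
    counts = {branch: 0 for branch in BRANCHES}
    for values in bp_values.values():
        for branch, bp in zip(BRANCHES, values):
            if bp >= threshold:
                counts[branch] += 1
    return counts


print(count_supporting_genes(BP_VALUES))
```

On this subset the count reproduces the pattern stated in the text: nine genes for the grasses branch, one each for Nymphaea + Amborella, Calycanthus, and Rosopsida, and none for the remaining branches.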
Additional Tests

We wished to evaluate support for the clade grouping the monocotyledonous species with outgroups in our analyses. One simple way of testing outgroup affinity of different angiosperm branches would be to count the number of positions in which the group to be tested shares with the most closely related outgroup a character state that is not observed in any other angiosperm sequence. Given a low level of homoplasy (achieved, for example, by excluding the highly homoplastic third positions), one would expect such positions in the group to be tested to contain the ancestral character states that existed before the splitting of angiosperms and the outgroup and that subsequently mutated in the other angiosperm lineage. Another way to test the basal monocot-dicot split would be to count the number of synapomorphies supporting alternative branches within the angiosperm ingroup. One can expect more synapomorphies between more closely related ingroup taxa.

We found the concatenated alignment of the first and the second codon positions sampled from 61 chloroplast genes to contain 68 positions with bases shared between three grasses and Pinus to the exclusion of other angiosperms. Three species of Magnoliopsida (Calycanthus, Nymphaea, and Amborella) share only 13 such positions with Pinus, and five species of Rosopsida share only five. The part of the ANITA grade (Nymphaea + Amborella + Pinus) is supported by 10 positions. We deleted nonspermatophyte outgroups and repeated the analysis. The number of positions supporting outgroup affinity of the above four groups changed to 151 (grasses + Pinus), 56 (Nymphaea + Amborella + Pinus), 30 (Calycanthus + Amborella + Nymphaea + Pinus), and 19 (Rosopsida + Pinus). For comparison, monophyly of angiosperms and spermatophytes is supported by 532 and 631 positions, respectively. The positions supporting outgroup affinity of grasses and of the species from the ANITA grade are shown in figure 3.
In the total alignment of the first and the second codon positions, the mean distance across the range Pinus to monocots is 0.17 substitutions/position (ML estimation with Tamura-Nei model of substitution). Given that value, the probability that a position mutated twice since the separation of the gymnosperm and monocot lines is 0.0289, which corresponds to one twice-mutated position out of 34.6. Therefore, if the rate of substitutions in 151 alignment positions supporting grouping of monocots with Pinus (fig. 3) does not exceed the mean one characteristic of the whole alignment, this subset could be expected to contain approximately four homoplastic positions. The mean distance among eight dicotyledonous plants in the 151-position subset is 0.11, which is higher than the corresponding distance observed in the total alignment of 61 genes (0.065 substitutions/position, same model). Taking into account this somewhat elevated substitution rate, the 151-position subset can be expected to contain approximately seven homoplastic positions. One can also note that the GC content of the 151-position subset (49.3%) is close to equilibrium and is similar to the total GC content of the alignment (44.4%). However, because the first codon positions can undergo synonymous substitutions, it is not immediately clear from the above observations whether support for gymnosperm affinity of the line leading to Zea, Oryza, and Triticum would be reflected in the protein sequences. So we repeated the analysis, this time using a 14,811 amino acid alignment. It was found to contain 45 positions that clearly support the separation of grasses and Pinus from the rest of the species under analysis. By contrast, the affinity of ANITA members to Pinus was supported by 19 positions, and that of Magnoliopsida and Rosopsida was supported by nine and four positions, respectively. As in the analyses of nucleotide sequences, removal of fern and liverwort sequences resulted in a stronger signal. 
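The homoplasy estimates in this passage follow from short arithmetic, reproduced here as a sketch. Two assumptions are made and are not stated in the source: the probability of a double hit is approximated as the square of the mean distance, and the estimate of about seven is obtained by scaling the first estimate linearly by the ratio of the two dicot distances (0.11 / 0.065), which is one reading that reproduces the reported figure. Variable names are illustrative.

```python
mean_distance = 0.17                      # Pinus to monocots, substitutions/position
p_double_hit = mean_distance ** 2         # approx. probability a position mutated twice
positions_per_double_hit = 1 / p_double_hit

subset_size = 151                         # positions grouping monocots with Pinus
expected_homoplastic = subset_size * p_double_hit

# Assumed linear correction for the elevated rate in the 151-position subset.
rate_ratio = 0.11 / 0.065
expected_homoplastic_elevated = expected_homoplastic * rate_ratio

print(f"P(double hit)              = {p_double_hit:.4f}")
print(f"one double hit per         = {positions_per_double_hit:.1f} positions")
print(f"expected homoplastic sites = {expected_homoplastic:.2f}")
print(f"with elevated rate         = {expected_homoplastic_elevated:.2f}")
```

Rounded, these values match the approximately four and approximately seven homoplastic positions quoted in the text, a small fraction of the 151 positions.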
In the alignment containing only spermatophyte species, there are 86 positions in which Pinus and the three grasses share a character to the exclusion of other angiosperms, as opposed to 42, 13, and 11 positions supporting affinity of Pinus to, respectively, ANITA members, Magnoliopsida, and Rosopsida. The nucleotide alignment of 14 OTUs with 29,991 positions per species contains 93 synapomorphies shared among the dicots, whereas the number of synapomorphies supporting monophyly of eudicots and monocots is 17. In the 14,811 amino acid alignment, there are 93 synapomorphies that unite all dicots. The number of synapomorphies that unite monocots and eudicots in this alignment is 11.

Table 1
Results of the Individual Analysis of the Alignments of the First and the Second Codon Positions
(Informative = number of informative positions)

                                Bootstrap Proportion Values (a)
Alignment  Length  Informative  Grasses  Amborella  Nymphaea  Nymphaea+Amborella  Calycanthus  Magnoliids  Rosopsida
atpA         1026          142       32          0         1                   0            0           9         42
atpB         1002          100       27          1        40                  29            0          16          7
atpE          272           75        6         55         1                   5            0          28          0
atpF          370          117       67          4         3                   5            0          21          1
atpH          164           10        0          0         0                   0            0           0          0
atpI          502           63       31          5         0                   0            4           1          7
ccsA          662          233       15          1        23                  18            2          36         14
cemA          466          165       25         43         0                  27            0          20         14
clpP          448          125       11          0         0                   0            0           0          0
lhbA          126           19       14          0         0                   0            0           0          0
matK         1061          479       60          2         0                  21            0          12         15
petA          648           85        8          6        14                  38            0          30         15
petB          432           20       31          1         0                   1            0           0          0
petD          322           14       17          0         1                   0            9           0          0
petG           76            9        0         10         0                   1            0           0          0
petL           64           17       29          0         1                   0            0           0          2
petN           60            3        1          2         1                   0            1           0          1
psaA         1510          103       42          8        12                   6            0           1          9
psaB         1472           82       17          2         0                   9            0           5         10
psaC          164            9       22          0         0                   0            0           0          0
psaI           74           22       19          0         6                   1            8           2          2
psaJ           86           17        0         11         0                   0           73          37          1
psbA          728           23       25         35         0                   0            0           6          7
psbB         1018           74       27          6         6                   9            1          33         25
psbC          949           52        7          5        39                  22            0           4          1
psbD          708           32       15         19         4                  24            7          10         14
psbE          168           10        0          1        18                   1           23           1          0
psbF           80            7        0         18        15                  74            0           0          0
psbH          148            0       36          0         0                   5            0           9          0
psbI           74            4        0          1         5                   0            3           1          0
psbJ           82           10        0          0         0                   0            0           0          1
psbK          125           32       30          7         0                   3            5          11         15
psbL           78            6        0          8         1                   0            0           0         26
psbM           71           13        3         10         0                   0            0           0          0
psbN           88            9        7          1         2                   1           41           0          0
psbT           72           12        0          0         0                   4            0           0          0
rbcL          954           88        7          0         0                  48            0          47          0
rpl14         250           45       16         19        13                  21            0           1          5
rpl16         274           47        7         17         2                  26            0          17         20
rpl2          558           92        7          0        11                   0           59           4         20
rpl20         248           84       39          0         0                  15            0          10         12
rpl32         134           50        0          1         1                   0            0           1          8
rpl33         140           41        3         13         1                   1           13           1          0
rpl36          76            0       10          0         0                   0            0           0          0
rpoA          691          226       25          6         0                  40            0          50         12
rpoB         2194          483       99          0         0                   0            0           0          1
rpoC1        1418          357      100          0         0                   0            0           0          0
rpoC2        2897          939       99          0         0                   0            0           0          0
rps11         294           71        0          0         0                   0           40           2          1
rps12         250           24       61          2        11                  27            0           0          0
rps14         208           43       22          6        53                   6            0           0          0
rps15         190           77        1         10        38                   1            2          13          0
rps18         212           50       76          0         0                   0            0           0         14
rps19         190           48       11          7        30                   6            0           2          2
rps2          476          118       50          3         1                   5            0           6         14
rps3          490          146       77          4         5                  18            0          15          0
rps4          415           89       12          5         0                  17            0           6         47
rps7          314           52       17          9         9                   4            1           0          0
rps8          278           81       90          0         0                   0            0           0          2
ycf3          338           35        0          0         6                   0            0           0         70
ycf4          406           94        1          0        16                   4            0          11         49

(a) Bootstrap proportion values supporting the branch bearing outgroups with grasses, Amborella, Nymphaea, Nymphaea plus Amborella, Magnoliopsida, and Rosopsida.

FIG. 3. The alignment positions in which grasses (above) and members of the ANITA group (below) share characters with Pinus to the exclusion of all other angiosperm lines in the alignment of the first and the second codon positions of 61 protein-coding genes common to 14 genomes of land plants.

Discussion

Our findings of strong contradictory signals between different phylogenetic analyses of angiosperm chloroplast genome data highlight the concern recently raised in Molecular Biology and Evolution over appropriate analysis of complete genome data and the problems of sequence concatenation (Holland et al. 2004). Our findings are also cautionary: they indicate that further complete genome sequences are needed before strong conclusions can be drawn about the identity of the most basal angiosperms.

In the majority of our analyses, we detected support neither for (1) a monocot affinity of the line leading to modern Nymphaeales nor for (2) an early appearance of Nymphaeales in the evolutionary history of flowering plants. These analyses suggested that Nymphaea alba is a derived representative of a dicot lineage, which is, of all species under analysis, most closely related to Amborella trichopoda. The clade bearing Amborella with Nymphaea was detected with lower support previously (Barkman et al. 2000; Graham and Olmstead 2000), although not in a derived position, because it appeared as the first branch to
split off among angiosperms in these trees. Because our taxon sampling remains limited, it is, however, premature to assert a close taxonomic relationship between Amborella and Nymphaeales. In analyses of the individual genes we registered multiple cases of support for the sister group relationship of those two plants, yet little support for their close gymnosperm affinity.

It is the line leading to the three grasses that was found in the most basal position within the angiosperms in the highest number of phylogenetic analyses of the individual gene alignments. This branching was supported by the alignments of the genes (matK, rpoB, rpoC1, and rpoC2) containing the greatest numbers of informative positions among the 61 individual gene alignments we built. All alternative branches tested either received no support in these analyses or were each supported by a single alignment of a short gene with a low number of informative positions. The interpretation of these results is straightforward: there are no bona fide cases of support for alternative rootings of angiosperms at the level of individual gene alignments. The appearance of alternative topologies in analyses of individual genes is probably the result of stochastic variation in the substitution process becoming visible when the character sample is small, as was recently observed in the analysis of 108 genes from yeasts (Rokas et al. 2003). Because bootstrap proportion values are affected by the ratio of characters supporting different topologies but not by their absolute numbers, such small biased samples could still exhibit high BP support values for incorrectly inferred branches.

Another conclusion that can be drawn from our analyses of individual genes is that data sets with few characters can be generally unreliable for elucidating phylogeny. Analyzed individually, no alignment recovered the very stable topology (fig. 2) we inferred from the analysis of the concatenated nucleotide alignment of 29,991 positions employing the same method of tree construction. This is consistent with the newer findings of Rokas et al. (2003) and with earlier analyses of chloroplast genome phylogeny comparing individual and concatenated alignments (Goremykin, Hansmann, and Martin 1997; Martin et al. 1998; Lockhart et al. 1999).

An attempt to additionally test the outgroup affinity of the grasses revealed that, among the branches checked, they share the largest number of characters with Pinus to the exclusion of all other angiosperms. The positions supporting that affinity are not hypermutated and do not exhibit any strong compositional bias. The number of such positions exceeds the number of positions supporting outgroup affinity of ANITA members approximately threefold in the nucleotide alignment and twofold in the amino acid alignment; other conflicting signals are substantially less pronounced. It is possible for nonadjacent and unrelated taxa on a tree to have more sequence identities than adjacent and related taxa if the molecular clock is violated in such a way that the substitution rate among the nonadjacent taxa becomes comparatively small. However, this consideration cannot explain the result in figure 3, because both the grasses and Pinus are borne on the branches that are the longest among the spermatophytes.

One can note that the above results are in good accord with the numbers of synapomorphies that can be observed within the angiosperm ingroup. In the concatenated protein and nucleotide alignments, the clade uniting all dicots is supported by, respectively, eightfold and fivefold larger numbers of synapomorphies than an alternative clade bearing eudicots and monocots. These observations and the results of the individual gene analyses stand in sharp contrast to the topology favored by the optimal symmetric model applied to concatenated sequence data.
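The fold differences quoted in this part of the Discussion can be checked against the counts reported in the Results. The sketch below is a simple illustration: the dictionary labels and the helper name are hypothetical, and only counts stated in the text are used. The nucleotide count for ANITA members appears only in figure 3 and is therefore left out.

```python
# Counts taken from the Results text: (favoured signal, competing signal).
counts = {
    # Positions shared with Pinus: grasses versus ANITA members.
    "Pinus affinity, amino acids, all species": (45, 19),
    "Pinus affinity, amino acids, spermatophytes only": (86, 42),
    # Synapomorphies: all dicots versus monocots plus eudicots.
    "synapomorphies, nucleotides": (93, 17),
    "synapomorphies, amino acids": (93, 11),
}


def fold_difference(favoured, competing):
    """Ratio of the two character counts."""
    return favoured / competing


ratios = {label: fold_difference(a, b) for label, (a, b) in counts.items()}

for label, value in ratios.items():
    print(f"{label}: {value:.1f}-fold")
```

The synapomorphy ratios come out at roughly 5.5 for nucleotides and 8.5 for amino acids, in line with the "fivefold" and "eightfold" figures, and the spermatophyte-only amino acid ratio is close to two.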
The apparent contradiction between these character counts and the topology favored by the concatenated analysis may be easily explained. When the internal branches of the true underlying tree are short compared with the length of the external branches, tree building is expected to be problematic (Hendy and Penny 1987). This problem is exacerbated when sequence evolution is not well described by the assumed substitution model (Jermiin et al. 2004). In such cases, small deviations in model parameters can potentially lead to very different tree topologies supported by high BP support values. Once the genes are concatenated, the optimal symmetric model will merely be the best average fit to all of them.

The majority of analyses presented here converge on the basal monocot-dicot split. However, the extreme difference between internal ingroup and external outgroup branch lengths in the concatenated gene angiosperm tree shown in figure 2 suggests that we may still be some way from confidently identifying the most basal angiosperm. Clearly, the sequencing of genomes for more closely related outgroups and putatively basal angiosperms will be important for overcoming potential problems of model misspecification and long-branch attraction.

Supplementary Material

The sequence reported in this paper has been deposited in the EMBL database (accession number AJ627251).

Acknowledgment

This publication was supported by a grant of the Deutsche Forschungsgemeinschaft.

Literature Cited

Adams, K. L., Y. L. Qiu, M. Stoutemyer, and J. D. Palmer. 2002. Punctuated evolution of mitochondrial gene content: high and variable rates of mitochondrial gene loss and transfer to the nucleus during angiosperm evolution. Proc. Natl. Acad. Sci. USA 99:9905–9912.
Arber, A. 1920. Water plants: a study of the aquatic angiosperm. Cambridge University Press, London.
Barkman, T. J., G. Chenery, J. R. McNeal, J. Lyons-Weiler, W. J. Ellisens, G. Moore, A. D. Wolfe, and C. W. dePamphilis. 2000. Independent and combined analyses of sequences from all three genomic compartments converge on the root of flowering plant phylogeny. Proc. Natl. Acad. Sci. USA 97:13166–13171.
Bergthorsson, U., K. L. Adams, B. Thomason, and J. D. Palmer. 2003. Widespread horizontal transfer of mitochondrial genes in flowering plants. Nature 424:197–201.
Bryant, D., and V. Moulton. 2004. Neighbor-net: an agglomerative method for the construction of phylogenetic networks. Mol. Biol. Evol. 21:255–265.
Cronquist, A. 1981. An integrated system of classification of flowering plants. Columbia University Press, New York.
Dahlgren, R. M. T., and H. T. Clifford. 1982. The Monocotyledons: a comparative study. Academic Press, New York.
Ewing, B., L. Hillier, M. C. Wendl, and P. Green. 1998. Basecalling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8:175–185.
Felsenstein, J. 1989. PHYLIP (phylogeny inference package). Version 3.2. Cladistics 5:164–166.
Friis, E. M., K. R. Pedersen, and P. R. Crane. 2001. Fossil evidence of water lilies (Nymphaeales) in the early Cretaceous. Nature 410:357–360.
Goremykin, V., S. Hansmann, and W. Martin. 1997. Evolutionary analysis of 58 proteins encoded in six completely sequenced chloroplast genomes: revised molecular estimates of two seed plant divergence times. Plant Syst. Evol. 206:337–351.
Goremykin, V. V., K. I. Hirsch-Ernst, S. Wölfl, and F. H. Hellwig. 2003a. Analysis of the Amborella trichopoda chloroplast genome sequence suggests that Amborella is not a basal angiosperm. Mol. Biol. Evol. 20:1499–1505.
Goremykin, V. V., K. I. Hirsch-Ernst, S. Wölfl, and F. H. Hellwig. 2003b. The chloroplast genome of the basal angiosperm Calycanthus fertilis: structural and phylogenetic analyses. Plant Syst. Evol. 242:119–135.
Graham, S. W., and R. G. Olmstead. 2000. Utility of 17 chloroplast genes for inferring the phylogeny of the basal angiosperms. Am. J. Bot. 87:1712–1730.
Guttenberg, H. V., and R. Müller-Schröder. 1958. Untersuchungen über die Entwicklung des Embryos und der Keimpflanze von Nuphar luteum. Planta 51:481–510.
Haines, R. W., and K. A. Lye. 1975. Seedlings of Nymphaeaceae. J. Linn. Soc. Bot. 70:255–265.
Hendy, M. D., and D. Penny. 1987. Edge lengths of trees from sequence data. Math. Biosci. 83:157–165.
Hiratsuka, J., H. Shimada, R. Whittier et al. (16 co-authors). 1989. The complete sequence of the rice (Oryza sativa) chloroplast genome: intermolecular recombination between distinct tRNA genes accounts for a major plastid DNA inversion during the evolution of the cereals. Mol. Gen. Genet. 217:185–194.
Holland, B. R., K. T. Huber, V. Moulton, and P. J. Lockhart. (in press). Using consensus networks to visualize contradictory evidence for species phylogeny. Mol. Biol. Evol.
Hupfer, H., M. Swiatek, S. Hornung, R. G. Hermann, R. M. Maier, W.-L. Chiu, and B. Sears. 2000. Complete nucleotide sequence of the Oenothera elata plastid chromosome, representing plastome I of the five distinguishable Euoenothera plastomes. Mol. Gen. Genet. 263:581–585.
Jermiin, L. S., S. Y. W. Ho, F. Ababneh, J. Robinson, and A. W. D. Larkum. (in press). The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Syst. Biol.
Kato, T., T. Kaneko, S. Sato, Y. Nakamura, and S. Tabata. 2000. Complete structure of the chloroplast genome of a legume, Lotus japonicus. DNA Res. 7:323–330.
Lockhart, P. J., C. J. Howe, A. C. Barbrook, A. W. D. Larkum, and D. Penny. 1999. Spectral analysis, systematic bias, and the evolution of chloroplasts. Mol. Biol. Evol. 16:573–576.
Lodkina, M. M. 1988. Evolutionary relationships of monocots and dicots derived from studies of embryo and seedlings data. Botanichesky Zhurnal 73:617–629. (in Russian).
Maier, R. M., K. Neckermann, G. L. Igloi, and H. Kossel. 1995. Complete sequence of the maize chloroplast genome: gene content, hotspots of divergence and fine tuning of genetic information by transcript editing. J. Mol. Biol. 251:614–628.
Martin, W., B. Stoebe, V. Goremykin, S. Hansmann, M. Hasegawa, and K. V. Kowallik. 1998. Gene transfer to the nucleus and the evolution of chloroplasts. Nature 393:162–165.
Murray, M. G., and W. F. Thompson. 1980. Rapid isolation of high molecular weight DNA. Nucleic Acids Res. 8:4321–4325.
Ogihara, Y., K. Isono, T. Kojim et al. (19 co-authors). 2002. Structural features of a wheat plastome as revealed by complete sequencing of chloroplast DNA. Mol. Genet. Genomics 266:740–746.
Ohyama, K., H. Fukuzawa, T. Kohchi et al. (13 co-authors). 1986. Chloroplast gene organization deduced from complete sequence of liverwort Marchantia polymorpha chloroplast DNA. Nature 322:572–574.
Posada, D., and K. A. Crandall. 1998. Modeltest: testing the model of DNA substitution. Bioinformatics 14:817–818.
Qiu, Y.-L., J. Lee, F. Bernasconi-Quadroni, D. E. Soltis, P. S. Soltis, M. Zanis, E. A. Zimmer, Z. Chen, V. Savolainen, and M. W. Chase. 1999. The earliest angiosperms: evidence from mitochondrial, plastid and nuclear genomes. Nature 402:404–407.
Rohweder, O., and P. K. Endress. 1983. Samenpflanzen. Georg Thieme Verlag, Stuttgart.
Rokas, A., B. L. Williams, N. King, and S. B. Carroll. 2003. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425:798–804.
Sato, S., Y. Nakamura, T. Kaneko, E. Asamizu, and S. Tabata. 1999. Complete structure of the chloroplast genome of Arabidopsis thaliana. DNA Res. 6:283–290.
Savolainen, V., and M. W. Chase. 2003. A decade of progress in plant molecular phylogenetics. Trends Genet. 19:717–724.
Schaffner, J. H. 1904. Some morphological peculiarities of the Nymphaeaceae and Helobiae. Ohio Nat. 4:83–92.
Schaffner, J. H. 1934. Phylogenetic taxonomy of plants. Quart. Rev. Biol. 9:129–160.
Schmitz-Linneweber, C., R. M. Maier, J. P. Alcaraz, A. Cottet, R. G. Herrmann, and R. Mache. 2001. The plastid chromosome of spinach (Spinacia oleracea): complete nucleotide sequence and gene organization. Plant Mol. Biol. 45:307–315.
Shinozaki, K., M. Ohme, M. Tanaka et al. (23 co-authors). 1986. The complete nucleotide sequence of the tobacco chloroplast genome: its gene organization and expression. EMBO J. 5:2043–2049.
Soltis, P. S., D. E. Soltis, and M. W. Chase. 1999. Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology. Nature 402:402–403.
Staden, R., K. F. Beal, and J. K. Bonfield. 2000. The Staden package, 1998. Methods Mol. Biol. 132:115–130.
Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964–969.
Swofford, D. L. 2002. PAUP*: phylogenetic analysis using parsimony (*and other methods). Version 4. Sinauer Associates, Sunderland, Mass.
Takhtajan, A. 1966. Systema et phylogenia Magnoliophytorum. Nauka, Moscow, Leningrad.
Takhtajan, A. 1973. Evolution und Ausbreitung der Blütenpflanzen. Gustav Fischer Verlag, Jena.
Tholesson, M. 2004. LDDist: a Perl module for calculating LogDet pair-wise distances for protein and nucleotide sequences. Bioinformatics 20:416–418.
Van de Peer, Y., and R. De Wachter. 1994. TREECON for Windows: a software package for the construction and drawing of evolutionary trees for the Microsoft Windows environment. Comput. Applic. Biosci. 10:569–570.
Wakasugi, T., J. Tsudzuki, S. Ito, K. Nakashima, T. Tsudzuki, and M. Sugiura. 1994. Loss of all ndh genes as determined by sequencing the entire chloroplast genome of the black pine Pinus thunbergii. Proc. Natl. Acad. Sci. USA 91:9794–9798.
Zanis, M. J., D. E. Soltis, P. S. Soltis, S. Mathews, and M. J. Donoghue. 2002. The root of the angiosperms revisited. Proc. Natl. Acad. Sci. USA 99:6848–6853.

Peter Lockhart, Associate Editor
Accepted March 29, 2004