Removal of Noisy Characters from Chloroplast Genome-Scale Data Suggests Revision of Phylogenetic Placements of Amborella and Ceratophyllum

J Mol Evol (2009) 68:197–204
DOI 10.1007/s00239-009-9206-9

Vadim V. Goremykin · Roberto Viola · Frank H. Hellwig

Received: 29 June 2008 / Accepted: 29 January 2009 / Published online: 27 February 2009
© Springer Science+Business Media, LLC 2009

V. V. Goremykin (corresponding author) · R. Viola
IASMA Research Center, Via E. Mach 1, 38010 San Michele all'Adige, TN, Italy
e-mail: vadim.goremykin@iasma.it

F. H. Hellwig
Institut für Spezielle Botanik, Universität Jena, Philosophenweg 16, 07743 Jena, Germany

Abstract
It is widely appreciated that noisy, highly variable data can impede phylogeny reconstruction. Researchers have long omitted problematic data, such as third-codon positions and variable regions, from phylogenetic analyses. In analyses of the phylogenetic relations of the angiosperms, however, inclusion of complete gene sequences in genome-scale alignments has become common practice. Here we demonstrate that this practice can be misleading. We show that support for the basal-most position of Amborella trichopoda among the angiosperms in the chloroplast genomic data is based only on a tiny subset (<1% of the total alignment length) of the most variable positions in the alignment, which exhibit a mean maximum likelihood (ML) distance among the angiosperm operational taxonomic units (OTUs) of approximately 36 substitutions/site. Exclusion of these positions leads to the disappearance of the basal Amborella branch. Likewise, the recently reported sister-group relationship of Ceratophyllum to the eudicots is based on the presence of 2% of the most variable positions in the genomic alignment, which exhibit, on average, 2 substitutions/site in comparisons among the angiosperm OTUs. These observations highlight the need to exclude a certain proportion of saturated alignment positions from phylogenomic analyses.

Keywords Chloroplast genomes · Molecular evolution · Angiosperm diversification

Introduction
The first years of plant phylogenomics demonstrated that amassing a large number of characters is not sufficient to ensure the accuracy of phylogenetic inference (Soltis et al. 2004; Goremykin et al. 2005). We also observed that systematic errors inherent to the current methods of phylogeny reconstruction may lead to spurious results that are nonetheless strongly supported by the nonparametric bootstrap (Goremykin and Hellwig 2006). Recently, Jeffroy et al. (2006) made the interesting observation that a more reliable topology might be obtained with a worse-performing method applied to data with less saturation than with a more exact method applied to data with high saturation. In particular, they observed that removal of the fast-evolving positions makes the results of phylogeny reconstruction much less affected by nonphylogenetic (in particular compositional) signals and, consequently, recommended the voluntary discarding of part of the data from a phylogenetic analysis. In this study, we wished to check whether such a decrease in variability could provide new insights into the early diversification of angiosperms, an area of considerable recent debate (Stefanovic et al. 2004; Soltis et al. 2004; Goremykin et al. 2005; Leebens-Mack et al. 2005; Goremykin and Hellwig 2006; Jansen et al. 2007; Moore et al. 2007). We hypothesized that the proportion of data still bearing witness to the old diversification events might be low in this case, given the relatively short geologic time span for the origin of the major angiosperm groups compared with the long time that has passed since then (Crane et al. 1995). Attempts to resolve this old radiation are hampered by massive extinction in certain angiosperm lines, which left a number of highly isolated taxa subtended by long branches at the base of the angiosperm tree (Amborellales, Ceratophyllales, and Nymphaeales). Because of the superimposed mutations, various attraction artefacts can be deemed quite probable under such conditions. Because nothing can be done to improve taxon sampling in the vicinity of such isolated angiosperm lines, removal of the distorting nonphylogenetic signal (i.e., noise) is the only currently available way to improve phylogenetic reconstruction of these taxa.

Previously we routinely excluded the highly divergent third-codon positions from our genomic analyses because of these concerns. Recently Stefanovic et al. (2004) and Leebens-Mack et al. (2005) reported that the third-codon positions should be included in phylogeny reconstruction studies on basal angiosperms. Following their suggestion, this time we present the results obtained with and without the third-codon positions as well as the results obtained after discarding a small proportion of the saturated alignment positions. We present here a data set comprising 95 chloroplast gene sequences as well as 2 introns and 7 intergenic transcribed spacers from the inverted repeat of cpDNA. Results of phylogenetic reconstruction based on this data set suggest that the placement of Ceratophyllum as a sister to the eudicots (Moore et al. 2007), as well as the well-publicised basal-most placement of Amborella among the extant angiosperms (most recently asserted by Jansen et al. 2007 in their phylogenomic study of cpDNA), are most likely artefacts caused by the presence of noisy data in the alignment.

Materials and Methods

Genome Sequencing
Fresh shoots of Ceratophyllum demersum were harvested from a plant grown at the Botanical Garden of the University of Jena, Germany.
Total DNA was extracted using the cetyltrimethylammonium bromide (CTAB)-based method (Murray and Thompson 1980) and purified with Qiagen columns according to the manufacturer's protocol (Qiagen, Valencia, CA). We employed a long-range polymerase chain reaction (PCR) strategy to cover the chloroplast genome with PCR products as previously described (Goremykin et al. 2003). To fill the gaps in the genomic assembly, we also developed a set of Ceratophyllum-specific primers. The resulting products were purified by electrophoresis through low-melting agarose gels. After digestion of the agarose with agarase, DNA in the resulting solution was directly subjected to fragmentation and subcloning employing the TOPO Shotgun Subcloning Kit (Invitrogen, Groningen, The Netherlands) according to the manufacturer's protocol. Recombinant plasmids were isolated from the clones using the Montage Plasmid Miniprep Kit (Millipore, Eschborn, Germany). The resulting plasmid DNA was prepared for sequence analysis with the Big Dye Terminator Sequencing Kit (Applied Biosystems, Foster City, CA) according to the manufacturer's protocol. Automated sequencing was performed on ABI 3100 (Applied Biosystems) sequencers. ABI reads were base-called with the PHRED program (Ewing and Green 1998). Sequence masking and assembly were performed with the STADEN package (Staden et al. 2000). At the first stage of the plastome amplification, the reads were accumulated until 8× coverage was achieved for all PCR fragments. At the second stage (closure of the remaining gaps by PCR), we accepted at least 3× coverage for smaller PCR products.

Results

Intraspecific Divergence Among Chloroplast Genomes
The cpDNA of C. demersum that we sequenced is a 156,177 bp-long circular molecule, 76 bases shorter than the previously published cpDNA from the North American specimen of this plant (Moore et al. 2007). The difference in size is caused by numerous indels concentrated in the noncoding regions of both plants.
In addition to the indels, these two genome sequences have 257 single nucleotide polymorphisms, which correspond to 1 mutation per 609 alignment positions and to an uncorrected p distance of 0.0016 substitutions/site (s/s). The two Ceratophyllum cpDNA sequences have no inversions with respect to each other and have the same gene content. The lengths of the Ceratophyllum cpDNAs and their gene content are typical for the plastomes of the dicotyledonous angiosperms, with the latter being, for instance, identical to that of Nymphaea alba (Goremykin et al. 2004). In addition to Ceratophyllum, chloroplast genomes from different specimens of the same species are currently available for two cultivars of Oryza sativa: indica (Tang et al. 2004) and japonica (Hiratsuka et al. 1989). Wishing to estimate intraspecific sequence divergence in rice, we manually aligned these 2 sequences. The resulting alignment contained 152 single nucleotide polymorphisms, which corresponds to 1 mutation per 886 alignment positions and an uncorrected p distance of 0.0011 s/s. The high number of substitutions observed between cpDNAs from the same species suggests that chloroplast genomes can be a useful tool for population genetics studies.
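The per-site figures in this section follow from simple counting: the uncorrected p distance is the number of differing sites divided by the number of sites compared, so 1 difference per 886 positions corresponds to roughly 0.0011 s/s. The sketch below illustrates that calculation in Python. It is an illustration only, not the procedure used in this study; the function name `p_distance` and the decision to skip gaps and ambiguity codes are assumptions made for the example.

```python
from typing import Tuple


def p_distance(seq_a: str, seq_b: str) -> Tuple[int, int, float]:
    """Return (differences, compared_sites, p) for two aligned sequences.

    Assumption for this sketch: only sites where both sequences carry an
    unambiguous base (A, C, G, T) are compared; gaps and ambiguity codes
    are skipped rather than counted as differences.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    differences = 0
    compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in valid or b not in valid:
            continue  # skip indel columns and ambiguous bases
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences, compared, differences / compared
```

Calling `p_distance` on two rows of a pairwise alignment returns the raw counts together with the proportion, so the "1 mutation per N positions" figure can be obtained as `compared / differences`. If indel columns are instead counted among the alignment positions, the denominator grows and the per-site value shrinks slightly.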

Phylogenetic Analyses
Sequences of the 61 protein-coding genes, 30 tRNA genes, and 4 rRNA genes, as well as those of the 7 spacers and 2 introns located in the most conserved part of the inverted repeat region of the cpDNA, were sampled from the annotated sequences of the publicly available chloroplast genomes as well as from our de novo sequenced cpDNA of C. demersum (EBI accession number AM71298). They were sorted into separate files for each individual gene and region. Files containing the protein-coding sequences were processed to produce alignments of all codon positions and of the first and second codon positions. Non-protein-coding sequences were aligned using CLUSTALW. These individual alignments were manually concatenated and edited to produce a 53,848 position-long alignment (referred to hereafter as alignment A) and its 39,22 position-long subset with no third-codon positions (alignment B). Phylogenetic trees were constructed employing PAUP* v.4.0b10 (Swofford 2002) and PHYML (Guindon and Gascuel 2003). We performed tests of model fit (the hierarchical likelihood ratio test [hLRT] and the Akaike Information Criterion [AIC]-based test) as implemented in Modeltest (Posada and Crandall 1998) on the A and B alignments and identified the base substitution model best describing our data (GTR + I + Γ in both cases). Using this model, we built maximum likelihood (ML) trees with PAUP*. To obtain bootstrap support values for the branches of these trees, we used the bootstrapping algorithm implemented in PHYML, employing the same model and the trees recovered with PAUP* as the input trees. We did this because bootstrapping with PAUP* would have taken a prohibitively long time. The ML tree built from alignment A is shown in Fig. 1. The topology obtained after the third positions were removed (alignment B) is highly similar to the tree presented in Fig. 1.
However, it supports a sister-group relationship of Amborella and Nymphaea (78/100 bootstrap proportion support) at the base of the tree instead of the basal-most placement of the former species among the extant angiosperms. Piper is not sister to Drimys, as was the case with alignment A, but forms a sister group to the cluster (Drimys [Calycanthus, Liriodendron]). The branching order of the other operational taxonomic units (OTUs) is the same in both ML trees.

Having obtained slightly different trees, we wished to determine which placement of Amborella was more trustworthy. Previously, we had globally deleted the third-codon positions from our genomic data sets because they, on average, exhibit much higher substitution rates than the first and second positions. Removal of the third-codon positions is a widespread practice because it is easy to accomplish using available programs such as PAUP*. However, some third-codon positions are constant or nearly so, and there is no reason to discard them. At the same time, the first- and second-codon positions also contain a (smaller) proportion of highly variable sites that arguably should be removed. A more objective but somewhat more complex way to deal with such instances of saturation is to measure variability directly at each alignment position and to discard only those positions affected by saturation. To do so we employed a character-sorting approach similar to the one we published previously (Goremykin et al. 1997). With the help of our Perl script (sorter.pl, available on request), we calculated p distances at each position of alignment A and then sorted the alignment positions in ascending order of the resulting values. The resulting alignment, which contains invariable positions to the left and the most divergent positions to the right, was subsequently shortened iteratively by 500 positions at a time from the right-hand side, producing a series of alignments with decreasing variability.
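The sorting and shortening steps described above can be sketched in Python as follows. This is not the authors' sorter.pl, whose details are not given here; it is a minimal illustration under stated assumptions: per-column variability is taken as the mean pairwise mismatch (p distance) among unambiguous bases in that column, gaps and ambiguity codes are ignored, and the function names `column_variability`, `sort_alignment`, and `shortened_series` are hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterator


def column_variability(column: str) -> float:
    """Mean pairwise mismatch among unambiguous bases in one alignment column.

    Assumption for this sketch: gaps and ambiguity codes are ignored.
    """
    bases = [c for c in column.upper() if c in "ACGT"]
    pairs = list(combinations(bases, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def sort_alignment(seqs: Dict[str, str]) -> Dict[str, str]:
    """Reorder columns so the least variable are on the left, the most on the right."""
    names = list(seqs)
    length = len(seqs[names[0]])
    if any(len(s) != length for s in seqs.values()):
        raise ValueError("all sequences must have the same aligned length")
    columns = ["".join(seqs[n][i] for n in names) for i in range(length)]
    # sorted() is stable, so columns with equal variability keep their original order
    order = sorted(range(length), key=lambda i: column_variability(columns[i]))
    return {n: "".join(seqs[n][i] for i in order) for n in names}


def shortened_series(sorted_seqs: Dict[str, str], step: int, count: int) -> Iterator[Dict[str, str]]:
    """Yield alignments with 0, step, 2*step, ... columns removed from the right-hand side."""
    length = len(next(iter(sorted_seqs.values())))
    for k in range(count):
        keep = length - k * step
        if keep <= 0:
            break
        yield {n: s[:keep] for n, s in sorted_seqs.items()}
```

Each alignment yielded by `shortened_series` would then be passed to model selection and tree building separately. The mean pairwise mismatch is only one possible per-site measure; a different definition of per-position distance would change the column order for sites of intermediate variability.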
We identified the best symmetric ML models for the sorted alignment A and its first 19 shortened subsets using Modeltest. GTR + I + Γ was chosen by both tests implemented in this program (hLRT and AIC) in all 20 cases. Employing the settings of these models, we built 20 ML trees with PAUP*, imported the resulting trees into PHYML to be used as starting trees, and performed bootstrap analyses with PHYML under the same model. The results of these experiments are presented in Fig. 2. One can see that removal of the 500 most variable positions (<1% of the total data length) from alignment A leads to the loss of support for the basal-most position of Amborella within the angiosperms. Removal of the 1,000 most divergent positions results in Ceratophyllum assuming the sister-group position to the branch bearing eudicots and monocots, and removal of 2,500 positions shifts the branch subtending Ceratophyllum further down the tree to the base of the cluster uniting four magnoliid species. An example of this topology is presented in Fig. 3. Further changes in tree topology do not occur until a total of 5,500 positions are removed. The noneudicot parts of the trees built from the subsets of alignment A with 5,500 to 8,000 of the most divergent positions removed have the same topology as the tree in Fig. 3. The eudicot clusters on these trees contain unresolved branches with zero lengths. Further decrease of variability results in disintegration of the monocot and dicot clusters. Ceratophyllum becomes a sister group to Phalaenopsis and Acorus to Spinacia. There are numerous zero-length branches on these trees.

Fig. 1 Tree obtained in ML analyses of alignment A. The numbers next to the branches indicate bootstrap support

Fig. 2 Bootstrap support for the various placements of Amborella and Ceratophyllum in the ML trees built on the basis of sorted alignment A and its 19 subsets with decreased variability. The numbers below the graph indicate alignment lengths (43,863 to 53,363 in steps of 500). The bootstrap values supporting the branch subtending all angiosperms, except Nymphaea and Amborella, were approximately 100% throughout the variability removal process and are not shown. Plotted series: Amborella+outgroup, Amborella+Nymphaea, Ceratophyllum+eudicots, Ceratophyllum+monocots+eudicots, Ceratophyllum+magnoliids, Ceratophyllum+Phalaenopsis

To estimate the sequence divergence level within the subset of alignment A comprising the 500 most variable positions, we used the following procedure in PAUP*: (1) used Modeltest to find the optimal model for this subset, (2) imported the settings of the best model into PAUP* using the lset command (Swofford 2002), (3) set the distance to ml (PAUP* command: dset distance = ml;), (4) constructed a neighbor-joining (NJ) tree (PAUP* command: nj;), and (5) exported the distance matrix used to build the tree into a file (500.matrix, supplementary materials) using the savedist command (Swofford 2002). The mean distance in this matrix is 43.23 s/s. The mean distance among the angiosperms in this matrix is 36.6 s/s.

We wished to see whether there is any correlation between the distances from Amborella to the other species calculated on the basis of the 500-position subset of the most divergent positions and those calculated from the rest of the alignment. We reproduced the distance matrices using Tree-Puzzle v. 5.2 (Strimmer and von Haeseler 1996), each time setting the rate parameters of the GTR + Γ model to those calculated with PAUP*. This was done in order to be able to see the part of the graphs depicting distances lower than 9 s/s. PAUP* matrices contained distances higher than 1 s/s, so the part of the graph below 1 s/s would appear flattened. Tree-Puzzle sets very high (and therefore very unreliable) distances to approximately 9 s/s. The resulting dot plot is presented in Fig. 4a. There is no visible correlation between the distances estimated from the 500-position subset and those from the rest of alignment A.

Fig. 3 Tree topology obtained in ML analyses of the subsets of alignment A with the 2,000 to 5,000 most variable positions removed. The tree presented was obtained on the basis of a 48,363 position-long subset

With the subset comprising the 1,000 most variable positions, PAUP* could not build an NJ tree using the model settings suggested by Modeltest; therefore we followed an example in the PAUP* command reference manual (Swofford 2002, p. 57) to fit GTR + I + Γ to these data. We exported the distance matrix into a file in NEXUS format (1000.matrix, supplementary materials). The mean distance in this matrix is 29.79 s/s. The mean distance among the angiosperm OTUs in this matrix is 2.26 s/s. The dot plot depicting correlations in the distances between Ceratophyllum and other species calculated from the 1,000-position subset and the rest of alignment A is presented in Fig. 4b. The distribution of distance pairs in Fig. 4b is also nonlinear.

Distance calculation for the subset of the 2,500 most variable positions was conducted as in the case of the 500-position subset. The mean distance in this matrix (2500.matrix, supplementary materials) is 1.6 s/s. The mean distance among the angiosperm OTUs is 1.3 s/s. The dot plot depicting correlations in the distances between Ceratophyllum and other species calculated from the 2,500-position subset and the rest of alignment A is shown in Fig. 4c. The distribution of distance pairs in Fig. 4c becomes less broad.

Discussion
Previously we reported that the choice of substitution model strongly affects the results of the ML inference of the phylogenetic relations among the major angiosperm lineages (Goremykin et al. 2005; Goremykin and Hellwig 2006).
The results presented here demonstrate that careful choice of the ML model alone is not enough. We observed that the presence of a small proportion of highly variable positions in the alignment alters the structure of the angiosperm subtree. We therefore suggest, in the absence of better base substitution models, discarding a certain proportion of the most variable sites from the alignment to avoid potential errors (e.g., Felsenstein 1978; Bergsten 2005; Jeffroy et al. 2006) in phylogeny reconstruction. In our previous studies, we chose to simply remove the divergent third-codon positions from analysis to accomplish this. However, this did not become a widespread practice, perhaps because of recommendations by Stefanovic et al. (2004) and Leebens-Mack et al. (2005). These investigators advocated inclusion of the third-codon positions in phylogenomic analyses of angiosperm evolution based on cpDNA, citing the insignificant changes that these positions introduced to their trees.

Fig. 4 Correlation in distances in the subsets of alignment A included in and excluded from the analyses. a Dot plot of the distances from Amborella to other species (axes: ML distances in the 500-position subset of the most divergent positions versus ML distances in alignment A shortened by 500 positions); 12 out of 33 distances have the maximum value. b Dot plot of the distances from Ceratophyllum to other species (axes: ML distances in the 1,000-position subset of the most divergent positions versus ML distances in alignment A shortened by 1,000 positions). c Dot plot of the distances from Ceratophyllum to other species (axes: ML distances in the 2,500-position subset of the most divergent positions versus ML distances in alignment A shortened by 2,500 positions)
Here we observed a change caused by inclusion of the variable third-codon positions, which produces the canonical basal-most placement of Amborella reported in a large number of papers (Mathews and Donoghue 1999, 2000; Parkinson et al. 1999; Qiu et al. 1999, 2000, 2005; Soltis et al. 1999, 2000a, b; Barkman et al. 2000; Borsch et al. 2003; Hilu et al. 2003; Stefanovic et al. 2004; Leebens-Mack et al. 2005; Jansen et al. 2007; Moore et al. 2007). At the same time, the alignment with the third-codon positions removed rather strongly supports a sister-group relation of Amborella and Nymphaea. To appraise which placement of Amborella is more trustworthy, we sorted the characters in the 53,363 position-long, genome-scale alignment according to their variability and repeatedly shortened the sorted alignment from its most divergent end, producing a series of subsets. We then built trees from these subsets. This procedure allowed us to directly observe the influence of the most variable characters on the results of phylogeny reconstruction. We observed that removal of just 500 of the most variable positions from the alignment led to the disappearance of the Amborella-basal topology. This 500-position subset exhibits a mean distance of >30 s/s in comparisons among angiosperms and cannot be expected to bear witness to the evolutionary events that happened during the primary radiation of this plant group. The 500-position subset shows no traces of similarity between the Pinus OTU and the angiosperm OTUs. Its removal is therefore justified. Equally justified, therefore, is revision of the assertion that plastid genomic data unequivocally support Amborella as the sole sister group of the remaining angiosperms (Jansen et al. 2007). Similarly, the sister-group relation between Ceratophyllum and eudicots, recently reported by Moore et al. (2007), depends on the presence of the 1,000 most divergent positions in alignment A.
This 1,000-position alignment subset also exhibits a very high saturation level and is best omitted from this taxonomic level of analysis unless a good case can be made in the future as to why it should be used for deep angiosperm phylogeny. The choice between the two further alternative placements of Ceratophyllum, as a sister group to the clade subtending eudicots plus monocots (which is supported by subsets of alignment A with 1,000 to 2,000 positions removed) or as a sister to the clade ([Calycanthus, Liriodendron] [Drimys, Piper]) (which is supported by alignments with 2,500 to 8,000 positions removed), is more difficult to make. The mean distance among the angiosperms in the subset of the 2,500 most divergent positions is 1.3 s/s. The substitution paths at this divergence level might or might not be well described by the general time-reversible family of substitution models. To decide between these placements with confidence, we would need to apply some objective stopping criterion for the removal of moderately saturated sites, derived from the comparative performance of different substitution models applied to the different (slightly modified) data. Such a methodology is not currently available; however, we are working in this area. Strong and consistent bootstrap support for the placement of Ceratophyllum as a sister to magnoliids, through a long phase of variability decrease until the stage characterised by apparent loss of phylogenetic signal, can be interpreted in favour of a magnoliid affinity of this species.

References
Barkman TJ, Chenery G, McNeal JR, Lyons-Weile J, Ellisens WJ, Moore G, Wolfe AD, dePamphilis CW (2000) Independent and combined analyses of sequences from all three genomic compartments converge on the root of flowering plant phylogeny. Proc Natl Acad Sci USA 97:13166–13171
Bergsten J (2005) A review of long-branch attraction. Cladistics 21:163–193
Borsch T, Hilu KW, Quandt D, Wilde V, Neinhuis C, Barthlott W (2003) Non-coding plastid trnT-trnF sequences reveal a well resolved phylogeny of basal angiosperms. J Evol Biol 16:558–576
Crane PR, Friis EM, Pedersen KR (1995) The origin and early diversification of angiosperms. Nature 374:27–33
Ewing B, Green P (1998) Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res 8:186–194
Felsenstein J (1978) Cases in which parsimony or compatibility methods will be positively misleading. Syst Zool 27:401–410
Goremykin V, Hansmann S, Martin W (1997) Evolutionary analysis of 58 proteins encoded in six completely sequenced chloroplast genomes: revised molecular estimates of two seed plant divergence times. Plant Syst Evol 206:337–351
Goremykin VV, Holland B, Hirsch-Ernst KI, Hellwig FH (2003) The chloroplast genome of the basal angiosperm Calycanthus fertilis: structural and phylogenetic analyses. Plant Syst Evol 242:119–135
Goremykin VV, Hirsch-Ernst KI, Wolfl S, Hellwig FH (2004) The chloroplast genome of Nymphaea alba: whole-genome analyses and the problem of identifying the most basal angiosperm. Mol Biol Evol 21:1445–1454
Goremykin VV, Holland B, Hirsch-Ernst KI, Hellwig FH (2005) Analysis of Acorus calamus chloroplast genome and its phylogenetic implications. Mol Biol Evol 22:1813–1822
Goremykin VV, Hellwig FH (2006) A new test of phylogenetic model fitness addresses the issue of the basal angiosperm phylogeny. Gene 381:81–91
Guindon S, Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52:696–704
Hilu KW, Borsch T, Muller K, Soltis DE, Soltis PS, Savolainen V, Chase MW, Powell M, Alice L, Evans R et al (2003) Angiosperm phylogeny based on matK sequence information. Am J Bot 90:1758–1776
Hiratsuka J, Shimada H, Whittier R, Ishibashi T, Sakamoto M, Mori M, Kondo C, Honji Y, Sun CR, Meng BY et al (1989) The complete sequence of the rice (Oryza sativa) chloroplast genome: intermolecular recombination between distinct tRNA genes accounts for a major plastid DNA inversion during the evolution of the cereals. Mol Gen Genet 217:185–194
Jansen RK, Cai Z, Raubeson LA, Daniell H, dePamphilis CW, Leebens-Mack J, Muller KF, Guisinger-Bellian M, Haberle RC, Hansen AK et al (2007) Analysis of 81 genes from 64 plastid genomes resolves relationships in angiosperms and identifies genome-scale evolutionary patterns. Proc Natl Acad Sci USA 104:19369–19374
Jeffroy O, Brinkmann H, Delsuc F, Philippe H (2006) Phylogenomics: the beginning of incongruence? Trends Genet 22:225–231
Leebens-Mack J, Raubeson LA, Cui LY, Kuehl JV, Fourcade MH, Chumley TW, Boore JL, Jansen RK, dePamphilis CW (2005) Identifying the basal angiosperm node in chloroplast genome phylogenies: sampling one's way out of the Felsenstein zone. Mol Biol Evol 22:1948–1963
Mathews S, Donoghue MJ (1999) The root of angiosperm phylogeny inferred from duplicate phytochrome genes. Science 286:947–950
Mathews S, Donoghue MJ (2000) Basal angiosperm phylogeny inferred from duplicate phytochromes A and C. Int J Plant Sci 161(Suppl):S41–S55
Moore MJ, Bell CD, Soltis PS, Soltis DE (2007) Using plastid genome-scale data to resolve enigmatic relationships among basal angiosperms. Proc Natl Acad Sci USA 104:19363–19368
Murray MG, Thompson WF (1980) Rapid isolation of high molecular weight DNA. Nucleic Acids Res 8:4321–4325
Parkinson CL, Adams KL, Palmer JD (1999) Multigene analyses identify the three earliest lineages of extant flowering plants. Curr Biol 9:1485–1488
Posada D, Crandall KA (1998) Modeltest: testing the model of DNA substitution. Bioinformatics 14:817–818
Qiu Y-L, Lee J, Bernasconi-Quadroni F, Soltis DE, Soltis PS, Zanis M, Zimmer EA, Chen Z, Savolainen V, Chase MW (1999) The earliest angiosperms: evidence from mitochondrial, plastid and nuclear genomes. Nature 402:404–407
Qiu Y-L, Lee J, Bernasconi-Quadroni F, Soltis DE, Soltis PS, Zanis M, Zimmer EA, Chen Z, Savolainen V, Chase MW (2000) Phylogeny of basal angiosperms: analyses of five genes from three genomes. Int J Plant Sci 161(Suppl):S3–S27
Qiu Y-L, Dombrovska O, Lee J, Li L, Whitlock BA, Bernasconi-Quadroni F, Rest JS, Davis CC, Borsch T, Hilu KW et al (2005) Phylogenetic analyses of basal angiosperms based on nine plastid, mitochondrial, and nuclear genes. Int J Plant Sci 166:815–842
Soltis PS, Soltis DE, Chase MW (1999) Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology. Nature 402:402–403
Soltis PS, Soltis DE, Zanis MJ, Kim S (2000a) Basal lineages of angiosperms: relationships and implications for floral evolution. Int J Plant Sci 161(Suppl):S97–S107

Soltis DE, Soltis PS, Chase MW, Mort ME, Albach DC, Zanis M, Savolainen V, Hahn WH, Hoot SB, Fay MF et al (2000b) Angiosperm phylogeny inferred from 18S rDNA, rbcL, and atpB sequences. Bot J Linn Soc 133:381–461
Soltis DE, Albert VA, Savolainen V, Hilu K, Qiu Y-L, Chase MW, Farris JS, Stefanovic S, Rice DW, Palmer JD, Soltis PS (2004) Genome-scale data, angiosperm relationships, and "ending incongruence": a cautionary tale in phylogenetics. Trends Plant Sci 9:477–483
Staden R, Beal KF, Bonfield JK (2000) The Staden package 1998. Meth Mol Biol 132:115–130
Stefanovic S, Rice DW, Palmer JD (2004) Long branch attraction, taxon sampling, and the earliest angiosperms: Amborella or monocots? BMC Evol Biol 4:35
Strimmer K, von Haeseler A (1996) Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol Biol Evol 13:964–969
Swofford DL (2002) PAUP*: phylogenetic analysis using parsimony (* and other methods). Version 4. Sinauer, Sunderland
Tang J, Xia H, Cao M, Zhang X, Zeng W, Hu S, Tong W, Wang J, Wang J, Yu J, Yang H, Zhu Z (2004) A comparison of rice chloroplast genomes. Plant Physiol 135:412–420