Removal of Noisy Characters from Chloroplast Genome-Scale Data Suggests Revision of Phylogenetic Placements of Amborella and Ceratophyllum

J Mol Evol (2009) 68:197-204
DOI 10.1007/s00239-009-9206-9

Vadim V. Goremykin, Roberto Viola, Frank H. Hellwig

Received: 29 June 2008 / Accepted: 29 January 2009 / Published online: 27 February 2009
© Springer Science+Business Media, LLC 2009

V. V. Goremykin (corresponding author), R. Viola: IASMA Research Center, Via E. Mach 1, 38010 San Michele all'Adige, TN, Italy. E-mail: vadim.goremykin@iasma.it

F. H. Hellwig: Institut für Spezielle Botanik, Universität Jena, Philosophenweg 16, 07743 Jena, Germany

Abstract

It is widely appreciated that noisy, highly variable data can impede phylogeny reconstruction. Researchers have long omitted problematic data, such as third-codon positions and variable regions, from phylogenetic analyses. In analyses of the phylogenetic relations of the angiosperms, however, inclusion of complete gene sequences in genome-scale alignments has become common practice. Here we demonstrate that this practice can be misleading. We show that support for the basal-most position of Amborella trichopoda among the angiosperms in the chloroplast genomic data is based only on a tiny subset (<1% of the total alignment length) of the most variable positions in the alignment, which exhibit a mean maximum likelihood (ML) distance among the angiosperm operational taxonomic units (OTUs) of approximately 36 substitutions/site. Exclusion of these positions leads to disappearance of the basal Amborella branch. Likewise, the recently reported sister-group relationship of Ceratophyllum to the eudicots depends on the presence of the 2% most variable positions in the genomic alignment, which exhibit, on average, 20 substitutions/site in comparisons among the angiosperm OTUs. These observations highlight the need to exclude a certain proportion of saturated alignment positions from phylogenomic analyses.

Keywords: Chloroplast genomes, Molecular evolution, Angiosperm diversification

Introduction

The first years of plant phylogenomics demonstrated that amassing a large number of characters is not sufficient to ensure the accuracy of phylogenetic inference (Soltis et al. 2004; Goremykin et al. 2005). We also observed that systematic errors inherent in the current methods of phylogeny reconstruction may lead to spurious results that are nonetheless strongly supported by the nonparametric bootstrap (Goremykin and Hellwig 2006). Recently, Jeffroy et al. (2006) made the interesting observation that a more reliable topology might be obtained with a worse-performing method applied to data with less saturation than with a more exact method applied to data with high saturation. In particular, they observed that removal of the fast-evolving positions makes the results of phylogeny reconstruction much less affected by nonphylogenetic (in particular compositional) signals and, consequently, recommended the voluntary discarding of part of the data from a phylogenetic analysis.

In this study, we wished to check whether such a decrease in variability could provide new insights into the early diversification of angiosperms, an area of considerable recent debate (Stefanovic et al. 2004; Soltis et al. 2004; Goremykin et al. 2005; Leebens-Mack et al. 2005; Goremykin and Hellwig 2006; Jansen et al. 2007; Moore et al. 2007). We hypothesized that the proportion of data still bearing witness to the old diversification events might be low in this case, given the relatively short geologic time span for the origin of the major angiosperm groups compared with the long time that has passed since then (Crane et al. 1995). Attempts to resolve this old radiation are further complicated by massive extinction in certain angiosperm lines, which left a number of highly isolated taxa subtended by long branches at the base of the angiosperm tree (Amborellales, Ceratophyllales, and Nymphaeales). Because of the superimposed mutations, various attraction artefacts can be deemed quite probable under such conditions. Because nothing can be done to improve taxon sampling in the vicinity of such isolated angiosperm lines, removal of the distorting nonphylogenetic signal (i.e., noise) is the only currently available way to improve phylogenetic reconstruction for these taxa.

Previously we routinely excluded the highly divergent third-codon positions from our genomic analyses because of these concerns. Recently, Stefanovic et al. (2004) and Leebens-Mack et al. (2005) reported that the third-codon positions should be included in phylogeny reconstruction studies on basal angiosperms. Following their suggestion, this time we present the results obtained with and without the third-codon positions as well as the results obtained after discarding a small proportion of the saturated alignment positions. We present here a data set comprising 95 chloroplast gene sequences as well as 2 introns and 7 intergenic transcribed spacers from the inverted repeat of cpDNA. Results of phylogenetic reconstruction based on this data set suggest that the placement of Ceratophyllum as sister to the eudicots (Moore et al. 2007), as well as the well-publicised basal-most placement of Amborella among the extant angiosperms (most recently asserted by Jansen et al. 2007 in their phylogenomic study of cpDNA), are most likely artefacts caused by the presence of noisy data in the alignment.

Materials and Methods

Genome Sequencing

Fresh shoots of Ceratophyllum demersum were harvested from a plant grown at the Botanical Garden of the University of Jena, Germany.
Total DNA was extracted using the cetyltrimethylammonium bromide-based method (Murray and Thompson 1980) and purified with Qiagen columns according to the manufacturer's protocol (Qiagen, Valencia, CA). We employed a long-range polymerase chain reaction (PCR) strategy to cover the chloroplast genome with PCR products as previously described (Goremykin et al. 2003). To fill the gaps in the genomic assembly, we also developed a set of Ceratophyllum-specific primers. The resulting products were purified by electrophoresis through low-melting agarose gels. After digestion of the agarose with agarase, the DNA in the resulting solution was directly subjected to fragmentation and subcloning with the TOPO Shotgun Subcloning Kit (Invitrogen, Groningen, The Netherlands) according to the manufacturer's protocol. Recombinant plasmids were isolated from the clones using the Montage Plasmid Miniprep Kit (Millipore, Eschborn, Germany). The resulting plasmid DNA was prepared for sequence analysis with the Big Dye Terminator Sequencing Kit (Applied Biosystems, Foster City, CA) according to the manufacturer's protocol. Automated sequencing was performed on ABI 31 (Applied Biosystems) sequencers. ABI reads were base-called with the PHRED program (Ewing and Green 1998). Sequence masking and assembly were performed with the STADEN package (Staden et al. 2000). At the first stage of the plastome amplification, reads were accumulated until 8x coverage was achieved for all PCR fragments. At the second stage (closure of the remaining gaps by PCR), we accepted at least 3x coverage for smaller PCR products.

Results

Intraspecific Divergence Among Chloroplast Genomes

The cpDNA of C. demersum that we sequenced is a 156,177 bp-long circular molecule, 76 bases shorter than the previously published cpDNA from a North American specimen of this plant (Moore et al. 2007). The difference in size is caused by numerous indels concentrated in the noncoding regions of both plants.
In addition to the indels, these two genome sequences differ by 257 single nucleotide polymorphisms, which corresponds to 1 mutation per 609 alignment positions and to an uncorrected p distance of 0.0016 substitutions/site (s/s). The two Ceratophyllum cpDNA sequences have no inversions with respect to each other and have the same gene content. The lengths of the Ceratophyllum cpDNAs and their gene content are typical for the plastomes of the dicotyledonous angiosperms, with the latter being, for instance, identical to that of Nymphaea alba (Goremykin et al. 2004). In addition to Ceratophyllum, chloroplast genomes from different specimens of the same species are currently available for two cultivars of Oryza sativa: indica (Tang et al. 2004) and japonica (Hiratsuka et al. 1989). To estimate intraspecific sequence divergence in rice, we manually aligned these two sequences. The resulting alignment contained 152 single nucleotide polymorphisms, which corresponds to 1 mutation per 886 alignment positions and an uncorrected p distance of 0.0011 s/s. The high number of substitutions observed between cpDNAs from the same species suggests that chloroplast genomes can be a useful tool for population genetics studies.
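The uncorrected p distance reported above is simply the number of mismatching sites divided by the number of compared sites. A minimal Python sketch of that calculation follows; the function name `p_distance` is hypothetical, and skipping gaps and ambiguity codes is an assumption made here so that indels are not counted as substitutions (the paper does not state how such columns were treated).

```python
from __future__ import annotations


def p_distance(seq_a: str, seq_b: str) -> tuple[int, int, float]:
    """Return (differences, compared sites, uncorrected p distance).

    Both inputs must be rows of the same pairwise alignment.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    valid = set("ACGT")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: columns with a gap or ambiguity code are skipped,
        # so indels do not contribute to the substitution count.
        if a not in valid or b not in valid:
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences, compared, differences / compared


if __name__ == "__main__":
    # Toy alignment: one gap column and one substitution.
    diffs, sites, p = p_distance("ACGT-ACGTA", "ACGTTACCTA")
    print(diffs, sites, round(p, 4))
```

The reciprocal of the returned distance gives the "one mutation per N alignment positions" figure used in the text.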

Phylogenetic Analyses

Sequences of the 61 protein-coding genes, 30 tRNA genes, and 4 rRNA genes, as well as those of the 7 spacers and 2 introns located in the most conserved part of the inverted repeat region of the cpDNA, were sampled from the annotated sequences of the publicly available chloroplast genomes as well as from our de novo sequenced cpDNA of C. demersum (EBI accession number AM71298). They were sorted into separate files for each individual gene and region. Files containing the protein-coding sequences were processed to produce alignments of all codon positions and of the first and second codon positions only. Non-protein-coding sequences were aligned using CLUSTALW. These individual alignments were manually concatenated and edited to produce a 53,848 position-long alignment (referred to hereafter as alignment A) and its 39,22 position-long subset with no third-codon positions (alignment B).

Phylogenetic trees were constructed with PAUP* v.4.0b10 (Swofford 2002) and PHYML (Guindon and Gascuel 2003). We performed tests of model fit (the hierarchical likelihood ratio test [hLRT] and the Akaike Information Criterion-based test [AIC]) as implemented in Modeltest (Posada and Crandall 1998) on alignments A and B and identified the base substitution model best describing our data (GTR + I + Γ in both test cases). Using this model, we built maximum likelihood (ML) trees with PAUP*. To obtain bootstrap support values for the branches of these trees, we used the bootstrapping algorithm implemented in PHYML, employing the same model and the trees previously recovered with PAUP* as the input trees. We did this because it would take a prohibitively long time to perform the bootstrap with PAUP*. The ML tree built from alignment A is shown in Fig. 1. The topology obtained after the third positions were removed (alignment B) is highly similar to the tree presented in Fig. 1.
However, it supports a sister-group relationship of Amborella and Nymphaea (78/100 bootstrap proportion support) at the base of the tree instead of the basal-most placement of the former species among the extant angiosperms. Piper is not sister to Drimys, as was the case with alignment A, but forms a sister group to the cluster (Drimys [Calycanthus, Liriodendron]). The branching order of the other operational taxonomic units (OTUs) is the same in both ML trees.

Having obtained slightly different trees, we wished to determine which placement of Amborella was more trustworthy. Previously, we had globally deleted the third-codon positions from our genomic data sets because they, on average, exhibit much higher substitution rates than the first and second positions. Removal of the third-codon positions is a widespread practice because it is easy to accomplish using available programs such as PAUP*. However, some third-codon positions are constant or nearly so, and there is no reason to discard them. At the same time, the first- and second-codon positions also contain a certain (smaller) proportion of highly variable sites that arguably must be removed. A more objective but somewhat more complex way to deal with such instances of saturation is to measure variability directly at each alignment position and to discard only those positions affected by saturation. To do so we employed a character-sorting approach similar to the one we published previously (Goremykin et al. 1997). With the help of our Perl script (sorter.pl, available on request), we calculated p distances at each position of alignment A and then sorted the alignment positions in ascending order of the resulting values. The resulting alignment, which contains invariable positions to the left and the most divergent positions to the right, was subsequently iteratively shortened by 500 positions from the right-hand side, producing a series of alignments with decreased variability.
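The character-sorting step described above can be sketched in Python as follows. This is not the authors' sorter.pl, which is not reproduced in the text; it is a minimal illustration under stated assumptions. The per-column variability measure (the proportion of taxon pairs that differ, ignoring gaps and ambiguity codes) is assumed here as one reasonable reading of "p distances at each position", and the names `site_variability`, `sort_columns`, and `shortened_series` are hypothetical.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterator


def site_variability(column: str) -> float:
    """Proportion of taxon pairs that differ at one alignment column."""
    counts = Counter(c for c in column.upper() if c in "ACGT")
    n = sum(counts.values())
    total_pairs = n * (n - 1) // 2
    if total_pairs == 0:
        return 0.0
    # Pairs sharing the same base are subtracted from all possible pairs.
    same_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return (total_pairs - same_pairs) / total_pairs


def sort_columns(alignment: dict[str, str]) -> dict[str, str]:
    """Reorder columns so the least variable come first (stable sort)."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all sequences must have the same aligned length")
    columns = ["".join(alignment[name][i] for name in names) for i in range(length)]
    order = sorted(range(length), key=lambda i: site_variability(columns[i]))
    return {name: "".join(alignment[name][i] for i in order) for name in names}


def shortened_series(
    sorted_alignment: dict[str, str], step: int, count: int
) -> Iterator[dict[str, str]]:
    """Yield the sorted alignment, then `count` subsets, each `step` columns shorter."""
    length = len(next(iter(sorted_alignment.values())))
    for k in range(count + 1):
        keep = length - k * step
        if keep <= 0:
            break
        # Truncating from the right removes the most variable columns first.
        yield {name: seq[:keep] for name, seq in sorted_alignment.items()}


if __name__ == "__main__":
    toy = {"t1": "AAGA", "t2": "ATGC", "t3": "AAGG"}
    for subset in shortened_series(sort_columns(toy), step=1, count=2):
        print(subset)
```

With a real alignment, `step` would be the number of columns removed per iteration and `count` the number of shortened subsets; each yielded subset would then be passed to tree-building software separately.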
We identified the best symmetric ML models for the sorted alignment A and its first 19 shortened subsets using Modeltest. GTR + I + Γ was chosen by both tests implemented in this program (hLRT and AIC) in all 20 cases. Employing the settings of these models, we built 20 ML trees with PAUP*, imported the resulting trees into PHYML to be used as starting trees, and performed the bootstrap with PHYML under the same model. The results of these experiments are presented in Fig. 2. One can see that removal of the 500 most variable positions (<1% of the total data length) from alignment A leads to the loss of support for the basal-most position of Amborella within the angiosperms. Removal of the 1000 most divergent positions results in Ceratophyllum assuming the sister-group position to the branch bearing eudicots and monocots, and removal of 2500 positions shifts the branch subtending Ceratophyllum further down the tree to the base of the cluster uniting four magnoliid species. An example of this topology is presented in Fig. 3. Further changes in tree topology do not occur until a total of 5500 positions are removed. The noneudicot parts of the trees built from the subsets of alignment A with 5500 to 8000 of the most divergent positions removed have the same topology as the tree in Fig. 3. The eudicot clusters on these trees contain unresolved branches with zero lengths. Further decrease of variability results in disintegration of the monocot and dicot clusters: Ceratophyllum becomes a sister group to Phalaenopsis, and Acorus to Spinacia. There are numerous zero-length branches on these trees.

Fig. 1 Tree obtained in ML analyses of alignment A. The numbers next to the branches indicate bootstrap support

Fig. 2 Bootstrap support for the various placements of Amborella and Ceratophyllum in the ML trees built on the basis of sorted alignment A and its 19 subsets with decreased variability (y-axis: bootstrap values; x-axis: lengths of alignments, from 43,863 to 53,363 positions in steps of 500). Plotted groupings: Amborella+outgroup, Amborella+Nymphaea, Ceratophyllum+eudicots, Ceratophyllum+magnoliids, Ceratophyllum+monocots+eudicots, Ceratophyllum+Phalaenopsis. The numbers below the graph indicate alignment lengths. The bootstrap values supporting the branch subtending all angiosperms except Nymphaea and Amborella were approximately 100% throughout the variability removal process and are not shown

To estimate the sequence divergence level within the subset of alignment A comprising the 500 most variable positions, we used the following procedure in PAUP*: (1) used Modeltest to find the optimal model for this subset, (2) imported the settings of the best model into PAUP* using the lset command (Swofford 2002), (3) set the distance to ML (PAUP* command: dset distance = ml;), (4) constructed a neighbor-joining (NJ) tree (PAUP* command: nj;), and (5) exported the distance matrix used to build the tree into a file (500.matrix, supplementary materials) using the savedist command (Swofford 2002). The mean distance in this matrix is 43.23 s/s. The mean distance among the angiosperms in this matrix is 36.6 s/s. We wished to see whether there is any correlation between the distances from Amborella to the other species calculated from the 500-position subset of the most divergent positions and those calculated from the rest of the alignment. We reproduced the distance matrices using Tree-Puzzle v. 5.2 (Strimmer and von Haeseler 1996), each time setting the rate parameters of the GTR + G model to those calculated with PAUP*. This was done in order to be able to see the part of the graphs depicting distances lower than 9 s/s.
PAUP* matrices contained distances higher than 1 s/s, so the part of the graph below 1 s/s would appear flattened. Tree-Puzzle sets very high (and therefore very unreliable) distances to approximately 9 s/s. The resulting dot plot is presented in Fig. 4a. There is no visible correlation between the distances estimated from the 500-position subset and those estimated from the rest of alignment A.

Fig. 3 Tree topology obtained in ML analyses of the subsets of alignment A with the 2000 to 5000 most variable positions removed. The tree presented was obtained on the basis of a 48,363 position-long subset

With the subset comprising the 1000 most variable positions, PAUP* could not build an NJ tree using the model settings suggested by Modeltest; we therefore followed an example in the PAUP* command reference manual (Swofford 2002, p. 57) to fit GTR + I + G to these data. We exported the distance matrix into a file in NEXUS format (1000.matrix, supplementary materials). The mean distance in this matrix is 29.79 s/s. The mean distance among the angiosperm OTUs in this matrix is 20.26 s/s. The dot plot depicting correlations between the distances from Ceratophyllum to the other species calculated from the 1000-position subset and from the rest of alignment A is presented in Fig. 4b. The distribution of distance pairs in Fig. 4b is also nonlinear.

Distance calculation for the subset of the 2500 most variable positions was conducted as for the 500-position subset. The mean distance in this matrix (2500.matrix, supplementary materials) is 1.6 s/s. The mean distance among the angiosperm OTUs is 1.3 s/s. The dot plot depicting correlations between the distances from Ceratophyllum to the other species calculated from the 2500-position subset and from the rest of alignment A is shown in Fig. 4c. The distribution of distance pairs in Fig. 4c becomes less broad.

Discussion

Previously we reported that the choice of substitution model strongly affects the results of the ML inference of the phylogenetic relations among the major angiosperm lineages (Goremykin et al. 2005; Goremykin and Hellwig 2006).
The results presented here demonstrate that careful choice of the ML model alone is not enough. We observed that the presence of a small proportion of highly variable positions in the alignment alters the structure of the angiosperm subtree. We therefore suggest, in the absence of better base substitution models, discarding a certain proportion of the most variable sites from the alignment to avoid potential errors (e.g., Felsenstein 1978; Bergsten 2005; Jeffroy et al. 2006) in phylogeny reconstruction. In our previous studies, we chose simply to remove the divergent third-codon positions from the analysis to accomplish this. However, this did not become a widespread practice, perhaps because of recommendations by Stefanovic et al. (2004) and Leebens-Mack et al. (2005). These investigators advocated inclusion of the third-codon positions in phylogenomic analyses of angiosperm evolution based on cpDNA, citing the insignificant changes that these positions introduced to their trees.

Fig. 4 Correlation between distances in the subsets of alignment A included in and excluded from the analyses. a Dot plot of the distances between Amborella and other species calculated from the 500-position subset of the most divergent positions (y-axis: ML distances in 500-position subset) and from the rest of alignment A (x-axis: ML distances in alignment A shortened by 500 positions); 12 out of 33 distances have the maximum value. b Dot plot of the distances between Ceratophyllum and other species calculated from the 1000-position subset of the most divergent positions (y-axis) and from the rest of alignment A (x-axis: ML distances in alignment A shortened by 1000 positions). c Dot plot of the distances between Ceratophyllum and other species calculated from the 2500-position subset of the most divergent positions (y-axis) and from the rest of alignment A (x-axis: ML distances in alignment A shortened by 2500 positions)
Here we observed a change caused by inclusion of the variable third-codon positions: it produces the canonical basal-most placement of Amborella reported in a large number of papers (Mathews and Donoghue 1999, 2000; Parkinson et al. 1999; Qiu et al. 1999, 2000, 2005; Soltis et al. 1999, 2000a, b; Barkman et al. 2000; Borsch et al. 2003; Hilu et al. 2003; Stefanovic et al. 2004; Leebens-Mack et al. 2005; Jansen et al. 2007; Moore et al. 2007). At the same time, the alignment with the third-codon positions removed rather strongly supports a sister-group relation of Amborella and Nymphaea. To appraise which placement of Amborella is more trustworthy, we sorted the characters in the 53,363 position-long, genome-scale alignment according to their variability and repeatedly shortened the sorted alignment from its most divergent end, producing a series of subsets. Then we built trees from the sequences of these subsets. This procedure allowed us to observe directly the influence of the most variable characters on the results of phylogeny reconstruction. We observed that removal of just the 500 most variable positions from the alignment led to the disappearance of the Amborella-basal topology. This 500-position subset exhibits a mean distance >30 s/s in comparisons among angiosperms and cannot be expected to bear witness to the evolutionary events that happened during the primary radiation of this plant group. The 500-position subset shows no traces of similarity between the Pinus OTU and the angiosperm OTUs. Its removal is therefore justified. Equally justified, therefore, is revision of the assertion that plastid genomic data unequivocally support Amborella as the sole sister group of the remaining angiosperms (Jansen et al. 2007). Similarly, the sister-group relation between Ceratophyllum and the eudicots, recently reported by Moore et al. (2007), depends on the presence of the 1000 most divergent positions in alignment A.
This 1000-position alignment subset also exhibits a very high saturation level and is best omitted from this taxonomic level of analysis unless a good case can be made in the future as to why it should be used for deep angiosperm phylogeny. The choice between the two further alternative placements of Ceratophyllum, as a sister group to the clade subtending eudicots plus monocots (which is supported by subsets of alignment A with 1000 to 2000 positions removed) or as a sister to the clade ([Calycanthus, Liriodendron] [Drimys, Piper]) (which is supported by alignments with 2500 to 8000 positions removed), is more difficult to make. The mean distance among the angiosperms in the subset of the 2500 most divergent positions is 1.3 s/s. The substitution paths at this divergence level might or might not be well described by the general time-reversible family of substitution models. To make a firm judgement between these placements, we would need to apply some objective stopping criterion for the removal of moderately saturated sites, derived from the comparative performance of different substitution models applied to the different (slightly modified) data. Such a methodology is not currently available; however, we are working in this area. Strong and consistent bootstrap support for the placement of Ceratophyllum as sister to the magnoliids, through a long phase of variability decrease until the stage characterised by apparent loss of phylogenetic signal, can be interpreted in favour of a magnoliid affinity of this species.

References

Barkman TJ, Chenery G, McNeal JR, Lyons-Weile J, Ellisens WJ, Moore G, Wolfe AD, dePamphilis CW (2000) Independent and combined analyses of sequences from all three genomic compartments converge on the root of flowering plant phylogeny. Proc Natl Acad Sci USA 97:13166-13171

Bergsten J (2005) A review of long-branch attraction. Cladistics 21:163-193

Borsch T, Hilu KW, Quandt D, Wilde V, Neinhuis C, Barthlott W (2003) Non-coding plastid trnT-trnF sequences reveal a well resolved phylogeny of basal angiosperms. J Evol Biol 16:558-576

Crane PR, Friis EM, Pedersen KR (1995) The origin and early diversification of angiosperms. Nature 374:27-33

Ewing B, Green P (1998) Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res 8:186-194

Felsenstein J (1978) Cases in which parsimony or compatibility methods will be positively misleading. Syst Zool 27:401-410

Goremykin V, Hansmann S, Martin W (1997) Evolutionary analysis of 58 proteins encoded in six completely sequenced chloroplast genomes: revised molecular estimates of two seed plant divergence times. Plant Syst Evol 206:337-351

Goremykin VV, Holland B, Hirsch-Ernst KI, Hellwig FH (2003) The chloroplast genome of the basal angiosperm Calycanthus fertilis: structural and phylogenetic analyses. Plant Syst Evol 242:119-135

Goremykin VV, Hirsch-Ernst KI, Wolfl S, Hellwig FH (2004) The chloroplast genome of Nymphaea alba: whole-genome analyses and the problem of identifying the most basal angiosperm. Mol Biol Evol 21:1445-1454

Goremykin VV, Holland B, Hirsch-Ernst KI, Hellwig FH (2005) Analysis of Acorus calamus chloroplast genome and its phylogenetic implications. Mol Biol Evol 22:1813-1822

Goremykin VV, Hellwig FH (2006) A new test of phylogenetic model fitness addresses the issue of the basal angiosperm phylogeny. Gene 381:81-91

Guindon S, Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52:696-704

Hilu KW, Borsch T, Muller K, Soltis DE, Soltis PS, Savolainen V, Chase MW, Powell M, Alice L, Evans R et al (2003) Angiosperm phylogeny based on matK sequence information. Am J Bot 90:1758-1776

Hiratsuka J, Shimada H, Whittier R, Ishibashi T, Sakamoto M, Mori M, Kondo C, Honji Y, Sun CR, Meng BY et al (1989) The complete sequence of the rice (Oryza sativa) chloroplast genome: intermolecular recombination between distinct tRNA genes accounts for a major plastid DNA inversion during the evolution of the cereals. Mol Gen Genet 217:185-194

Jansen RK, Cai Z, Raubeson LA, Daniell H, dePamphilis CW, Leebens-Mack J, Muller KF, Guisinger-Bellian M, Haberle RC, Hansen AK et al (2007) Analysis of 81 genes from 64 plastid genomes resolves relationships in angiosperms and identifies genome-scale evolutionary patterns. Proc Natl Acad Sci USA 104:19369-19374

Jeffroy O, Brinkmann H, Delsuc F, Philippe H (2006) Phylogenomics: the beginning of incongruence? Trends Genet 22:225-231

Leebens-Mack J, Raubeson LA, Cui LY, Kuehl JV, Fourcade MH, Chumley TW, Boore JL, Jansen RK, dePamphilis CW (2005) Identifying the basal angiosperm node in chloroplast genome phylogenies: sampling one's way out of the Felsenstein zone. Mol Biol Evol 22:1948-1963

Mathews S, Donoghue MJ (1999) The root of angiosperm phylogeny inferred from duplicate phytochrome genes. Science 286:947-950

Mathews S, Donoghue MJ (2000) Basal angiosperm phylogeny inferred from duplicate phytochromes A and C. Int J Plant Sci 161(Suppl):S41-S55

Moore MJ, Bell CD, Soltis PS, Soltis DE (2007) Using plastid genome-scale data to resolve enigmatic relationships among basal angiosperms. Proc Natl Acad Sci USA 104:19363-19368

Murray MG, Thompson WF (1980) Rapid isolation of high molecular weight DNA. Nucleic Acids Res 8:4321-4325

Posada D, Crandall KA (1998) Modeltest: testing the model of DNA substitution. Bioinformatics 14:817-818

Parkinson CL, Adams KL, Palmer JD (1999) Multigene analyses identify the three earliest lineages of extant flowering plants. Curr Biol 9:1485-1488

Qiu Y-L, Lee J, Bernasconi-Quadroni F, Soltis DE, Soltis PS, Zanis M, Zimmer EA, Chen Z, Savolainen V, Chase MW (1999) The earliest angiosperms: evidence from mitochondrial, plastid and nuclear genomes. Nature 402:404-407

Qiu Y-L, Lee J, Bernasconi-Quadroni F, Soltis DE, Soltis PS, Zanis M, Zimmer EA, Chen Z, Savolainen V, Chase MW (2000) Phylogeny of basal angiosperms: analyses of five genes from three genomes.
Int J Plant Sci 161(Suppl):S3 S27 Qiu Y-L, Dombrovska O, Lee J, Li L, Whitlock BA, Bernasconi- Quadroni F, Rest JS, Davis CC, Borsch T, Hilu KW et al (25) Phylogenetic analyses of basal angiosperms based on nine plastid, mitochondrial, and nuclear genes. Int J Plant Sci 166:815 842 Soltis PS, Soltis DE, Chase MW (1999) Angiosperm phylogeny inferred from multiple genes as a tool for comparative biology. Nature 42:42 43 Soltis PS, Soltis DE, Zanis MJ, Kim S (2a) Basal lineages of angiosperms: relationships and implications for floral evolution. Int J Plant Sci 161(Suppl):S97 S17

24 J Mol Evol (29) 68:197 24 Soltis DE, Soltis PS, Chase MW, Mort ME, Albach DC, Zanis M, Savolainen V, Hahn WH, Hoot SB, Fay MF et al (2b) Angiosperm phylogeny inferred from 18S rdna, rbcl, and atpb sequences. Bot J Linn Soc 133:381 461 Soltis DE, Albert VA, Savolainen V, Hilu K, Qiu Y-L, Chase MW, Farris JS, Stefanovic S, Rice DW, Palmer JD, Soltis PS (24) Genome-scale data, angiosperm relationships, and ending incongruence : A cautionary tale in phylogenetics. Trends Plants Sci 9:477 483 Staden R, Beal KF, Bonfield JK (2) The Staden package 1998. Meth Mol Biol 132:115 13 Stefanovic S, Rice DW, Palmer JD (24) Long branch attraction, taxon sampling, and the earliest angiosperms: Amborella or monocots? BMC Evol Biol 4:35 Strimmer K, von Haeseler A (1996) Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol Biol Evol 13:964 969 Swofford DL (22) PAUP*: phylogenetic analysis using parsimony (* and other methods). Version 4. Sinauer, Sunderland Tang J, Xia H, Cao M, Zhang X, Zeng W, Hu S, Tong W, Wang J, Wang J, Yu J, Yang H, Zhu Z (24) A comparison of rice chloroplast genomes. Plant Physiol 135:412 42
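The site-removal reasoning in the closing Discussion paragraph can be illustrated with a minimal Python sketch. It assumes a toy alignment held as equal-length strings, scores each column by its number of mismatching sequence pairs (an assumed proxy for site variability, not necessarily the measure used in this study), and reports the mean uncorrected p-distance over the k most variable columns. The p-distance is swapped in for the model-based distance (s/s) quoted in the text and, unlike it, cannot exceed 1. All function names are hypothetical.

```python
from itertools import combinations

VALID = set("ACGT")  # gaps and ambiguity codes are skipped in all comparisons


def site_scores(seqs):
    """Per-column count of mismatching sequence pairs (assumed variability proxy)."""
    scores = []
    for i in range(len(seqs[0])):
        column = [s[i] for s in seqs]
        scores.append(sum(
            1 for a, b in combinations(column, 2)
            if a in VALID and b in VALID and a != b
        ))
    return scores


def most_variable_sites(seqs, k):
    """Indices of the k highest-scoring columns; ties are broken by position."""
    scores = site_scores(seqs)
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


def mean_p_distance(seqs, cols):
    """Mean uncorrected pairwise distance restricted to the given columns."""
    dists = []
    for s, t in combinations(seqs, 2):
        compared = [(s[i], t[i]) for i in cols
                    if s[i] in VALID and t[i] in VALID]
        if compared:
            dists.append(sum(a != b for a, b in compared) / len(compared))
    return sum(dists) / len(dists) if dists else float("nan")


def remove_sites(seqs, cols):
    """Return the alignment with the given columns deleted."""
    drop = set(cols)
    return ["".join(c for i, c in enumerate(s) if i not in drop) for s in seqs]


# Toy data only; these are not sequences from the study.
toy = ["ACGTAC", "ACGTTC", "ACGAGC", "ACGCCC"]
noisy = most_variable_sites(toy, 2)
print(noisy)                                   # [3, 4]
print(round(mean_p_distance(toy, noisy), 3))   # 0.917
print(remove_sites(toy, noisy))                # ['ACGC', 'ACGC', 'ACGC', 'ACGC']
```

A high mean distance within the removed subset is what marks those columns as candidates for saturation. Deciding how many columns to remove would still require the kind of objective stopping criterion that the text describes as unavailable.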