
Supplemental Information: Origin of land plants revisited in the light of sequence contamination and missing data

Simon Laurin-Lemay, Henner Brinkmann and Hervé Philippe

Figure S1. Comparison of the tree published in Finet et al. (2010) (A) with the tree of the corresponding decontaminated dataset (B). Both trees were inferred with the PROTMIXWAG model using RAxML and with the CAT+Γ model using PhyloBayes; the corresponding bootstrap values and posterior probabilities are given for each branch, a dot marking branches that reach both the 95% bootstrap threshold and the posterior probability threshold (the latter value is missing from the source text). The scale bar indicates the expected number of substitutions per site. Coleochaetales and Zygnematales are colored in red and green, respectively.

A: The tree is inferred from figure S2 of Finet et al., in which Mesostigma and Chlorokybus form a monophyletic group with a bootstrap support of 67%; Finet et al. erroneously indicated in their Fig. 1 a strong support for their paraphyly (PP=1 and BP>95).

B: The tree corresponds to the cleaned dataset. The incongruence test revealed 74 contaminant sequences in the 77 ribosomal protein alignments of Finet et al. and resulted in the removal of 99 sequences, because in 25 cases it was not possible to determine which sequence was the correct one, so both were removed (74 + 25 = 99). Note that after removing eight Eucheuma sequences contaminated by a chlorophyte, the support for the monophyly of Eucheuma and Gracilaria becomes maximal in the CAT+Γ analysis (versus 0.54 in the original study of Finet et al.).

Figure S2. Impact of missing data on phylogenomic inference. The 40-species trees were inferred with the site-heterogeneous CATGTR+Γ model using PhyloBayes; in addition to the posterior probabilities, we evaluated statistical support through a bootstrap analysis with the GTR+Γ model using RAxML, a dot marking branches that reach the thresholds of 0.99 for the posterior probability and 95% for the bootstrap value. The scale bar indicates the expected number of substitutions per site.

A: Ribosomal protein dataset (11,571 amino acid positions and 4.7 % of missing data).
B: Non-ribosomal protein dataset (31,729 amino acid positions and 24.1 % of missing data).
C: Reduced ribosomal protein dataset with the amount of missing data minimized for the key charophyte taxa (10,033 amino acid positions and 3.7 % of missing data).
D: Reduced non-ribosomal protein dataset with the amount of missing data minimized for the key charophyte taxa (12,327 amino acid positions and 18.4 % of missing data).

The three most complete datasets (A, C and D) recovered Zygnematales+Coleochaetales as the sister group of land plants, whereas the most incomplete one (B) supported Zygnematales as the sister group of land plants. Otherwise, the trees were identical except for minor differences within angiosperms, gymnosperms and ferns.

Figures S3–54: Single-gene phylogenies used to validate the contamination of the charophyte sequences detected by Blast search. The charophyte sequences used by Finet et al. were added to the alignments maintained in the Philippe lab, which contain >1000 eukaryotic species for which genomic and/or transcriptomic data are available, and a representative sample of species was selected for phylogenetic inference. The trees were built with the LG+F+Γ model using RAxML. The sequences used by Finet et al. that we considered to be contaminants are indicated by a red arrow. Figures S3–54 have been deposited in the Dryad repository.

Figures S55–109: Single-gene phylogenies used for the congruence test. The alignments from Finet et al. were used to infer single-gene phylogenies using RAxML. Only the 55 ribosomal protein trees with a significant incongruence (bootstrap support threshold of 70%) that has been interpreted as a contamination are shown. The 99 sequences that have been removed are indicated in red. Figures S55–109 have been deposited in the Dryad repository.
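The congruence test behind Figures S55–109 can be illustrated with a minimal Python sketch. It is not the authors' implementation (their protocol is written in Perl and is not reproduced here); the function names, the representation of a tree as a list of bipartitions, and the toy support values are all assumptions made for illustration. The sketch relies on the standard result that two bipartitions of the same taxon set can occur in one tree only if at least one of the four intersections of their sides is empty.

```python
from typing import FrozenSet, Iterable, List, Tuple

Taxa = FrozenSet[str]


def compatible(side_a: Taxa, side_c: Taxa, taxa: Taxa) -> bool:
    """Return True if bipartitions A|B and C|D can coexist in one tree,
    i.e. at least one of the four pairwise intersections is empty."""
    side_b = taxa - side_a
    side_d = taxa - side_c
    return any(not (x & y) for x in (side_a, side_b) for y in (side_c, side_d))


def incongruent_bipartitions(
    gene_splits: Iterable[Tuple[Iterable[str], float]],
    reference_splits: Iterable[Iterable[str]],
    gene_taxa: Iterable[str],
    min_support: float = 70.0,  # threshold taken from the text
) -> List[Tuple[Taxa, float]]:
    """List gene-tree bipartitions with support above min_support that
    conflict with at least one bipartition of the reference tree."""
    taxa = frozenset(gene_taxa)
    # Restrict the reference bipartitions to the taxa present in the gene tree.
    reference = [frozenset(s) & taxa for s in reference_splits]
    flagged = []
    for side, support in gene_splits:
        side = frozenset(side) & taxa
        if support <= min_support:
            continue  # weakly supported conflicts are ignored
        if any(not compatible(side, ref, taxa) for ref in reference):
            flagged.append((side, support))
    return flagged


# Toy example (hypothetical support values).
taxa = ["Penium", "Closterium", "Spirogyra",
        "Chaetosphaeridium", "Coleochaete", "Chara"]
reference = [
    ["Penium", "Closterium", "Spirogyra"],   # Zygnematales
    ["Chaetosphaeridium", "Coleochaete"],    # Coleochaetales
]
gene = [
    (["Penium", "Chaetosphaeridium"], 98.0),         # conflicts with both clades
    (["Penium", "Closterium", "Spirogyra"], 85.0),   # congruent
    (["Spirogyra", "Chara"], 40.0),                  # conflicting but weak
]
flagged = incongruent_bipartitions(gene, reference, taxa)
```

In this toy case only the Penium+Chaetosphaeridium grouping is flagged. As the methods stress, a flagged bipartition still needs human interpretation, since it may reflect stochastic error or long-branch attraction rather than contamination.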

Table S1: Contaminations detected in the alignments of Finet et al. (2010) by Blast search and by congruence test.

Columns: gene id; Finet et al. gene name; cross-contaminations among the seven charophytes sequenced by Finet et al. (1,2); a non-charophyte contaminating one of the seven charophytes sequenced by Finet et al. (1,2); a non-charophyte contaminating another non-charophyte used in the Finet et al. dataset (1,2); number of incongruences detected; incongruence detected among angiosperms or gymnosperms; NNI (3); LBA (4).

0 rpl4b Nitella=Nemertea rpl3 2 2 rpl5 Chaetosphaeridium=Penium rpl rpl7 A Nitella=Penium l12e D Chaetosphaeridium=Penium rpl2 Eucheuma=(Chlorophytes) rpl grc5 0 9 rpl1 Spirogyra=Fungi 1 10 rpl11b Chaetosphaeridium=Penium Volvox=Phytophthora rpl12b Eucheuma=(Chlorophytes) 0 12 rpl13 Chlorella=Ostreococcus 1 13 rpl16b Chaetosphaeridium=Amoebozoa Gracilaria=Suberites rpl14a Chaetosphaeridium=Penium rpl15a Chaetosphaeridium=Penium Nitella=Fungi rpl17 Eucheuma=(Chlorophytes) rpl18 Chaetosphaeridium=Penium rpl20 Klebsormidium=Chlorokybus rpl19a Chaetosphaeridium=Penium 1 20 rpl21 Eucheuma=(Chlorophytes) 1 21 rpl22 Chaetosphaeridium=Penium Nitella=Penium Spirogyra=~Glaucophyta 1 22 rpl23a Chaetosphaeridium=Penium Volvox=Phytophthora rpl rpl24 Chaetosphaeridium=Penium 3 2

A Klebsormidium=Chaetosphaeridium 25 rpl26 Nitella=Chaetosphaeridium 1 26 rpl27e rpl27 Chaetosphaeridium=Penium Nitella=Rotifer Eucheuma=(Chlorophytes) Ceratodon=Melampsora tribe3 51 Chaetosphaeridium=Glaucophyta rpl29 Chaetosphaeridium=Fungi Eucheuma=(Green plants) 0 30 rpl30 Penium=Chlorokybus Eucheuma=(Chlorophytes) rpl31 Chaetosphaeridium=Chlorokybus rpl32 Volvox=Phytophthora rpl34 Klebsormidium=Chaetosphaeridium Volvox=Trichoderma 3 34 rpl35 Klebsormidium=Chaetosphaeridium rpl33a Eucheuma=(Green plants) rpl rpl42 Chaetosphaeridium=Penium 2 38 rpl37a Chaetosphaeridium=Fungi Nitella=Fungi rpl43b Chaetosphaeridium=Penium Nitella=Chaetosphaeridium Eucheuma=(Green plants) rpl38 Chaetosphaeridium=Penium Eucheuma=(Green plants) rpl39 Eucheuma=(Chlorophytes) 4 42 rpl40 Chaetosphaeridium=Penium Coleochaete=Spirogyra Klebsormidium=Cyanophora sap40 Chaetosphaeridium=Penium 3 44 rps rps3 Klebsormidium=Chlorokybus rps1 Chaetosphaeridium=Penium rps4 Chaetosphaeridium=Penium Nitella=Penium rps5 Eucheuma=(Chlorophytes) rps6 Chaetosphaeridium=Penium rps7 Klebsormidium=Chaetosphaeridium Eucheuma=(Chlorophytes) rps8 0

52 rps9 Klebsormidium=Chaetosphaeridium rps10 Penium=Acanthamoeba rps11 Chaetosphaeridium=Penium Eucheuma=(Chlorophytes) l12e A rps13a Nitella=Fungi rps14 Chaetosphaeridium=Chlorokybus 1 58 rps15 Chaetosphaeridium=Penium Nitella=Penium 2 59 rps22a Chaetosphaeridium=Penium rps16 Klebsormidium=Chlorokybus rps17 Chaetosphaeridium=Penium rps rps19 Chaetosphaeridium=Penium Nitella=Penium Coleochaete=Amoebozoa rps tribe3 40 Chaetosphaeridium=Penium Nitella=Chaetosphaeridium rps rps24 Chaetosphaeridium=Chlorokybus Picea=Populus rps rps26 Chaetosphaeridium=Penium Nitella=Penium rps27 Nitella=Penium Chaetosphaeridium=Acanthamoeba 3 71 rps27a Chaetosphaeridium=Penium (paralogues for Cyanophora, Eucheuma, Gracilaria, Porphyra and Volvox) 1 72 rps28a Klebsormidium=Chaetosphaeridium Nitella=Chaetosphaeridium rps29 Eucheuma=(Chlorophytes) rla2 B Chaetosphaeridium=Penium 1 75 rla2 A Chaetosphaeridium=Penium Nitella=Chaetosphaeridium Cycas=Maconellicoccus Volvox=Trichoderma Eucheuma=(Green plants) 2

76 rpp

1 The first species is the contaminated one, while the second one is the likely source of contamination.
2 The contaminations also detected by the congruence test are shown in red.
3 NNI: incongruence corresponding to a nearest-neighbour interchange, likely due to stochastic error.
4 LBA: incongruence likely due to long-branch attraction (i.e. the incorrectly located species display an unusually long branch).

Table S2: Summary of the contaminations detected in the Finet et al. (2010) dataset. Rows are the contaminated studied organisms; columns are the contaminants.

Contaminated         Penium  Spirogy  Chaetos  Coleoch  Nitella  Klebsor  Chlorok  Non-charophytes
Penium               -       -        -        -        -        -        1(1)     1(1)
Spirogyra            -       -        -        -        -        -        -        2(1)
Chaetosphaeridium    29(29)  -        -        -        -        -        3(3)     5(2)
Coleochaete          -       1(0)     -        -        -        -        -        1(1)
Nitella              7(7)    -        5(2)     -        -        -        -        5(3)
Klebsormidium        -       -        6(5)     -        -        -        3(3)     1(1)
Chlorokybus          -       -        -        -        -        -        -        -
Non-charophytes      -       -        -        -        -        -        -        31(15)

Spirogy: Spirogyra; Chaetos: Chaetosphaeridium; Coleoch: Coleochaete; Klebsor: Klebsormidium; Chlorok: Chlorokybus. #: number of contaminations detected by Blast search. (#): number of contaminations detected by congruence test.
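The totals quoted in the text (101 contaminations found by Blast, 55 of them cross-contaminations among charophytes, 74 also found by the congruence test, 27 missed by it) can be recomputed from the counts of Table S2. The short Python check below is only a sketch: the assignment of each count to a contaminant column is an assumption inferred from the species pairs listed in Table S1, and the variable names are illustrative.

```python
# (contaminated, contaminant) -> (detected by Blast, detected by congruence test)
# Column assignment is inferred from the species pairs of Table S1 (assumption).
NON = "non-charophyte"
counts = {
    ("Penium", "Chlorokybus"): (1, 1),
    ("Penium", NON): (1, 1),
    ("Spirogyra", NON): (2, 1),
    ("Chaetosphaeridium", "Penium"): (29, 29),
    ("Chaetosphaeridium", "Chlorokybus"): (3, 3),
    ("Chaetosphaeridium", NON): (5, 2),
    ("Coleochaete", "Spirogyra"): (1, 0),
    ("Coleochaete", NON): (1, 1),
    ("Nitella", "Penium"): (7, 7),
    ("Nitella", "Chaetosphaeridium"): (5, 2),
    ("Nitella", NON): (5, 3),
    ("Klebsormidium", "Chaetosphaeridium"): (6, 5),
    ("Klebsormidium", "Chlorokybus"): (3, 3),
    ("Klebsormidium", NON): (1, 1),
    (NON, NON): (31, 15),
}

# Cross-contaminations: both partners are charophytes.
cross = [v for (victim, source), v in counts.items() if NON not in (victim, source)]
cross_blast = sum(b for b, _ in cross)
cross_congruence = sum(c for _, c in cross)

total_blast = sum(b for b, _ in counts.values())
total_congruence = sum(c for _, c in counts.values())
missed_by_congruence = total_blast - total_congruence

print(cross_blast, cross_congruence)        # 55 50
print(total_blast, total_congruence)        # 101 74
print(missed_by_congruence)                 # 27
```

With this column assignment the sums agree with every figure given in the text, including the 46 cases that involve a non-charophyte (101 − 55).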

Table S3: Impact of taxon sampling on the GTR+Γ and CATGTR+Γ inferences.

Datasets             Ribo         Non Ribo        Ribo C$      Non Ribo C$     Ribo C$ + Non Ribo C$
#positions           11,571       31,729          10,033       12,327          22,360
% of missing data    4.0–4.7 %    22.2–26.7 %     2.9–3.7 %    16.0–20.5 %     9.4–12.6 %
Taxon sampling
40 taxa              (63/0.99)    Z+E (95/1.00)   (75/0.99)    (58/0.79)       (87/1.00)
30 taxa*             (71/0.99)    Z+E (75/1.00)   (80/0.99)    (74/0.79)       (97/0.99)
28 taxa**            (71/0.83)    Z+E (69/1.00)   (75/0.93)    (73/0.55)       (85/0.96)
27 taxa***           (42/0.99)    Z+E (96/1.00)   (72/0.99)    Z+E (52/0.75)   (78/1.00)

Values in brackets are bootstrap support values followed by posterior probabilities.
Ribo: ribosomal protein dataset; Non Ribo: non-ribosomal protein dataset; Ribo C$: ribosomal protein dataset with a reduced amount of missing data (complete); Non Ribo C$: complete non-ribosomal protein dataset; Ribo C$ + Non Ribo C$: fusion of the complete ribosomal protein dataset with the complete non-ribosomal protein dataset.
* discarding the distant outgroup, Chlorophyta
** discarding the distant outgroup, Chlorophyta, and the intermediate outgroup, Chlorokybus+Mesostigma
*** discarding the fast-evolving spermatophytes (gymnosperms and angiosperms)
Unlabelled cells: Zygnematales+Coleochaetales; Z+E: Zygnematales+Embryophyta

Supplemental Experimental procedures

1. Reanalysis of the Finet et al. dataset

A rapid survey of the single-gene trees of ribosomal proteins used by Finet et al. [S1] revealed several cases where the amino acid sequences from two or more distantly related taxa (e.g. Chaetosphaeridium and Penium) were identical. This is highly unexpected, because these organisms diverged several hundred million years ago, and their sequences, even for the slow-evolving ribosomal proteins, should have accumulated some substitutions (as they did with respect to other, more closely related, species, e.g. Coleochaete, Closterium and Spirogyra). The simplest interpretation of these identical sequences is that cross-contamination occurred between these species at some stage of data collection (culture, RNA extraction, library construction, DNA sequencing or bioinformatic analysis).

Detecting contaminations and non-orthology issues is difficult, because a single gene does not contain sufficient phylogenetic signal to produce a fully resolved tree, i.e. stochastic errors may play an important role. Therefore we used two complementary approaches. First, when building the single-gene alignments, we blasted the new sequences to verify that they were more similar to their expected close relatives, and, when not, we verified with a taxon-rich phylogenetic analysis [S2]: this allowed us for instance to demonstrate several contaminations of the animals sequenced by Dunn et al. [S3] by parasites, e.g. acoels by microsporidia [S2]. Second, after building the super-matrix, we looked for bipartitions in all single-gene trees that received a bootstrap support higher than 70% and that are incongruent with the super-matrix-based tree [S4]. The underlying assumption is that the super-matrix-based tree constitutes a reasonable approximation of the species tree and can therefore be used as a reference. Although this protocol has been implemented in Perl [S5], the analysis of the incongruent bipartitions requires human expertise to interpret the likely cause of the incongruence. In most cases, incongruences are likely due to stochastic errors (i.e. trees can be made congruent by a single nearest-neighbour interchange) or systematic errors (i.e. long-branch attraction). For the very few remaining cases we encountered in our phylogenomic studies [S4–8], the cause is more difficult to assess, but when contaminations, horizontal gene transfers or paralogy were credible options, the offending sequences and more generally the corresponding genes were removed from further analysis. In this way, we manually constructed well-curated alignments of ~300 proteins from >1000 eukaryotic species that served as a reference dataset.

1.1 Detection of contaminations by blast searches

We blasted each amino acid sequence of the Finet et al. dataset against this taxon-rich reference dataset and identified all cases where the highest-scoring sequence does not belong to the same species as the query. We excluded angiosperms from this analysis because our reference dataset did not contain all species used by Finet et al. Among the 410 differences, most were within gymnosperms, ferns and mosses and are easily explained either by the fact that the sequences were identical and the first hit was by chance not the one from the seed species, or by the fact that the sequence from the correct species lacked a few amino acid positions and therefore had a slightly lower score. Yet the 101 remaining differences (Tables S1 and S2) cannot be easily explained. In 55 cases, two distantly related charophyte species sequenced by Finet et al. (Chaetosphaeridium, Chlorokybus, Coleochaete, Klebsormidium, Nitella, Penium, Spirogyra) had identical or almost identical sequences, while in 46 cases a Finet et al. sequence had the highest score with a very distantly related one, but the sequences were not identical (e.g. the red alga Eucheuma matched a chlorophyte, or the charalean Nitella matched a rotifer).

1.2 Confirmation of contaminations by phylogenetic analyses

For the cases where two distantly related species shared identical amino acid sequences, we used phylogenetic analysis including data from other charophytes [S8], in particular Klebsormidium subtile (versus Klebsormidium flaccidum in Finet et al. [S1]), Coleochaete scutata (versus Coleochaete orbicularis in Finet et al. [S1]), and Chara vulgaris, as well as the charophyte sequences from Finet et al. [S1] that we independently incorporated into our alignment, and analyzed the nucleotide sequences (see 1.3). As shown in Figs. S3–54, when two species display identical sequences in the Finet et al. dataset, we almost always have an additional sequence from one of the two species, a sequence that generally has the expected phylogenetic position. For instance, for rpl20 (Fig. S13), the Klebsormidium flaccidum sequence of Finet et al. [S1] is identical to the one of Chlorokybus, but another sequence from Klebsormidium flaccidum is almost identical to a sequence from Klebsormidium subtile, strongly suggesting that the Klebsormidium flaccidum sequence of Finet et al. corresponds to a contamination by Chlorokybus. For rpl7 A (Fig. S29), the Nitella sequence of Finet et al. is identical to the one of Penium, but another sequence from Nitella is the sister group of a sequence from Chara (i.e. forming monophyletic Charales), strongly suggesting that the Nitella sequence of Finet et al. corresponds to a contamination by Penium.

1.3 Confirmation of contaminations at the nucleotide level

To further confirm that, when two distantly related species shared identical amino acid sequences, one sequence was not genuine but a contamination, we looked at the nucleotide sequences. First, the program forty (Denis Baurain, unpublished, available upon request), which we use to add new sequences to our alignments, also keeps all DNA sequences; it was used to extract from the EST sequences the DNA sequences that display similarity to the amino acid sequence of interest. More precisely, forty blasted the amino acid sequence of Arabidopsis against the EST DNA sequences of charophytes and extracted all sequences that had an E-value lower than 1e-10. We then blasted these DNA sequences against amino acid alignments that combined our reference alignment and the Finet et al. alignment, and recorded the best hit. As expected, for all cases where the sequence of a charophyte was identical in Finet et al. and in our alignments, the number of hits to this sequence was much higher than to other sequences. For instance, for the 29 cases where Chaetosphaeridium and Penium have an identical amino acid sequence, only 13 reads from Chaetosphaeridium correspond to this sequence, whereas 595 reads from Penium do. This strongly suggests that when two distantly related species share identical amino acid sequences, the one with the fewest hits is a contamination. In addition, it should be noted that hits to other species are not exceptional, including other charophytes, but also very distantly related eukaryotes such as Fungi, Rotifera, Amoebozoa or Glaucocystophyta.
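The read-count argument above amounts to a simple decision rule, sketched here in Python. The function name and the `min_ratio` cut-off are hypothetical: the text only reports that the read counts differed greatly (13 versus 595) and does not define a numerical threshold.

```python
from typing import Dict, Optional, Tuple


def likely_contaminated(
    read_counts: Dict[str, int],
    min_ratio: float = 10.0,  # hypothetical cut-off, not taken from the text
) -> Optional[Tuple[str, str]]:
    """Given the number of reads supporting a shared sequence in two species,
    return (contaminated species, likely source) when one species contributes
    at least min_ratio times more reads than the other, otherwise None."""
    if len(read_counts) != 2:
        raise ValueError("expected read counts for exactly two species")
    (minor, n_minor), (major, n_major) = sorted(
        read_counts.items(), key=lambda item: item[1]
    )
    # max(..., 1) avoids treating 0 versus 0 reads as a decisive ratio.
    if n_major < min_ratio * max(n_minor, 1):
        return None  # counts too similar to decide from reads alone
    return minor, major


result = likely_contaminated({"Chaetosphaeridium": 13, "Penium": 595})
print(result)  # ('Chaetosphaeridium', 'Penium')
```

When the rule returns None, the read counts alone do not settle the direction of contamination, and the phylogenetic evidence of section 1.2 remains the only guide.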

Our assumption is that, if it is a contamination, the nucleotide sequences corresponding to identical amino acid sequences should be identical for the two species (except for sequencing errors). To test that hypothesis, we looked for possible SNPs between the nucleotide sequences from species A and the ones from species B, when the amino acid sequences of species A and B were identical in Finet et al. For each gene, TBLASTN hits that were at least 99% identical to the protein query for at least 100 nt were collected separately for each organism (Chaetosphaeridium, Chlorokybus, Klebsormidium, Nitella, Penium), and in an additional file all available organisms were merged. Sequence files were fed to CAP3 [S9] for contig assembly using default parameters, and the CAP3 output was then used for SNP prediction using a custom re-implementation of the SNPIDENTIFIER pipeline [S10]. Within each CAP3 contig, assembled reads were first pre-processed to improve overall sequence quality. Briefly, reads with more than 10% ambiguous bases (N) were discarded, whereas retained reads were further conservatively clipped at their extremities (30 nt at the 5′ end and 20% of read length at the 3′ end). SNPs were then detected on a per-column basis by comparison with the consensus sequence of each contig. To call a SNP, its minor allele had to comply with the following criteria: (1) to be observed in at least two reads and at a minimum frequency of 1% of the pre-processed reads; (2) to be centered in a 31-nt window perfectly matching the consensus sequence (ignoring shared gaps when counting). While the first criterion allowed detecting SNPs introduced by a very low number of reads, the second criterion avoided calling many false-positive SNPs associated with sequencing errors. Because slightly different contig lengths prevented position-based identification, SNPs were compared across organisms using their 31-nt conserved window as a primary key during the join step.

Out of the 44 genes for which cross-contaminations were detected, 11 had no predicted SNP, 15 had 1–10 SNPs, and the remaining 18 had more (13 genes in one class and 5 in another; the class boundaries are missing from the source text). Among the 423 predicted SNPs, 33 initially appeared to be restricted to merged sequence files and thus absent from single-organism files. However, close inspection of the CAP3 output files demonstrated that 29 were actually polymorphic in at least one single organism. These SNPs had been either not called or not properly joined across predictions due to neighboring SNPs that altered their 31-nt conserved window in single-organism files relative to the corresponding merged sequence file. This left us with 4 most likely true SNPs that were indeed organism-specific (1 in rpl11b and 3 in rps27a). Therefore, we conclude that in spite of their purported organism affiliations, the overwhelming majority of the reads analyzed in this SNP-calling experiment actually come from a single DNA source, which strongly suggests at least some form of cross-contamination between the five considered libraries.

1.4 Detection of contaminations by a congruence test

All analyses we performed demonstrated that, when two distantly related species shared identical amino acid sequences, the most obvious explanation, a cross-species contamination, was correct. It is unclear why these numerous contaminations were not detected by Finet et al., because they stated on p. 2220: "The integrity of the data set and especially the possible contamination status were verified by inferring independent trees for independent marker genes using PhyML and the WAG+Γ model. We carefully checked each of the trees for cases exhibiting a well supported branch (bootstrap percentage > 70%) incongruent with our concatenated analysis."

We applied our congruence software to the Finet et al. dataset and detected all bipartitions supported by a bootstrap value > 70% and incongruent with the super-matrix-based tree of Finet et al. As detailed in Table S1, most of the incongruent bipartitions correspond to the contaminations we detected previously. The remaining incongruences can be explained (i) as usual [S4–8], by stochastic errors, since they corresponded to nearest-neighbour interchanges, and (ii) by incorrect identification of the orthologs within angiosperms and gymnosperms. We observed that many ribosomal proteins were duplicated within these two clades, but since the phylogenetic signal is scarce for these slow-evolving proteins among closely related species (i.e., within genus or at most within family), detecting the precise history of duplications is often impossible. We considered that these potential non-orthology issues within angiosperms and gymnosperms were not problematic for detecting the sister group of land plants, but that the phylogeny within angiosperms and gymnosperms cannot be confidently inferred (however, the phylogeny is congruent with expectations, suggesting that non-orthology is not highly problematic).

The congruence protocol detected most, but not all, of the contaminations that we detected with the protocol described above (50 out of 55 cross-contaminations, Table S2). We decided to remove all erroneous sequences that created incongruence. It was not always possible to detect which species was contaminated. However, when Penium and Chaetosphaeridium were identical and this group clustered with the other Zygnematales (Closterium and Spirogyra), we considered that Chaetosphaeridium was contaminated by Penium, and the Chaetosphaeridium sequence was therefore removed from the dataset. If the phylogeny did not provide sufficient information, both sequences were removed. Our choices are highlighted on the single-gene trees (Figs. S55–109). In total, 99 sequences (corresponding to 74 detected contaminations) were removed. As shown in Table S2, contaminations were highly biased, in particular with many contaminations of Chaetosphaeridium by Penium, which could explain the non-monophyly of Coleochaetales and Zygnematales obtained by Finet et al. The remaining 27 contaminations (101 − 74) were not detected by the congruence protocol and were therefore not removed. We analysed the cleaned Finet et al. dataset using the same methods as Finet et al., that is, with the WAG+Γ model using RAxML [S11] and the CAT+Γ model using PhyloBayes [S12].

2. Construction of an updated dataset

The same protocol as in Wodniok et al. [S8] was used. Briefly, we started with the set of ~300 orthologous eukaryotic proteins that are regularly updated with new EST and genomic data in the Philippe lab and that were used in previous phylogenomic studies [S4–7]. We selected 40 Viridiplantae species based on the following criteria: (i) including all charophyte species, because they are key to the question of the origin of land plants, (ii) including only the most slowly evolving and the most complete species, to decrease the risk of systematic error and increase the phylogenetic signal, respectively, (iii) excluding closely related species, to reduce the environmental footprint due to the computational burden and to allow the use of the complex CATGTR+Γ model [S13]. Contrary to Finet et al. [S1], we did not include Glaucocystophyta and Rhodophyta, because they are distantly related and will likely provide more noise than signal for studying the origin of land plants. The unambiguously aligned regions were detected with Gblocks [S14] and the concatenation was constructed using SCaFoS [S15]. Only proteins with more than 30 species available were retained, yielding an alignment of 43,300 positions from 164 proteins. We applied the congruence protocol [S4] and did not detect any incongruence that cannot be explained by stochastic or systematic errors (i.e., incongruences corresponding to nearest-neighbour interchange or to taxa with unusually long branches, respectively; data not shown). Because corroboration between independent datasets is key to solving difficult phylogenetic questions, we divided our dataset into ribosomal (11,571 positions and 4.7 % of missing data) and non-ribosomal (31,729 positions and 24.1 % of missing data) datasets.

3. Phylogenetic inference and missing data

We used cross-validation to determine the best-fitting model among WAG+F, LG+F, GTR, CAT and CATGTR (the heterogeneity among sites being modelled by a Gamma distribution), as described in [S16]. The analysis was performed in PhyloBayes version 3.0, using ten randomly generated replicates, in which the original dataset was divided into a training dataset (9/10 of positions), used to estimate the parameters of the given model, and a test dataset (1/10 of positions), used to calculate the likelihood scores with these parameters. In agreement with previous studies [S4, S6–8, S16, S17], the rank of the models was WAG+F < LG+F < GTR < CAT < CATGTR. We therefore inferred all phylogenies with the site-homogeneous GTR model using RAxML and the site-heterogeneous CATGTR model using PhyloBayes. To estimate statistical support, bootstrap on positions was used for the GTR model, and jackknife of genes (with random sampling of 66% of the genes, i.e. similar to the number of characters that are discarded in a bootstrap replicate) for the CATGTR model.
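The gene jackknife and its link to the bootstrap can be sketched in Python. A bootstrap replicate of n characters is expected to contain a fraction 1 − (1 − 1/n)^n of the distinct original characters, which tends to 1 − 1/e ≈ 63% for large n; sampling 66% of the genes without replacement therefore keeps a comparable share of the data. The function names, the number of replicates and the placeholder gene names below are assumptions for illustration, not details taken from the text.

```python
import random
from typing import List, Optional, Sequence


def expected_bootstrap_coverage(n_characters: int) -> float:
    """Expected fraction of distinct characters present in one bootstrap
    replicate (sampling n characters with replacement)."""
    return 1.0 - (1.0 - 1.0 / n_characters) ** n_characters


def jackknife_genes(
    genes: Sequence[str],
    fraction: float = 0.66,      # proportion of genes kept, from the text
    replicates: int = 100,       # hypothetical number of replicates
    seed: Optional[int] = None,
) -> List[List[str]]:
    """Draw gene subsets without replacement; each subset would then be
    concatenated and analysed as one jackknife replicate."""
    rng = random.Random(seed)
    n_kept = round(fraction * len(genes))
    return [sorted(rng.sample(list(genes), n_kept)) for _ in range(replicates)]


# Placeholder names for the 164 proteins of the updated dataset.
genes = [f"gene_{i:03d}" for i in range(164)]
subsets = jackknife_genes(genes, seed=1)

print(len(subsets[0]))                                  # 108
print(round(expected_bootstrap_coverage(43_300), 3))    # 0.632
```

Sampling whole genes rather than individual positions keeps each replicate as a set of complete alignments, at the cost of coarser resampling than a position bootstrap.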

Missing data can decrease the accuracy of phylogenomic inference, because they reduce the effective number of species and hence hamper the ability to detect multiple substitutions (unpublished results). To evaluate whether the incongruence between the ribosomal and non-ribosomal datasets could be due to a higher level of missing data in the second dataset, we decided to build a more complete alignment. However, there were only 3,345 complete positions out of 43,300, likely insufficient to obtain significant statistical support. We therefore decided to maximize the amount of known character states in the part of the tree we were interested in, i.e. the relationships among the paraphyletic charophyte algae and embryophytes. In addition, 7 out of the 10 Chlorophyta were almost complete (less than 7% of missing data), as well as 11 out of the 21 Embryophyta, suggesting that the ancestral sequences of Chlorophyta and Embryophyta, which are of prime importance for the question of interest, will not be seriously affected by missing data. We therefore removed positions if Closterium and Penium were absent, if Spirogyra was absent, if Chaetosphaeridium was absent, if Coleochaete was absent, if Chara and Nitella were absent, if Klebsormidium was absent or if Chlorokybus was absent. About 50% of the positions (20,940) were removed, leaving 22,360 positions and allowing us to reduce the amount of missing data from 18.9% to 11.8%.

4. Impact of taxon sampling

As Wodniok et al. [S8] found either Zygnematales+Embryophyta or Zygnematales+Coleochaetales depending on the taxon sampling used, we studied the effect of varying taxon sampling on our phylogenetic inference. Three additional datasets were created via subsampling of our 40 taxa: (i) discarding the distant outgroup, Chlorophyta (30 taxa); (ii) discarding the distant outgroup, Chlorophyta, and the intermediate outgroup, Chlorokybus+Mesostigma (28 taxa); (iii) discarding the fast-evolving embryophytes, i.e. the spermatophytes (27 taxa). The inferred trees were almost insensitive to taxon sampling (Table S3). In particular, we observed the transition from Zygnematales+Embryophyta to Zygnematales+Coleochaetales when missing data were reduced in the non-ribosomal dataset (except for the 27-species subsample). Our results were therefore relatively robust to taxon sampling. However, it should be noted that the number of charophyte species is small (9) and therefore analyses with denser sampling are certainly needed.

Supplemental references

S1. Finet, C., Timme, R.E., Delwiche, C.F., and Marletaz, F. (2010). Multigene phylogeny of the green lineage reveals the origin and diversification of land plants. Curr Biol 20,
S2. Philippe, H., Brinkmann, H., Lavrov, D.V., Littlewood, D.T., Manuel, M., Worheide, G., and Baurain, D. (2011). Resolving difficult phylogenetic questions: why more sequences are not enough. PLoS Biol 9, e
S3. Dunn, C.W., Hejnol, A., Matus, D.Q., Pang, K., Browne, W.E., Smith, S.A., Seaver, E., Rouse, G.W., Obst, M., Edgecombe, G.D., et al. (2008). Broad phylogenomic sampling improves resolution of the animal tree of life. Nature 452,

S4. Philippe, H., Derelle, R., Lopez, P., Pick, K., Borchiellini, C., Boury-Esnault, N., Vacelet, J., Renard, E., Houliston, E., Queinnec, E., et al. (2009). Phylogenomics revives traditional views on deep animal relationships. Curr Biol 19,
S5. Baurain, D., Brinkmann, H., Petersen, J., Rodriguez-Ezpeleta, N., Stechmann, A., Demoulin, V., Roger, A.J., Burger, G., Lang, B.F., and Philippe, H. (2010). Phylogenomic evidence for separate acquisition of plastids in cryptophytes, haptophytes, and stramenopiles. Mol Biol Evol 27,
S6. Philippe, H., Brinkmann, H., Copley, R.R., Moroz, L.L., Nakano, H., Poustka, A.J., Wallberg, A., Peterson, K.J., and Telford, M.J. (2011). Acoelomorph flatworms are deuterostomes related to Xenoturbella. Nature 470,
S7. Rota-Stabelli, O., Campbell, L., Brinkmann, H., Edgecombe, G.D., Longhorn, S.J., Peterson, K.J., Pisani, D., Philippe, H., and Telford, M.J. (2011). A congruent solution to arthropod phylogeny: phylogenomics, microRNAs and morphology support monophyletic Mandibulata. Proc Biol Sci 278,
S8. Wodniok, S., Brinkmann, H., Glockner, G., Heidel, A.J., Philippe, H., Melkonian, M., and Becker, B. (2011). Origin of land plants: Do conjugating green algae hold the key? BMC Evol Biol 11, 104.
S9. Huang, X., and Madan, A. (1999). CAP3: A DNA sequence assembly program. Genome Res 9,
S10. Gorbach, D.M., Hu, Z.L., Du, Z.Q., and Rothschild, M.F. (2009). SNP discovery in Litopenaeus vannamei with a new computational pipeline. Anim Genet 40,
S11. Stamatakis, A. (2006). RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22,

S12. Lartillot, N., Lepage, T., and Blanquart, S. (2009). PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics 25,
S13. Lartillot, N., and Philippe, H. (2004). A Bayesian mixture model for across-site heterogeneities in the amino acid replacement process. Mol Biol Evol 21,
S14. Castresana, J. (2000). Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol 17,
S15. Roure, B., Rodriguez-Ezpeleta, N., and Philippe, H. (2007). SCaFoS: a tool for Selection, Concatenation and Fusion of Sequences for phylogenomics. BMC Evol Biol 7 Suppl 1, S2.
S16. Lartillot, N., and Philippe, H. (2008). Improvement of molecular phylogenetic inference and the phylogeny of Bilateria. Philos Trans R Soc Lond B Biol Sci 363,
S17. Lartillot, N., Brinkmann, H., and Philippe, H. (2007). Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evol Biol 7 Suppl 1, S4.


More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Bioinformatics tools for phylogeny and visualization. Yanbin Yin Bioinformatics tools for phylogeny and visualization Yanbin Yin 1 Homework assignment 5 1. Take the MAFFT alignment http://cys.bios.niu.edu/yyin/teach/pbb/purdue.cellwall.list.lignin.f a.aln as input and

More information

1 ATGGGTCTC 2 ATGAGTCTC

1 ATGGGTCTC 2 ATGAGTCTC We need an optimality criterion to choose a best estimate (tree) Other optimality criteria used to choose a best estimate (tree) Parsimony: begins with the assumption that the simplest hypothesis that

More information

Hillis DM Inferring complex phylogenies. Nature 383:

Hillis DM Inferring complex phylogenies. Nature 383: Hillis DM. 1996. Inferring complex phylogenies. Nature 383: 130-131. Triangles: parsimony Squares: neighbor-joining (under specified model) Circles: UPGMA Designing your phylogenetic analysis Choice of

More information

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

More information

8/23/2014. Phylogeny and the Tree of Life

8/23/2014. Phylogeny and the Tree of Life Phylogeny and the Tree of Life Chapter 26 Objectives Explain the following characteristics of the Linnaean system of classification: a. binomial nomenclature b. hierarchical classification List the major

More information

UoN, CAS, DBSC BIOL102 lecture notes by: Dr. Mustafa A. Mansi. The Phylogenetic Systematics (Phylogeny and Systematics)

UoN, CAS, DBSC BIOL102 lecture notes by: Dr. Mustafa A. Mansi. The Phylogenetic Systematics (Phylogeny and Systematics) - Phylogeny? - Systematics? The Phylogenetic Systematics (Phylogeny and Systematics) - Phylogenetic systematics? Connection between phylogeny and classification. - Phylogenetic systematics informs the

More information

Phylogenomics: the beginning of incongruence?

Phylogenomics: the beginning of incongruence? Phylogenomics: the beginning of incongruence? Olivier Jeffroy, Henner Brinkmann, Frédéric Delsuc, Hervé Philippe To cite this version: Olivier Jeffroy, Henner Brinkmann, Frédéric Delsuc, Hervé Philippe.

More information

Estimating Evolutionary Trees. Phylogenetic Methods

Estimating Evolutionary Trees. Phylogenetic Methods Estimating Evolutionary Trees v if the data are consistent with infinite sites then all methods should yield the same tree v it gets more complicated when there is homoplasy, i.e., parallel or convergent

More information

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

CHAPTERS 24-25: Evidence for Evolution and Phylogeny CHAPTERS 24-25: Evidence for Evolution and Phylogeny 1. For each of the following, indicate how it is used as evidence of evolution by natural selection or shown as an evolutionary trend: a. Paleontology

More information

Report. Multigene Phylogeny of the Green Lineage Reveals the Origin and Diversification of Land Plants

Report. Multigene Phylogeny of the Green Lineage Reveals the Origin and Diversification of Land Plants Current Biology 20, 2217 2222, December 21, 2010 ª2010 Elsevier Ltd All rights reserved DOI 10.1016/j.cub.2010.11.035 Multigene Phylogeny of the Green Lineage Reveals the Origin and Diversification of

More information

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees Erin Molloy and Tandy Warnow {emolloy2, warnow}@illinois.edu University of Illinois at Urbana

More information

Phylogenetics - Orthology, phylogenetic experimental design and phylogeny reconstruction. Lesser Tenrec (Echinops telfairi)

Phylogenetics - Orthology, phylogenetic experimental design and phylogeny reconstruction. Lesser Tenrec (Echinops telfairi) Phylogenetics - Orthology, phylogenetic experimental design and phylogeny reconstruction Lesser Tenrec (Echinops telfairi) Goals: 1. Use phylogenetic experimental design theory to select optimal taxa to

More information

Consensus Methods. * You are only responsible for the first two

Consensus Methods. * You are only responsible for the first two Consensus Trees * consensus trees reconcile clades from different trees * consensus is a conservative estimate of phylogeny that emphasizes points of agreement * philosophy: agreement among data sets is

More information

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley

PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION Integrative Biology 200B Spring 2009 University of California, Berkeley "PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley B.D. Mishler Jan. 22, 2009. Trees I. Summary of previous lecture: Hennigian

More information

MiGA: The Microbial Genome Atlas

MiGA: The Microbial Genome Atlas December 12 th 2017 MiGA: The Microbial Genome Atlas Jim Cole Center for Microbial Ecology Dept. of Plant, Soil & Microbial Sciences Michigan State University East Lansing, Michigan U.S.A. Where I m From

More information

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega BLAST Multiple Sequence Alignments: Clustal Omega What does basic BLAST do (e.g. what is input sequence and how does BLAST look for matches?) Susan Parrish McDaniel College Multiple Sequence Alignments

More information

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree) I9 Introduction to Bioinformatics, 0 Phylogenetic ree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & omputing, IUB Evolution theory Speciation Evolution of new organisms is driven by

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships Chapter 26: Phylogeny and the Tree of Life You Must Know The taxonomic categories and how they indicate relatedness. How systematics is used to develop phylogenetic trees. How to construct a phylogenetic

More information

Session 5: Phylogenomics

Session 5: Phylogenomics Session 5: Phylogenomics B.- Phylogeny based orthology assignment REMINDER: Gene tree reconstruction is divided in three steps: homology search, multiple sequence alignment and model selection plus tree

More information

Phylogenetics in the Age of Genomics: Prospects and Challenges

Phylogenetics in the Age of Genomics: Prospects and Challenges Phylogenetics in the Age of Genomics: Prospects and Challenges Antonis Rokas Department of Biological Sciences, Vanderbilt University http://as.vanderbilt.edu/rokaslab http://pubmed2wordle.appspot.com/

More information

PHYLOGENY AND SYSTEMATICS

PHYLOGENY AND SYSTEMATICS AP BIOLOGY EVOLUTION/HEREDITY UNIT Unit 1 Part 11 Chapter 26 Activity #15 NAME DATE PERIOD PHYLOGENY AND SYSTEMATICS PHYLOGENY Evolutionary history of species or group of related species SYSTEMATICS Study

More information

Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata.

Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata. Supplementary Note S2 Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata. Phylogenetic trees reconstructed by a variety of methods from either single-copy orthologous loci (Class

More information

Supplementary Materials for

Supplementary Materials for advances.sciencemag.org/cgi/content/full/1/8/e1500527/dc1 Supplementary Materials for A phylogenomic data-driven exploration of viral origins and evolution The PDF file includes: Arshan Nasir and Gustavo

More information

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies 1 What is phylogeny? Essay written for the course in Markov Chains 2004 Torbjörn Karfunkel Phylogeny is the evolutionary development

More information

Smith et al. American Journal of Botany 98(3): Data Supplement S2 page 1

Smith et al. American Journal of Botany 98(3): Data Supplement S2 page 1 Smith et al. American Journal of Botany 98(3):404-414. 2011. Data Supplement S1 page 1 Smith, Stephen A., Jeremy M. Beaulieu, Alexandros Stamatakis, and Michael J. Donoghue. 2011. Understanding angiosperm

More information

Department of Computer Science, Technical University of Munich, Bolzmannstr. 3, 85747, Garching b. Mu nchen, Germany 2

Department of Computer Science, Technical University of Munich, Bolzmannstr. 3, 85747, Garching b. Mu nchen, Germany 2 Phylogenetic Bootstrapping under Resource Constraints: Higher Model Accuracy or more Replicates? Alexandros Stamatakis 1* and Vincent Rousset 2 1 Department of Computer Science, Technical University of

More information

BLAST. Varieties of BLAST

BLAST. Varieties of BLAST BLAST Basic Local Alignment Search Tool (1990) Altschul, Gish, Miller, Myers, & Lipman Uses short-cuts or heuristics to improve search speed Like speed-reading, does not examine every nucleotide of database

More information

PhyQuart-A new algorithm to avoid systematic bias & phylogenetic incongruence

PhyQuart-A new algorithm to avoid systematic bias & phylogenetic incongruence PhyQuart-A new algorithm to avoid systematic bias & phylogenetic incongruence Are directed quartets the key for more reliable supertrees? Patrick Kück Department of Life Science, Vertebrates Division,

More information

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise Bot 421/521 PHYLOGENETIC ANALYSIS I. Origins A. Hennig 1950 (German edition) Phylogenetic Systematics 1966 B. Zimmerman (Germany, 1930 s) C. Wagner (Michigan, 1920-2000) II. Characters and character states

More information

Cladistics and Bioinformatics Questions 2013

Cladistics and Bioinformatics Questions 2013 AP Biology Name Cladistics and Bioinformatics Questions 2013 1. The following table shows the percentage similarity in sequences of nucleotides from a homologous gene derived from five different species

More information

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline Phylogenetics Todd Vision iology 522 March 26, 2007 pplications of phylogenetics Studying organismal or biogeographic history Systematics ating events in the fossil record onservation biology Studying

More information

A (short) introduction to phylogenetics

A (short) introduction to phylogenetics A (short) introduction to phylogenetics Thibaut Jombart, Marie-Pauline Beugin MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis with PR Statistics, Millport Field

More information

Origins of Life. Fundamental Properties of Life. Conditions on Early Earth. Evolution of Cells. The Tree of Life

Origins of Life. Fundamental Properties of Life. Conditions on Early Earth. Evolution of Cells. The Tree of Life The Tree of Life Chapter 26 Origins of Life The Earth formed as a hot mass of molten rock about 4.5 billion years ago (BYA) -As it cooled, chemically-rich oceans were formed from water condensation Life

More information

Basic Local Alignment Search Tool

Basic Local Alignment Search Tool Basic Local Alignment Search Tool Alignments used to uncover homologies between sequences combined with phylogenetic studies o can determine orthologous and paralogous relationships Local Alignment uses

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information

Using Bioinformatics to Study Evolutionary Relationships Instructions

Using Bioinformatics to Study Evolutionary Relationships Instructions 3 Using Bioinformatics to Study Evolutionary Relationships Instructions Student Researcher Background: Making and Using Multiple Sequence Alignments One of the primary tasks of genetic researchers is comparing

More information

Bootstrapping and Tree reliability. Biol4230 Tues, March 13, 2018 Bill Pearson Pinn 6-057

Bootstrapping and Tree reliability. Biol4230 Tues, March 13, 2018 Bill Pearson Pinn 6-057 Bootstrapping and Tree reliability Biol4230 Tues, March 13, 2018 Bill Pearson wrp@virginia.edu 4-2818 Pinn 6-057 Rooting trees (outgroups) Bootstrapping given a set of sequences sample positions randomly,

More information

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other?

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other? Phylogeny and systematics Why are these disciplines important in evolutionary biology and how are they related to each other? Phylogeny and systematics Phylogeny: the evolutionary history of a species

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: istance-based methods Ultrametric Additive: UPGMA Transformed istance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: Distance-based methods Ultrametric Additive: UPGMA Transformed Distance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

BINF6201/8201. Molecular phylogenetic methods

BINF6201/8201. Molecular phylogenetic methods BINF60/80 Molecular phylogenetic methods 0-7-06 Phylogenetics Ø According to the evolutionary theory, all life forms on this planet are related to one another by descent. Ø Traditionally, phylogenetics

More information

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST Introduction Bioinformatics is a powerful tool which can be used to determine evolutionary relationships and

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Distance Methods Character Methods

More information

ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support

ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support Siavash Mirarab University of California, San Diego Joint work with Tandy Warnow Erfan Sayyari

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S3 (box) Methods Methods Genome weighting The currently available collection of archaeal and bacterial genomes has a highly biased distribution of isolates across taxa. For example,

More information

Phylogeny and the Tree of Life

Phylogeny and the Tree of Life Chapter 26 Phylogeny and the Tree of Life PowerPoint Lecture Presentations for Biology Eighth Edition Neil Campbell and Jane Reece Lectures by Chris Romero, updated by Erin Barley with contributions from

More information

Integrative Biology 200A "PRINCIPLES OF PHYLOGENETICS" Spring 2008

Integrative Biology 200A PRINCIPLES OF PHYLOGENETICS Spring 2008 Integrative Biology 200A "PRINCIPLES OF PHYLOGENETICS" Spring 2008 University of California, Berkeley B.D. Mishler March 18, 2008. Phylogenetic Trees I: Reconstruction; Models, Algorithms & Assumptions

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S1 (box). Supplementary Methods description. Prokaryotic Genome Database Archaeal and bacterial genome sequences were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)

More information

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29):

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29): Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29): Statistical estimation of models of sequence evolution Phylogenetic inference using maximum likelihood:

More information

Computational approaches for functional genomics

Computational approaches for functional genomics Computational approaches for functional genomics Kalin Vetsigian October 31, 2001 The rapidly increasing number of completely sequenced genomes have stimulated the development of new methods for finding

More information

Supplementary Information

Supplementary Information Supplementary Information Supplementary Figure 1. Schematic pipeline for single-cell genome assembly, cleaning and annotation. a. The assembly process was optimized to account for multiple cells putatively

More information

Phylogenetics. BIOL 7711 Computational Bioscience

Phylogenetics. BIOL 7711 Computational Bioscience Consortium for Comparative Genomics! University of Colorado School of Medicine Phylogenetics BIOL 7711 Computational Bioscience Biochemistry and Molecular Genetics Computational Bioscience Program Consortium

More information

Phylogeny and the Tree of Life

Phylogeny and the Tree of Life Chapter 26 Phylogeny and the Tree of Life PowerPoint Lecture Presentations for Biology Eighth Edition Neil Campbell and Jane Reece Lectures by Chris Romero, updated by Erin Barley with contributions from

More information

Tree of Life iological Sequence nalysis Chapter http://tolweb.org/tree/ Phylogenetic Prediction ll organisms on Earth have a common ancestor. ll species are related. The relationship is called a phylogeny

More information

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University Phylogenetics: Building Phylogenetic Trees COMP 571 - Fall 2010 Luay Nakhleh, Rice University Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary

More information

What to do with a billion years of evolution

What to do with a billion years of evolution What to do with a billion years of evolution Melissa DeBiasse University of Florida Whitney Laboratory for Marine Bioscience @MelissaDeBiasse melissa.debiasse@gmail.com melissadebiasse.weebly.com Acknowledgements

More information

Isolating - A New Resampling Method for Gene Order Data

Isolating - A New Resampling Method for Gene Order Data Isolating - A New Resampling Method for Gene Order Data Jian Shi, William Arndt, Fei Hu and Jijun Tang Abstract The purpose of using resampling methods on phylogenetic data is to estimate the confidence

More information

How to read and make phylogenetic trees Zuzana Starostová

How to read and make phylogenetic trees Zuzana Starostová How to read and make phylogenetic trees Zuzana Starostová How to make phylogenetic trees? Workflow: obtain DNA sequence quality check sequence alignment calculating genetic distances phylogeny estimation

More information

Comparative Bioinformatics Midterm II Fall 2004

Comparative Bioinformatics Midterm II Fall 2004 Comparative Bioinformatics Midterm II Fall 2004 Objective Answer, part I: For each of the following, select the single best answer or completion of the phrase. (3 points each) 1. Deinococcus radiodurans

More information

Using Phylogenomics to Predict Novel Fungal Pathogenicity Genes

Using Phylogenomics to Predict Novel Fungal Pathogenicity Genes Using Phylogenomics to Predict Novel Fungal Pathogenicity Genes David DeCaprio, Ying Li, Hung Nguyen (sequenced Ascomycetes genomes courtesy of the Broad Institute) Phylogenomics Combining whole genome

More information

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information # Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Details of PRF Methodology In the Poisson Random Field PRF) model, it is assumed that non-synonymous mutations at a given gene are either

More information

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Paul has many great tools for teaching phylogenetics at his web site: http://hydrodictyon.eeb.uconn.edu/people/plewis

More information

USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES

USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES HOW CAN BIOINFORMATICS BE USED AS A TOOL TO DETERMINE EVOLUTIONARY RELATIONSHPS AND TO BETTER UNDERSTAND PROTEIN HERITAGE?

More information

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Molecular phylogeny How to infer phylogenetic trees using molecular sequences Molecular phylogeny How to infer phylogenetic trees using molecular sequences ore Samuelsson Nov 2009 Applications of phylogenetic methods Reconstruction of evolutionary history / Resolving taxonomy issues

More information

What is Phylogenetics

What is Phylogenetics What is Phylogenetics Phylogenetics is the area of research concerned with finding the genetic connections and relationships between species. The basic idea is to compare specific characters (features)

More information

SEQUENCING NUCLEAR MARKERS IN FRESHWATER GREEN ALGAE: CHARA SUBSECTION WILLDENOWIA

SEQUENCING NUCLEAR MARKERS IN FRESHWATER GREEN ALGAE: CHARA SUBSECTION WILLDENOWIA SEQUENCING NUCLEAR MARKERS IN FRESHWATER GREEN ALGAE: CHARA SUBSECTION WILLDENOWIA Stephen D. Gottschalk Department of Biological Sciences, Fordham University, 441 E Fordham Rd, Bronx, NY 10458, USA ABSTRACT

More information

Phylogeny 9/8/2014. Evolutionary Relationships. Data Supporting Phylogeny. Chapter 26

Phylogeny 9/8/2014. Evolutionary Relationships. Data Supporting Phylogeny. Chapter 26 Phylogeny Chapter 26 Taxonomy Taxonomy: ordered division of organisms into categories based on a set of characteristics used to assess similarities and differences Carolus Linnaeus developed binomial nomenclature,

More information

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Molecular phylogeny How to infer phylogenetic trees using molecular sequences Molecular phylogeny How to infer phylogenetic trees using molecular sequences ore Samuelsson Nov 200 Applications of phylogenetic methods Reconstruction of evolutionary history / Resolving taxonomy issues

More information

Phylogenomics. Jeffrey P. Townsend Department of Ecology and Evolutionary Biology Yale University. Tuesday, January 29, 13

Phylogenomics. Jeffrey P. Townsend Department of Ecology and Evolutionary Biology Yale University. Tuesday, January 29, 13 Phylogenomics Jeffrey P. Townsend Department of Ecology and Evolutionary Biology Yale University How may we improve our inferences? How may we improve our inferences? Inferences Data How may we improve

More information

Phylogenetics: Building Phylogenetic Trees

Phylogenetics: Building Phylogenetic Trees 1 Phylogenetics: Building Phylogenetic Trees COMP 571 Luay Nakhleh, Rice University 2 Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary model should

More information

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Bayesian Phylogenetic Analysis COMP 571 - Spring 2015 Luay Nakhleh, Rice University Bayes Rule P(X = x Y = y) = P(X = x, Y = y) P(Y = y) = P(X = x)p(y = y X = x) P x P(X = x 0 )P(Y = y X

More information

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Bio 1B Lecture Outline (please print and bring along) Fall, 2007 Bio 1B Lecture Outline (please print and bring along) Fall, 2007 B.D. Mishler, Dept. of Integrative Biology 2-6810, bmishler@berkeley.edu Evolution lecture #5 -- Molecular genetics and molecular evolution

More information

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010 BLAST Database Searching BME 110: CompBio Tools Todd Lowe April 8, 2010 Admin Reading: Read chapter 7, and the NCBI Blast Guide and tutorial http://www.ncbi.nlm.nih.gov/blast/why.shtml Read Chapter 8 for

More information

Comparative Genomics II

Comparative Genomics II Comparative Genomics II Advances in Bioinformatics and Genomics GEN 240B Jason Stajich May 19 Comparative Genomics II Slide 1/31 Outline Introduction Gene Families Pairwise Methods Phylogenetic Methods

More information

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Homology Modeling. Roberto Lins EPFL - summer semester 2005 Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,

More information

Reconstructing the history of lineages

Reconstructing the history of lineages Reconstructing the history of lineages Class outline Systematics Phylogenetic systematics Phylogenetic trees and maps Class outline Definitions Systematics Phylogenetic systematics/cladistics Systematics

More information

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5. Five Sami Khuri Department of Computer Science San José State University San José, California, USA sami.khuri@sjsu.edu v Distance Methods v Character Methods v Molecular Clock v UPGMA v Maximum Parsimony

More information

Classification and Phylogeny

Classification and Phylogeny Classification and Phylogeny The diversity of life is great. To communicate about it, there must be a scheme for organization. There are many species that would be difficult to organize without a scheme

More information

Lecture 11 Friday, October 21, 2011

Lecture 11 Friday, October 21, 2011 Lecture 11 Friday, October 21, 2011 Phylogenetic tree (phylogeny) Darwin and classification: In the Origin, Darwin said that descent from a common ancestral species could explain why the Linnaean system

More information

Fine-Scale Phylogenetic Discordance across the House Mouse Genome

Fine-Scale Phylogenetic Discordance across the House Mouse Genome Fine-Scale Phylogenetic Discordance across the House Mouse Genome Michael A. White 1,Cécile Ané 2,3, Colin N. Dewey 4,5,6, Bret R. Larget 2,3, Bret A. Payseur 1 * 1 Laboratory of Genetics, University of

More information

Phylogeny and the Tree of Life

Phylogeny and the Tree of Life LECTURE PRESENTATIONS For CAMPBELL BIOLOGY, NINTH EDITION Jane B. Reece, Lisa A. Urry, Michael L. Cain, Steven A. Wasserman, Peter V. Minorsky, Robert B. Jackson Chapter 26 Phylogeny and the Tree of Life

More information
