Splits and Phylogenetic Networks. Daniel H. Huson, Paris, June 21, 2005
Contents: 1. Phylogenetic trees 2. Splits networks 3. Consensus networks 4. Hybridization and reticulate networks 5. Recombination networks
Phylogenetic Networks
Bandelt (1991): a network displaying evolutionary relationships. Most broadly: any graph representing evolutionary data.
Overview diagram, with the part of the talk covering each type:
  Phylogenetic trees (Part 1)
  Splits networks (Part 2): median networks (from sequences); consensus (super) networks (from trees, Part 3); split decomposition and Neighbor-Net (from distances)
  Reticulate networks: hybridization networks (Part 4); recombination networks (Part 5); special case: galled trees; ancestor recombination graphs
  Other types of phylogenetic networks: augmented trees
Dan Gusfield's terms, as annotated on the diagram: "phylogenetic network", "generalized phylogenetic network", "more generalized phylogenetic network".
Part I: Phylogenetic trees
Phylogenetic Trees
Figure: Ernst Haeckel, Tree of Life, 1866.
Phylogenetic Trees
Let $X = \{x_1, \dots, x_n\}$ denote a set of taxa. A phylogenetic tree T (or X-tree) is given by labeling the leaves of a tree by the set X.
Figure: taxa (Cow, Fin Whale, Blue Whale, Harbor Seal, Rat, Mouse, Chimp, Human, Gorilla) + tree give a phylogenetic tree.
Unrooted vs Rooted Trees
Unrooted tree: mathematically and algorithmically easier to deal with.
Rooted tree (here rooted using Chicken as outgroup): biologically relevant, defines clades of related taxa.
A Simple Model of Evolution
Figure: an evolutionary tree drawn along a time axis, showing the sequence of the common ancestor, mutations along branches, and speciation events at nodes.
Tree Reconstruction Problem
Figure: the same sequences, with the evolutionary tree unknown ("Tree?").
Tree of Life
Based on 16S rRNA (Doolittle, 2000).
Part II: Splits networks
Splits Networks
A splits network represents incompatible signals in data, from:
  Sequences, e.g.: Median network (Bandelt et al 1994); Spectral analysis (Hendy and Penny 1993)
  Distances, e.g.: Split decomposition (Bandelt and Dress 1992); Neighbor-Net (Bryant and Moulton 2002)
  Trees, e.g.: Consensus network (Holland and Moulton 2003); Super network (H., Dezulian, Kloepper and Steel 2004); Bootstrap network (H., implemented in SplitsTree4)
Sequences to Splits Network
If characters have only 2 states and are not too conflicting: interpret columns as splits and draw the full splits network.
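A minimal sketch of "interpret columns as splits", assuming a binary alignment given as a dictionary from taxon name to a 0/1 string. The function name `columns_to_splits` and the toy alignment are hypothetical; the sketch only collects the splits and their column support and does not draw the network.

```python
from collections import Counter

def columns_to_splits(alignment):
    """Map each non-constant binary column to the split it induces.

    Returns a Counter: split -> number of columns supporting it, where a
    split is a frozenset holding the two sides (each side a frozenset).
    """
    taxa = frozenset(alignment)
    length = len(next(iter(alignment.values())))
    support = Counter()
    for i in range(length):
        ones = frozenset(t for t in taxa if alignment[t][i] == "1")
        zeros = taxa - ones
        if ones and zeros:  # constant columns induce no split
            support[frozenset({ones, zeros})] += 1
    return support

# Hypothetical toy alignment (not taken from the slides).
toy = {"a": "1101", "b": "1011", "c": "0010", "d": "0000"}
support = columns_to_splits(toy)
ab_cd = frozenset({frozenset("ab"), frozenset("cd")})
print(len(support), support[ab_cd])
```

Columns 1 and 4 of the toy alignment induce the same split, so that split is counted twice. The count can serve as an edge weight.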
Distances to Splits Network
Split decomposition or Neighbor-Net produces a network from distances.
Trees to Splits Network
A collection of trees can be represented by a consensus network or super network.
Bootstrap Network
Draw all splits that have a positive bootstrap score.
Bill Martin: Splits networks show which signals tree reconstruction methods are fighting over.
Split Decomposition & Bootstrap Network
Compare the result of Split Decomposition with an NJ tree and bootstrap network.
Figure panels (taxa A.mellifer, A.cerana, A.dorsata, A.koschev, A.andrenof, A.florea): Bio-NJ tree; bootstrap network; splits network obtained via the Split Decomposition.
Rooted Splits Networks
A splits network can be rooted, e.g. using an outgroup (Gambette and H., manuscript).
Better layout of splits networks
Philippe Gambette, currently visiting Tuebingen from ENS Cachan.
Part III: Consensus networks
Gene Trees Can Differ
Also allow gene duplication and loss.
Figure: gene tree versus species tree on taxa $x_1, x_2, x_3$, with a gene duplication giving gene copies A and B.
Gene Trees vs Species Trees
Differing gene trees give rise to mosaic sequences (figure: Gene A, Gene B, Gene C, Gene D).
Consensus of Different Gene Trees
For a given set of species, different genes lead to different trees.
How to form a consensus of the trees?
  Consensus trees
  Consensus networks
  Consensus super networks
The Splits of a Tree
Every edge of a tree defines a split of the taxon set X.
Figure: in a tree on $x_1, \dots, x_8$, the edge e defines the split $\{x_1, x_3, x_4, x_6, x_7\}$ vs $\{x_2, x_5, x_8\}$.
The Split Encoding of a Tree
Figure: a tree T on the taxa a, b, c, d, e and its split encoding Σ(T), consisting of 5 trivial splits and 2 non-trivial splits.
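A minimal sketch of computing a split encoding, assuming the tree is written as nested tuples with taxon names at the leaves. The five-taxon tree below is hypothetical (the topology of the slide's figure is not recoverable here), as are the function names.

```python
def leaf_sets(tree, out):
    """Return the leaf set below `tree` and record every subtree's leaf set in `out`."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.add(leaves)
    return leaves

def split_encoding(tree):
    """Each edge separates the leaves below it from all other leaves."""
    clusters = set()
    taxa = leaf_sets(tree, clusters)
    # The full taxon set belongs to the (virtual) root, not to an edge.
    return {frozenset({c, taxa - c}) for c in clusters if c != taxa}

# Hypothetical tree on a, b, c, d, e.
T = (("a", "b"), "c", ("d", "e"))
splits = split_encoding(T)
trivial = {s for s in splits if min(len(side) for side in s) == 1}
print(len(trivial), len(splits - trivial))
```

For this tree the sketch yields 5 trivial and 2 non-trivial splits, the same counts as on the slide.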
Compatibility
Two splits $A_1|B_1$ and $A_2|B_2$ of $X$ are compatible if $\emptyset \in \{A_1 \cap A_2,\ A_1 \cap B_2,\ B_1 \cap A_2,\ B_1 \cap B_2\}$, i.e. if at least one of the four intersections is empty.
Figure: two compatible splits of $X = \{x_1, \dots, x_9\}$.
Figure: two incompatible splits of $X = \{x_1, \dots, x_7\}$.
Consensus of Trees
Six gene trees (figure):
  Σ(1/2): majority consensus, splits contained in more than 50% of the trees
  Σ(1/6): splits contained in more than one tree
  Σ(0): splits contained in at least one tree
Example of a Super Network (Plants)
Figure: partial trees for five plant genes, and the resulting super network.
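A minimal sketch of the threshold sets Σ(p), assuming each tree is already given as the set of its splits. To keep the example short, the splits are stood in for by hypothetical labels `s1` to `s4`; `Fraction` is used so that the comparison with 1/6 is exact.

```python
from collections import Counter
from fractions import Fraction

def threshold_splits(trees, p):
    """Sigma(p): splits contained in more than a proportion p of the trees."""
    counts = Counter(s for splits in trees for s in set(splits))
    n = len(trees)
    return {s for s, c in counts.items() if c > p * n}

# Six hypothetical gene trees, each reduced to a set of split labels.
trees = [{"s1", "s2"}, {"s1", "s2"}, {"s1", "s2"},
         {"s1", "s3"}, {"s1", "s3"}, {"s1", "s4"}]

majority = threshold_splits(trees, Fraction(1, 2))    # in more than 3 trees
more_than_one = threshold_splits(trees, Fraction(1, 6))  # in more than 1 tree
all_splits = threshold_splits(trees, 0)               # in at least 1 tree
print(sorted(majority), sorted(more_than_one), sorted(all_splits))
```

Only Σ(1/2) is guaranteed to be pairwise compatible, so the lower thresholds generally need a network rather than a tree to display them.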
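Z-Closure Method
Idea: Extend partial splits.
Z-rule: $\frac{A_1}{B_1}, \frac{A_2}{B_2} \;\longrightarrow\; \frac{A_1}{B_1 \cup B_2}, \frac{A_1 \cup A_2}{B_2}$ (figure: the Z-shaped pattern of overlaps between $A_1$, $A_2$, $B_1$, $B_2$ under which the rule applies).
Repeatedly apply to completion.
Return all full splits.
[Huson, Dezulian, Kloepper and Steel, 2004]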
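A minimal sketch of the Z-closure loop. The applicability condition used here ($A_1 \cap A_2$, $A_2 \cap B_1$ and $B_1 \cap B_2$ non-empty, $A_1 \cap B_2$ empty) is my reading of the published Z-rule and is not spelled out in the slide text, so treat it as an assumption. Function names and the example are hypothetical, and no attempt is made at the efficiency of the SplitsTree4 implementation.

```python
from itertools import permutations

def make_split(a, b):
    return frozenset({frozenset(a), frozenset(b)})

def z_rule_results(s1, s2):
    """Yield the extended pair for every orientation in which the Z-rule applies."""
    for a1, b1 in permutations(s1):
        for a2, b2 in permutations(s2):
            if a1 & a2 and a2 & b1 and b1 & b2 and not (a1 & b2):
                yield make_split(a1, b1 | b2), make_split(a1 | a2, b2)

def z_closure(partial_splits, taxa):
    taxa = frozenset(taxa)
    splits = [make_split(a, b) for a, b in partial_splits]
    changed = True
    while changed:  # terminates: sides only grow and are bounded by taxa
        changed = False
        for i in range(len(splits)):
            for j in range(i + 1, len(splits)):
                for new_i, new_j in z_rule_results(splits[i], splits[j]):
                    if (new_i, new_j) != (splits[i], splits[j]):
                        splits[i], splits[j] = new_i, new_j
                        changed = True
                        break
    # Keep only full splits, i.e. those covering the whole taxon set.
    return {s for s in splits if frozenset().union(*s) == taxa}

# Two hypothetical partial splits on the taxa a, b, c, e.
full = z_closure([("a", "bc"), ("ab", "ce")], "abce")
print(len(full))
```

In the example the partial split a | bc is extended to the full split a | bce, while ab | ce is already full.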
Example
Five fungal trees from [Pryor, 2000] and [Pryor, 2003]:
  ITS (two trees)
  SSU (two trees)
  Gpd (one tree)
Numbers of taxa differ: partial trees.
Example 4beta25: User manual now available from www.splitstree.org!
Individual Gene Trees (one figure each):
  ITS00: 46 taxa
  ITS03: 40 taxa
  SSU00: 29 taxa
  SSU03: 40 taxa
  Gpd03: 40 taxa
Gene Trees as Super Network
Z-closure: a fast super-network method.
Figures, one super network per combination of gene trees:
  ITS00 + ITS03
  ITS03 + SSU00
  ITS00 + ITS00 + SSU03
  ITS00 + ITS03 + SSU03 + Gpd03
  ITS00 + ITS03 + SSU00 + SSU03 + Gpd03
Part IV: Hybridization and reticulate networks
Hybridization
Occurs when two organisms from different species interbreed and combine their genomes.
Figure: water hemp, hybrid, pigweed (Copyright 2003 University of Illinois).
Speciation by Hybridization 1
In allopolyploidization, two different lineages produce a new species that has the complete nuclear genomes of both parental species (figure: Linder et al. 2004).
Speciation by Hybridization 2
In diploid (or homoploid) hybrid speciation, each of the parents produces normal gametes (haploid) to produce a normal diploid hybrid (figure: Linder et al. 2004).
Horizontal Gene Transfer
There are a number of known mechanisms by which bacteria can exchange genes: transformation, conjugation, transduction.
Figure source: http://www.pitt.edu/~heh1/research.html
A Simple Model of Reticulate Evolution
Figure sequence (taxa b1, a, h, c, b3; edges labeled P and Q; ancestral genome at the root):
  Tree for gene g1: the g1-tree is the P-variant.
  Tree for gene g2: the g2-tree is the Q-variant.
Reticulate Networks and Trees
The evolutionary history associated with any given gene is a tree.
A network N with k reticulations gives rise to $2^k$ different gene trees.
Figure: network N on b1, a, h, c, b3 with edges P and Q, together with its P-tree and Q-tree.
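A minimal sketch of how k reticulations lead to $2^k$ gene trees: each reticulation node keeps exactly one of its two incoming edges. The network is given as a hypothetical parent map, loosely modeled on the slide's taxa b1, a, h, c, b3 (the internal node names are invented); trees are compared by their clusters.

```python
from itertools import product

def induced_trees(parents):
    """parents: node -> list of parents (1 for tree nodes, 2 for reticulation nodes).
    Yields one child -> parent map per choice of incoming edge at each reticulation."""
    reticulations = sorted(n for n, ps in parents.items() if len(ps) == 2)
    for choice in product(range(2), repeat=len(reticulations)):
        tree = {n: ps[0] for n, ps in parents.items()}
        for r, c in zip(reticulations, choice):
            tree[r] = parents[r][c]
        yield tree

def clusters(tree, leaves):
    """Leaf sets below each internal node, found by walking from every leaf to the root."""
    below = {}
    for leaf in leaves:
        node = leaf
        while node in tree:
            node = tree[node]
            below.setdefault(node, set()).add(leaf)
    return {frozenset(c) for c in below.values()}

# Hypothetical network: r is the reticulation node above h, with parents p and q.
network = {"x": ["root"], "y": ["root"], "b1": ["x"], "p": ["x"], "a": ["p"],
           "q": ["y"], "b3": ["y"], "c": ["q"], "r": ["p", "q"], "h": ["r"]}
leaves = ["b1", "a", "h", "c", "b3"]
trees = list(induced_trees(network))
p_tree, q_tree = (clusters(t, leaves) for t in trees)
print(len(trees))
```

In the first tree h groups with a and in the second with c. Distinct choices need not always give distinct trees, so $2^k$ is best read as the number of choices.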
Rooted Reticulate Network
Definition: Let X be a set of taxa. A rooted reticulate network N on X is a connected, directed acyclic graph with:
  precisely one node of indegree 0, the root,
  all other nodes being tree nodes of indegree 1 or reticulation nodes of indegree 2,
  every edge being a tree edge joining two tree nodes or a reticulation edge from a tree node to a reticulation node, and
  a set of leaves that consists of tree nodes and is labeled by X.
Figure: a rooted reticulate network on the taxa a to h, with root and reticulation nodes r1, r2, r3.
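A minimal sketch of checking part of this definition on an edge list. It tests the degree conditions, acyclicity and the leaf labeling only; the condition on edge types is deliberately not checked. The function name and the example network are hypothetical.

```python
from collections import defaultdict, deque

def is_rooted_reticulate_network(edges, leaf_labels, taxa):
    children, indegree, nodes = defaultdict(list), defaultdict(int), set()
    for u, v in edges:
        children[u].append(v)
        indegree[v] += 1
        nodes.update((u, v))
    roots = [n for n in nodes if indegree[n] == 0]
    if len(roots) != 1:
        return False
    if any(indegree[n] not in (1, 2) for n in nodes if n != roots[0]):
        return False
    # Kahn's algorithm: all nodes are processed only if the graph is acyclic.
    # With a single root this also shows every node is reachable from it.
    remaining = {n: indegree[n] for n in nodes}
    queue, visited = deque(roots), 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v in children[u]:
            remaining[v] -= 1
            if remaining[v] == 0:
                queue.append(v)
    if visited != len(nodes):
        return False
    leaves = {n for n in nodes if not children[n]}
    if any(indegree[n] != 1 for n in leaves):  # leaves must be tree nodes
        return False
    return leaves == set(leaf_labels) and sorted(leaf_labels.values()) == sorted(taxa)

# Hypothetical network with one reticulation node r above the leaf h.
edges = [("root", "x"), ("root", "y"), ("x", "b1"), ("x", "p"), ("p", "a"),
         ("y", "q"), ("y", "b3"), ("q", "c"), ("p", "r"), ("q", "r"), ("r", "h")]
labels = {n: n for n in ["b1", "a", "h", "c", "b3"]}
print(is_rooted_reticulate_network(edges, labels, labels.values()))
```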
Most Parsimonious Network
Problem: Given a set of trees T, determine a reticulate network N such that $T \subseteq T(N)$ and N contains a minimum number of reticulation nodes.
In full generality, this is known to be a computationally hard problem [Wang et al 2001, Bordewich and Semple 2004].
Independent Reticulations
Reticulation nodes $r_i, r_j \in N$ are independent if they are not contained in a common cycle (figure: reticulations r1, r2, r3).
Independent reticulations are also called galls, and a network containing only galls is also called a galled tree [Gusfield et al. 2003].
SPR's and Reticulations
Observation [Maddison 1997]: If N contains only one reticulation r, then it corresponds to a subtree prune and regraft (SPR) operation.
Figure: reticulate network N with reticulation r, and the corresponding SPR.
SPR-Based Algorithm
Given two bifurcating trees, compute their SPR distance:
  if the distance is 0, return the tree,
  if the distance is 1, return the reticulate network,
  else, return fail.
Generalized to networks with multiple independent reticulations [Nakhleh et al 2004].
Maximum agreement forest approach (Semple et al 2005).
Splits-Based Approach
A new splits-based approach [Huson, Kloepper, Lockhart and Steel 2005]:
Figure: gene tree 1 and gene tree 2, the splits network of all splits, and the resulting reticulate network.
Multiple Independent Reticulations
Figure: two reticulations; all splits; reticulate network that induces all input trees.
Multiple and Overlapping Reticulations
Figure: input trees; all splits; reticulate network that induces all input trees.
Decomposition Theorem
Each incompatibility component can be considered independently (figure: first component, second component).
Consider a component (figure).
Algorithm
Find a decomposition of the taxa into $R \cup B$, where R is a set of reticulate taxa and B is the set of backbone taxa.
Necessary condition: the splits restricted to B must correspond to a tree.
Algorithm
Consider all choices for R of size 1 [Gusfield et al., 2003, 2004].
Figures: for each single-taxon choice shown (among them R={c}), the splits restricted to B are not a tree, so R is not good.
Consider all choices for R of size 2 [H., Kloepper, Lockhart and Steel, 2005].
Figures: for one choice shown, the restricted splits are not a tree, so R is not good; for R={b,c} they are a tree, so R is a candidate.
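The search over choices of R can be sketched as follows, using the fact that a set of splits corresponds to a tree exactly when its splits are pairwise compatible. This covers only the necessary condition; a candidate still has to pass the later check on how the reticulation cycles overlap. Names and the example splits are hypothetical.

```python
from itertools import combinations

def compatible(s1, s2):
    return any(not (p & q) for p in s1 for q in s2)

def restricted_to_tree(splits, backbone):
    """Restrict every split to the backbone taxa and test pairwise compatibility."""
    restricted = [(a & backbone, b & backbone) for a, b in splits]
    proper = [s for s in restricted if s[0] and s[1]]  # drop splits with an empty side
    return all(compatible(s, t) for s, t in combinations(proper, 2))

def candidate_reticulate_sets(splits, taxa, k):
    """Yield every R with 1 <= |R| <= k whose removal leaves tree-like splits."""
    taxa = frozenset(taxa)
    for size in range(1, k + 1):
        for r in combinations(sorted(taxa), size):
            if restricted_to_tree(splits, taxa - frozenset(r)):
                yield r

def split(a, b):
    return (frozenset(a), frozenset(b))

# Hypothetical non-trivial splits of two gene trees on b1, a, h, c, b3:
# in one tree h groups with a, in the other with c.
splits = [split(["a", "h"], ["b1", "c", "b3"]), split(["b1", "a", "h"], ["c", "b3"]),
          split(["c", "h"], ["b1", "a", "b3"]), split(["b1", "a"], ["c", "h", "b3"])]
taxa = ["b1", "a", "h", "c", "b3"]
size_one = list(candidate_reticulate_sets(splits, taxa, 1))
print(size_one)
```

The number of subsets tried grows with the number of taxa to the power k, which fits the stated polynomial running time for fixed k.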
Check Candidate
For R={b,c}, check that the reticulation cycles overlap correctly along a path (figure).
Network Construction
Modify the splits network to represent reticulations (figure).
Splits-Based Algorithm
Input: set of trees T, not necessarily bifurcating; can be partial trees.
Parameter: k
Output: all reticulate networks N for which every incompatibility component can be explained by at most k overlapping reticulations.
Complexity: polynomial for fixed k.
Application to Real Data
New Zealand Ranunculus (buttercup) species:
  Nuclear ITS region
  Chloroplast J_SA region
Figure: splits network for both genes, and the reticulate network (annotation on the network: "four splits here").
This splits network suggests that R.nivicola may be a hybrid of the evolutionary lineages on the left- and right-hand sides.
Current algorithms are sensitive to false branches in the input trees, and here initially no reticulation is detected.
However, interactive removal of five confusing splits and one taxon leads to the detection of an appropriate reticulation.
Part V: Recombination networks
Recombination
Recombination is studied in population genetics [24, 20, 16, 46, 47, 48], and there ancestor recombination graphs (ARGs) are used for statistical purposes.
Chromosomal Recombination
We will study the combinatorial aspects of chromosomal (meiotic) recombination and will consider recombination networks rather than ARGs.
Simplifying assumptions: all sequences have a common ancestor, and any position can mutate at most once.
Example of a Recombination Network
Alignment A:
  a:        100110 000000
  b:        010101 000000
  r:        001101 100000
  c:        000000 110100
  d:        000000 111010
  outgroup: 000000 000000
Figure: recombination network with root 000000 000000, leaves a, b, r, c, d and outgroup, internal node sequences 000100 000000, 000101 000000, 000101 100000, 000000 100000 and 000000 110000, and edges labeled by the mutated positions (1,5; 2; 3; 4; 6; 7; 8; 9,11; 10).
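One way to see why this alignment needs a recombination is the standard four-gamete test, which is not part of the slides: if every position mutates at most once, two columns that show all four combinations 00, 01, 10 and 11 cannot both fit one tree. A minimal sketch applied to alignment A, with a hypothetical function name:

```python
from itertools import combinations

def four_gamete_conflicts(sequences):
    """Return the 1-based column pairs in which all four gametes occur."""
    length = len(sequences[0])
    conflicts = []
    for i, j in combinations(range(length), 2):
        gametes = {(s[i], s[j]) for s in sequences}
        if len(gametes) == 4:  # binary data: 00, 01, 10 and 11 all present
            conflicts.append((i + 1, j + 1))
    return conflicts

# Alignment A from the slide: a, b, r, c, d, outgroup (spaces removed).
alignment = ["100110000000", "010101000000", "001101100000",
             "000000110100", "000000111010", "000000000000"]
conflicts = four_gamete_conflicts(alignment)
print(conflicts)
```

The conflicting pairs are positions 4 and 7 and positions 6 and 7. Both involve r, which carries the mutations at 4 and 6 from the left half and the mutation at 7 from the right half.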
Recombination Network
Tree-based approach [Gusfield et al. 2003] for computing galled trees:
For each component:
  Determine whether removing one taxon produces a perfect phylogeny.
  If so, arrange taxa in gall.
Return description of network.
Recombination Network
Splits-based approach [Huson & Kloepper 2005] for computing overlapping networks:
  Determine a reticulate network as described above.
  Compute the labeling of nodes and edges.
Recombination Network
[Lyngsø, Song and Hein, to appear in WABI 2005]: branch and bound approach to computing unrestricted recombination networks.
Example 1, Data
Input: Presence (0) or absence (1) of a given restriction site in a 3.2kb region of variable chloroplast DNA in Pistacia [Parfitt & Badenes 1997]:
  P.lentiscus       01110010100000000111000000010000
  P.weinmannifolia  11001110100000010111000000010000
  P.chinensis       01011000100000000111001100010000
  P.integerrima     01011010100000000111001100010000
  P.terebinthus     00011010000000001111101100010000
  P.atlantica       01011011000000000111001100010000
  P.mexicana        01011110100000010111010000010000
  P.texana          01011110100000010111010000010000
  P.khinjuk         01011010000000000111001100010000
  P.vera            01011010000000000111001100010000
  Schinus molle     01011010011111100000000011101111
Example 1, Recombination Network
Load this data into SplitsTree4 and select RecombinationNetwork to obtain (figure):
Example 1, Single Crossover
Combinatorially, this can be explained using only one single-crossover recombination (figure):
Example 2, Data
Input: Restriction maps of the rDNA cistron (length 10kb) of twelve species of mosquitoes using eight 6bp recognition restriction enzymes [Kumar et al, 1998]:
  Aedes albopictus      11110101010100010101010010
  Aedes aegypti         11110101000100010101000010
  Aedes seatoi          11110101010100010101010000
  Aedes flavopictus     11110101010100010101010010
  Aedes alcasidi        11110101010100010101010000
  Aedes katherinensis   11110101010100010101010000
  Aedes polynesiensis   11110101000100010101010010
  Aedes triseriatus     10110101000110010101000000
  Aedes atropalpus      10110101000100010111000010
  Aedes epactius        10110101000100010111000010
  Haemagogus equinus    10110101000110010101010000
  Armigeres subalbatus  10110101000100010101000000
  Culex pipiens         11110111000100011101001011
  Tripteroides bambusa  11110111000100010101000010
  Sabethes cyaneus      11110101001100010101010000
  Anopheles albimanus   11011101100101110101110100
Example 2, Median Network
This data set was analyzed using different tree-reconstruction methods with inconclusive results. The associated splits network (or median network [Bandelt et al, 1995] in this context), with edges labeled by the corresponding mutations:
Figure: splits network on all 16 taxa, with the root at Anopheles_albimanus.
Example 2, Subset
Recombination scenarios based on the complete data set look unconvincing. However, trial-and-error removal of two taxa, Aedes triseriatus and Armigeres subalbatus, gives rise to a simpler splits network:
Figure: splits network on the remaining 14 taxa, with the root at Anopheles albimanus.
Example 2, Recombination Network
A possible recombination scenario is given by:
Figure: recombination network on the remaining 14 taxa, with edges labeled by mutations.
Here, Haemagogus equinus appears to arise by a single-crossover recombination, and a second such recombination leads to A.albopictus and A.flavopictus.
Example 3, Data
19 restriction endonucleases were used to analyze patterns of cleavage site variation in the mtDNA of Zonotrichia. 7 taxa, 122 characters [Zink et al, 1991]:
  Zonotrichia_querula     1110001111111000111100111111111000011110001110101011000111110011111111000100111110011111101111001101111
  _atricapilla            1110001111110000110000111111110000011111001110001011100111110011111111000100111111011111101111101101101
  _leucophrys             1110001111110000110000111111110000011111001111001011100111110011111111000100111111011111101111101101101
  _albicollis             1110001111110000110000101111110100001110101110001011110111111011111111100100111110011111001111001101101
  _capensis--Bolivia      0111001110010000100000111111100011011000001110101011110111110011011110010110111110001111100111011011101
  _capensis--Costa_Rica   1110011110010110100000111111100011011000001110001011110111110011011110010111111110001101100111011011101
  _hyemalis               1110101110010001100011001111100000111100011110111111001110010101010111001100111100000110010111001101101
However: recombination of mtDNA is unlikely.
Example 3, Median Network
The unrooted splits network for a dataset of restriction sites in the mtDNA of Zonotrichia (figure):
Example 3, Significant Differences
The rooted splits network for the same data set, but suppressing all splits that are only supported by one site in the data (figure):
Recombination Network
Possible recombination scenario involving two non-independent reticulations (figure):
Summary
  Incompatible signals in gene trees can be usefully displayed using splits networks.
  A reticulate network may be extracted by combinatorial analysis of individual components.
  Implementations of many tree and network methods are available in SplitsTree4.