Daniel H. Huson Stockholm, May 28, 2005
Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license can be found at http://www.gnu.org/copyleft/fdl.html
1. Phylogenetic trees 2. Consensus networks and super networks 3. Hybridization and reticulate networks 4. Recombination networks
[Figure: overview of the types of phylogenetic networks. Labels: Phylogenetic trees; Splits networks; Reticulate networks; Other types of phylogenetic networks; Median networks; Consensus (super) networks; Split decomposition, Neighbor-net; Hybridization networks; Special case: Galled trees; Recombination networks; Ancestor recombination graphs; Augmented trees; Any graph representing evolutionary data. A second copy of the figure is annotated with the numbers 1 to 4.]
1. Phylogenetic trees
Ernst Haeckel, Tree of Life 1866
Let X = {x_1, ..., x_n} denote a set of taxa. A phylogenetic tree T (or X-tree) is given by labeling the leaves of a tree by the set X.
[Figure: taxa (Cow, Fin Whale, Blue Whale, Harbor Seal, Rat, Mouse, Chimp, Human, Gorilla) + tree = phylogenetic tree]
Unrooted tree: mathematically and algorithmically easier to deal with.
Rooted tree (rooted using Chicken as outgroup): biologically relevant, defines clades of related taxa.
Each branch e of a phylogenetic tree T may be scaled to represent r × t, the rate of evolution r times the time t along e.
[Figure: rooted tree with scale bar 0.01; labels: Chicken, Seal, Blue Whale, Mouse, Fin Whale, Seal, Cow, Rat, Chimp, Human, Gorilla, root]
Sequences evolve along a pre-given tree T, called the evolutionary tree, model tree or true tree.
Two types of events: mutations and speciation events.
[Figure: evolutionary tree drawn along a time axis, with the sequence of the common ancestor at the root, mutations along branches and speciation events at nodes]
Tree? Evolutionary tree
(Doolittle, 2000)
2. Consensus networks and super networks
Also allow gene duplication and loss.
[Figure: a gene tree with a gene duplication, embedded in a species tree on the taxa x_1, x_2, x_3]
Differing gene trees give rise to mosaic sequences.
[Figure: mosaic sequence composed of Gene A, Gene B, Gene C, Gene D]
For a given set of species, different genes lead to different trees.
How to form a consensus of the trees?
Consensus trees
Consensus networks
Consensus super networks
Every edge of a tree defines a split of the taxon set X.
[Figure: tree on the taxa x_1, ..., x_8 with a marked edge e, which defines the split {x_1, x_3, x_4, x_6, x_7} vs {x_2, x_5, x_8}]
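The correspondence between edges and splits can be sketched in a few lines of Python. This is only an illustration: the adjacency-list representation, the function name tree_splits and the five-taxon example tree are assumptions made here, not taken from the slides.

```python
from typing import Dict, FrozenSet, List, Set

Split = FrozenSet[FrozenSet[str]]  # an unordered pair {A, B}


def tree_splits(adj: Dict[str, List[str]], taxa: Set[str]) -> Set[Split]:
    """Return the split A|B of the taxon set defined by each edge of an unrooted tree."""

    def taxa_on_side(start: str, blocked: str) -> FrozenSet[str]:
        # Collect the taxa reachable from `start` without crossing the edge to `blocked`.
        seen, stack, found = {start, blocked}, [start], set()
        while stack:
            node = stack.pop()
            if node in taxa:
                found.add(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return frozenset(found)

    all_taxa = frozenset(taxa)
    splits = set()
    for u in adj:
        for v in adj[u]:
            side = taxa_on_side(u, v)
            # Each edge is visited in both directions; the set removes the duplicate.
            splits.add(frozenset({side, all_taxa - side}))
    return splits


# Hypothetical tree on five taxa a..e with internal nodes u, v, w.
adj = {
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["w"], "e": ["w"],
    "u": ["a", "b", "v"], "v": ["u", "c", "w"], "w": ["v", "d", "e"],
}
splits = tree_splits(adj, {"a", "b", "c", "d", "e"})
nontrivial = [s for s in splits if min(len(side) for side in s) > 1]
print(len(splits), len(nontrivial))  # prints: 7 2
```

The example tree has seven edges and hence seven splits: five trivial ones (a single taxon against the rest) and the two non-trivial splits {a, b} vs {c, d, e} and {a, b, c} vs {d, e}.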
Tree T and its split encoding Σ(T): 5 trivial splits and 2 non-trivial splits.
[Figure: the tree T with its trivial and non-trivial splits listed]
Two splits A_1|B_1 and A_2|B_2 of X are compatible if at least one of the four intersections is empty:
∅ ∈ {A_1 ∩ A_2, A_1 ∩ B_2, B_1 ∩ A_2, B_1 ∩ B_2}
[Figure: two compatible splits A_1|B_1 and A_2|B_2 drawn on the taxa x_1, ..., x_9 of X]
Two splits A_1|B_1 and A_2|B_2 of X are incompatible if all four intersections A_1 ∩ A_2, A_1 ∩ B_2, B_1 ∩ A_2 and B_1 ∩ B_2 are non-empty.
[Figure: two splits A_1|B_1 and A_2|B_2 drawn on the taxa x_1, ..., x_7 of X]
Consider the following two trees T_1 and T_2, for which the splits are incompatible:
[Figure: T_1 + T_2 → SN(Σ)]
The splits network SN(Σ) represents the incompatible set of splits Σ := Σ(T_1) ∪ Σ(T_2), using bands of parallel edges for incompatible splits.
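Given trees T_1, ..., T_k, let Σ_all denote the set of all splits occurring in the trees. Define
Σ(p) := {S ∈ Σ_all : |{i : S ∈ Σ(T_i)}| > pk}
Strict consensus: Σ_strict = Σ*(1/1), the splits contained in all k trees
Majority consensus: Σ_maj = Σ(1/2)
In general, Σ(1/(d+1)) defines a set of consensus splits for d ≥ 0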
Six gene trees:
Σ(1/2): majority consensus: splits contained in more than 50% of trees
Σ(1/6): splits contained in more than one tree
Σ(0): splits contained in at least one tree
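The threshold sets Σ(p) can be computed by counting in how many trees each split occurs. The following is a minimal sketch; it assumes each tree is given as a set of splits in a canonical form (an unordered pair of frozensets), and the three small trees are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def consensus_splits(tree_split_sets, p):
    """Return the splits contained in more than a fraction p of the k trees."""
    k = len(tree_split_sets)
    counts = Counter(s for splits in tree_split_sets for s in set(splits))
    return {s for s, c in counts.items() if c > p * k}


def split(a, b):
    # Canonical form: an unordered pair of frozensets.
    return frozenset({frozenset(a), frozenset(b)})


# Three hypothetical trees on the taxa a, b, c, d (non-trivial splits only).
ab_cd, ac_bd = split("ab", "cd"), split("ac", "bd")
trees = [{ab_cd}, {ab_cd}, {ac_bd}]

majority = consensus_splits(trees, Fraction(1, 2))  # only ab|cd (2 of 3 trees)
everything = consensus_splits(trees, 0)             # both splits
```

Using Fraction for p avoids floating-point rounding at thresholds such as 1/6. The strict consensus needs "at least" instead of "more than", which this sketch does not cover.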
Partial trees for five plant genes → super network
Joint work with Kim McBreen and Pete Lockhart
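Idea: Extend partial splits.
Z-rule: given two partial splits A_1|B_1 and A_2|B_2 with A_1 ∩ A_2 ≠ ∅, A_2 ∩ B_1 ≠ ∅, B_1 ∩ B_2 ≠ ∅ and A_1 ∩ B_2 = ∅, replace them by
A_1 | (B_1 ∪ B_2) and (A_1 ∪ A_2) | B_2.
Repeatedly apply to completion. Return all full splits.
[Huson, Dezulian, Kloepper and Steel, 2004]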
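A naive version of this closure can be sketched as follows. It assumes the Z-rule in the form: if A_1 ∩ A_2, A_2 ∩ B_1 and B_1 ∩ B_2 are non-empty and A_1 ∩ B_2 is empty, replace the two partial splits by A_1 | (B_1 ∪ B_2) and (A_1 ∪ A_2) | B_2. The function name, the data layout and the example are illustrative, and no attempt is made to match the efficiency of the published method.

```python
from itertools import permutations


def z_closure(partial_splits, taxa):
    """Apply the Z-rule until nothing changes, then return the full splits."""
    splits = [[frozenset(a), frozenset(b)] for a, b in partial_splits]
    changed = True
    while changed:
        changed = False
        for i, j in permutations(range(len(splits)), 2):
            for fi in (0, 1):        # orientation of split i
                for fj in (0, 1):    # orientation of split j
                    a1, b1 = splits[i][fi], splits[i][1 - fi]
                    a2, b2 = splits[j][fj], splits[j][1 - fj]
                    if (a1 & a2) and (a2 & b1) and (b1 & b2) and not (a1 & b2):
                        new_b1, new_a2 = b1 | b2, a1 | a2
                        if new_b1 != b1 or new_a2 != a2:
                            splits[i][1 - fi] = new_b1
                            splits[j][fj] = new_a2
                            changed = True
    everything = frozenset(taxa)
    # A split is full if its two sides together cover the whole taxon set.
    return {frozenset({a, b}) for a, b in splits if a | b == everything}


# Two hypothetical partial splits on overlapping taxon sets.
partial = [("ab", "cd"), ("bc", "de")]
full = z_closure(partial, "abcde")
```

Here ab|cd and bc|de are extended to the two full splits ab|cde and abc|de. Sides only ever grow, so the loop terminates.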
Five fungal trees from [Pryor, 2000] and [Pryor, 2003].
Trees: ITS (two trees), SSU (two trees), Gpd (one tree).
Numbers of taxa differ: partial trees.
ITS00 46 taxa
ITS03 40 taxa
SSU00 29 taxa
SSU03 40 taxa
Gpd03 40 taxa
Z-closure: a fast super-network method
ITS00+ ITS03
ITS03+ SSU00
ITS00+ ITS00+ SSU03
ITS00+ ITS03+ SSU03+ Gpd03
ITS00+ ITS03+ SSU00+ SSU03+ Gpd03
3. Hybridization and reticulate networks
Hybridization occurs when two organisms from different species interbreed and combine their genomes.
[Figure: water hemp, hybrid, pigweed. Images copyright 2003 University of Illinois]
There are a number of known mechanisms by which bacteria can exchange genes: transformation, conjugation and transduction.
http://www.pitt.edu/~heh1/research.html
[Figure sequence: a reticulate network on the taxa b_1, a, h, c, b_3 with the two alternative reticulation edges P and Q, starting from an ancestral genome. The tree for gene g_1 uses P: the g_1-tree is the P-variant. The tree for gene g_2 uses Q: the g_2-tree is the Q-variant.]
The evolutionary history associated with any given gene is a tree.
A network N with k reticulations gives rise to up to 2^k different gene trees.
[Figure: the network N on the taxa b_1, a, h, c, b_3 with reticulation edges P and Q, flanked by the induced P-tree and Q-tree]
Note, however, that the two choices P_i and Q_i can lead to the same tree topology. Here, both induced trees are of the form ((a,h),(b,c)).
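The 2^k choices can be enumerated directly. The sketch below assumes the network is given as a dictionary from each node to its list of children, and it describes each induced tree by its set of clusters (the leaf sets below the nodes), so that equal topologies compare equal. All names and the one-reticulation example are hypothetical.

```python
from itertools import product


def induced_trees(children, root):
    """Return the cluster set of the tree induced by each choice of reticulation edges."""
    parents = {}
    for u, kids in children.items():
        for v in kids:
            parents.setdefault(v, []).append(u)
    retics = sorted(v for v, ps in parents.items() if len(ps) == 2)

    def clusters(dropped):
        found = set()

        def leaves_below(u):
            if not children.get(u):  # a leaf of the network
                below = frozenset({u})
            else:
                kids = [v for v in children[u] if (u, v) not in dropped]
                below = frozenset().union(*(leaves_below(v) for v in kids))
            if below:
                found.add(below)
            return below

        leaves_below(root)
        return frozenset(found)

    trees = []
    for choice in product((0, 1), repeat=len(retics)):
        # For each reticulation node keep one incoming edge and drop the other.
        dropped = {(parents[r][1 - c], r) for r, c in zip(retics, choice)}
        trees.append(clusters(dropped))
    return trees


# Hypothetical network: the reticulation node r (with leaf h) has parents x and y.
children = {"root": ["x", "y"], "x": ["a", "r"], "y": ["r", "b"], "r": ["h"]}
trees = induced_trees(children, "root")
print(len(trees), len(set(trees)))  # prints: 2 2
```

In this example the two choices give the different trees ((a,h),b) and (a,(h,b)). When two choices induce the same topology, the number of distinct cluster sets is smaller than 2^k.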
Definition: Let X be a set of taxa. A rooted reticulate network N on X is a connected, directed acyclic graph with:
(i) precisely one node of indegree 0, the root;
(ii) all other nodes are tree nodes of indegree 1, or reticulation nodes of indegree 2;
(iii) every edge is a tree edge joining two tree nodes, or a reticulation edge from a tree node to a reticulation node; and
(iv) the set of leaves consists of tree nodes and is labeled by X.
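The conditions of the definition can be checked mechanically. The sketch below is one possible reading: it assumes the network is given as a list of directed edges, treats the root as a tree node, and treats an edge leaving a reticulation node as a tree edge. The function name and example are hypothetical. Connectivity is not tested separately, because in an acyclic graph with a single node of indegree 0 every node is reachable from that node.

```python
from collections import deque


def is_rooted_reticulate_network(edges, taxa):
    """Check the indegree, acyclicity, edge and leaf conditions of the definition."""
    nodes = {u for e in edges for u in e}
    indeg = {v: 0 for v in nodes}
    children = {v: [] for v in nodes}
    for u, v in edges:
        indeg[v] += 1
        children[u].append(v)

    roots = [v for v in nodes if indeg[v] == 0]
    if len(roots) != 1 or any(d > 2 for d in indeg.values()):
        return False
    retic = {v for v in nodes if indeg[v] == 2}

    # Kahn's algorithm: the graph is acyclic iff every node gets removed.
    remaining, queue, removed = dict(indeg), deque(roots), 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in children[u]:
            remaining[v] -= 1
            if remaining[v] == 0:
                queue.append(v)
    if removed != len(nodes):
        return False

    # A reticulation edge must start at a tree node.
    if any(u in retic and v in retic for u, v in edges):
        return False
    leaves = {v for v in nodes if not children[v]}
    return leaves == set(taxa) and not (leaves & retic)


# Hypothetical network with one reticulation node r above the leaf h.
edges = [("root", "x"), ("root", "y"), ("x", "a"), ("x", "r"),
         ("y", "r"), ("y", "b"), ("r", "h")]
print(is_rooted_reticulate_network(edges, {"a", "b", "h"}))  # True
```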
[Figure: a rooted reticulate network on the taxa a, b, c, d, e, f, g, h with reticulation nodes r_1, r_2, r_3 and the root marked]
Given a set of trees 𝒯 = {T_1, ..., T_m}, we want to determine the reticulate network N from which the trees were sampled, with 𝒯 = T(N). This form of the problem is not always solvable, e.g. if some of the 2^k possible trees are missing. Thus we consider the following:
Given a set of trees 𝒯, determine a reticulate network N such that 𝒯 ⊆ T(N) and N contains a minimum number of reticulation nodes. In full generality, this is known to be a computationally hard problem [Wang et al. 2001].
Reticulation nodes r_i, r_j ∈ N are independent if they are not contained in a common cycle.
[Figure: network with reticulation nodes r_1, r_2, r_3]
Independent reticulations are also called galls, and a network containing only galls is called a galled tree [Gusfield et al. 2003].
Observation [Maddison 1997]: If N contains only one reticulation r, then it corresponds to a sub-tree prune and regraft (SPR) operation.
[Figure: reticulate network N with reticulation r, and the corresponding SPR operation]
Given two bifurcating trees, compute their SPR distance d:
If d = 0, return the tree
If d = 1, return the reticulate network
Else, return fail
Generalized to networks with multiple independent reticulations [Nakhleh et al. 2004]
A new splits-based approach [Huson, Kloepper, Lockhart and Steel 2005]: gene tree 1 + gene tree 2 → splits network of all splits → reticulate network
Two reticulations: four different gene trees → all splits → reticulate network that induces all input trees
Input trees → all splits → reticulate network that induces all input trees
Each incompatibility component can be considered independently [Gusfield & Bansal, 2005] [Huson, Kloepper, Lockhart & Steel, 2005].
[Figure: splits network with two incompatibility components, labeled 1. component and 2. component]
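The incompatibility components are the connected components of the graph whose nodes are the splits and whose edges join incompatible pairs. A minimal sketch, assuming splits are stored as pairs (A, B) of frozensets and using hypothetical example data:

```python
def compatible(s, t):
    # Compatible if at least one of the four intersections is empty.
    return any(not (p & q) for p in s for q in t)


def incompatibility_components(splits):
    """Return the connected components of the incompatibility graph as sets of indices."""
    n = len(splits)
    adj = {i: [j for j in range(n)
               if j != i and not compatible(splits[i], splits[j])]
           for i in range(n)}
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        component, stack = set(), [start]
        while stack:
            u = stack.pop()
            component.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(component)
    return components


def split(a, b):
    return (frozenset(a), frozenset(b))


# Hypothetical splits on the taxa a..f: 0 and 1 conflict, and 2 and 3 conflict.
splits = [split("ab", "cdef"), split("bc", "adef"),
          split("ef", "abcd"), split("de", "abcf")]
components = incompatibility_components(splits)
print(components)  # [{0, 1}, {2, 3}]
```

A split that is compatible with all others forms a component of size one and needs no further analysis.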
Consider a component:
Find a decomposition of the taxa into R ∪ B, a set R of reticulate taxa and a set B of backbone taxa
Necessary condition: splits restricted to B must correspond to a tree
Consider all possible choices for R of size 1 [Gusfield et al., 2003, 2004]:
R = {t_7}: not a tree, R not good
R = {c}: not a tree, R not good
R = {t_5}: not a tree, R not good
Consider all possible choices for R of size 2:
R = {t_6, t_7}: not a tree, R not good
R = {b, c}: is a tree, R is a candidate
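The search over candidate sets R can be sketched as a brute-force enumeration. The code below tests only the necessary condition stated above, using the fact that a set of splits corresponds to a tree exactly when its splits are pairwise compatible. The function names and the small example are hypothetical, and the subsequent check of how the reticulation cycles overlap is not included.

```python
from itertools import combinations


def compatible(s, t):
    # Compatible if at least one of the four intersections is empty.
    return any(not (p & q) for p in s for q in t)


def restrict(split, keep):
    """Restrict a split (A, B) to the taxa in `keep`."""
    a, b = split
    return (a & keep, b & keep)


def candidate_reticulate_sets(splits, taxa, k):
    """Return every R with 1 <= |R| <= k whose backbone splits are pairwise compatible."""
    taxa = frozenset(taxa)
    candidates = []
    for size in range(1, k + 1):
        for chosen in combinations(sorted(taxa), size):
            backbone = taxa - frozenset(chosen)
            restricted = [restrict(s, backbone) for s in splits]
            # Drop splits that no longer separate anything on the backbone.
            restricted = [s for s in restricted if s[0] and s[1]]
            if all(compatible(s, t) for s, t in combinations(restricted, 2)):
                candidates.append(frozenset(chosen))
    return candidates


# Hypothetical component with two incompatible splits on the taxa a..e.
splits = [(frozenset("ab"), frozenset("cde")), (frozenset("bc"), frozenset("ade"))]
candidates = candidate_reticulate_sets(splits, "abcde", 1)
print([sorted(r) for r in candidates])  # [['a'], ['b'], ['c']]
```

The number of subsets tried grows as n^k for n taxa, which is consistent with a running time that is polynomial only for fixed k.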
For R={b,c}, check that the reticulation cycles overlap correctly along a path:
Modify splits network to represent reticulations:
Input: Set of trees, not necessarily bifurcating, can be partial trees
Parameter: k
Output: All reticulate networks N for which every incompatibility component can be explained by at most k overlapping reticulations
Complexity: polynomial for fixed k
New Zealand Ranunculus (buttercup) species: nuclear ITS region and chloroplast J_SA region
New Zealand Ranunculus (buttercup) species
[Figure: splits network for both genes (with the annotation "four splits here") and the resulting reticulate network]
This splits network suggests that R.nivicola may be a hybrid of the evolutionary lineages on the left- and right-hand sides.
Current algorithms are sensitive to false branches in the input trees, and here initially no reticulation is detected.
However, interactive removal of five confusing splits and one taxon leads to the detection of an appropriate reticulation.
4. Recombination networks
Recombination is studied in population genetics [24, 20, 16, 46, 47, 48], where ancestor recombination graphs (ARGs) are used for statistical purposes.
We will study the combinatorial aspects of chromosomal (meiotic) recombination and will consider recombination networks rather than ARGs. Simplifying assumptions: all sequences have a common ancestor, and any position can mutate at most once.
[Figure: recombination network with the leaf sequences
a: 100110 000000
b: 010101 000000
r: 001101 100000
c: 000000 110100
d: 000000 111010
o: 000000 000000 (outgroup)
The root is labeled 000000 000000; internal nodes carry the labels 000101 000000, 000100 000000, 000101 100000, 000000 110000 and 000000 100000.]
This leads to the following approach:
1. Determine the set of all input splits.
2. Determine the connected components of the incompatibility graph or splits network.
3. Analyze each component C separately: if C can be explained by a reticulate network N(C), then locally replace C by N(C).
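For binary sequences, the first step can be sketched as follows, assuming that every non-constant position defines one split (the taxa with state 1 against the taxa with state 0). The function name is hypothetical; the six sequences are the ones used as leaf labels in the example of this section.

```python
from collections import defaultdict


def character_splits(alignment):
    """Map each split (given by its state-1 side) to the positions that support it."""
    taxa = sorted(alignment)
    length = len(alignment[taxa[0]])
    positions = defaultdict(list)
    for pos in range(length):
        ones = frozenset(t for t in taxa if alignment[t][pos] == "1")
        if ones and len(ones) < len(taxa):  # skip constant positions
            positions[ones].append(pos + 1)  # positions are numbered from 1
    return dict(positions)


alignment = {
    "a": "100110000000",
    "b": "010101000000",
    "r": "001101100000",
    "c": "000000110100",
    "d": "000000111010",
    "o": "000000000000",
}
support = character_splits(alignment)
for side, pos in sorted(support.items(), key=lambda item: item[1]):
    print("".join(sorted(side)), pos)
```

Positions 1 and 5 support the same split, as do positions 9 and 11, and position 12 is constant. The splits {a, b, r} (position 4) and {c, d, r} (position 7) are incompatible, which is the conflicting signal that a recombination involving r would explain.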
For an alignment A of binary sequences of length n, a recombination network R is a reticulate network N, together with [7]:
(i) a labeling of all nodes by binary sequences of length n, such that the leaves of R are labeled by A,
(ii) a labeling of each tree edge e by the positions that mutate along e, and
(iii) a labeling of each reticulation node r determining the recombination at r.
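For a single-crossover recombination, the label of a reticulation node can be thought of as a crossover position j: the child takes positions 1..j from one parent and the remaining positions from the other. The sketch below lists all crossover positions that are consistent with given parent and child sequences. The function name is hypothetical, and the three sequences are node labels from the example figure, paired up here for illustration only.

```python
def single_crossover_points(p, q, child):
    """Return all j such that child = first j positions of p + remaining positions of q."""
    n = len(child)
    return [j for j in range(1, n)
            if child[:j] == p[:j] and child[j:] == q[j:]]


p = "000101000000"
q = "000000100000"
child = "000101100000"

print(single_crossover_points(p, q, child))  # [6]
print(single_crossover_points(q, p, child))  # []
```

Here the child is explained by taking positions 1 to 6 from p and positions 7 to 12 from q, but not by the reverse order of the parents.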
Note: the placement of mutations on edges is not uniquely defined. Here, the mutation at position 5 can happen along two different edges.
[Figure: two labelings of a recombination network on the sequences a: 101 010, r: 110 100, b: 000 101, which differ in the edge that carries the mutation at position 5; one of them is marked (a)]
Current algorithms [18, 30] place such ambiguous mutations outside of the reticulation cycle, as in (a).
Tree-based approach for computing galled trees [Gusfield et al. 2003]:
For each component: determine whether it can be explained by a gall. If so, arrange taxa in gall.
Return description of network
Splits-based approach for computing overlapping networks [Huson & Kloepper 2005]:
Determine a reticulate network as described above.
Compute the labeling of nodes and edges.
Labelling of splits network is easy to compute.
[Figure: splits network on the sequences o: 000000000000, a: 100110000000, b: 010101000000, r: 001101100000, c: 000000110100, d: 000000111010, with edges labeled by the mutating positions 1,5; 2; 3; 4; 6; 7; 8; 9,11; 10]
Copy labelling to recombination network.
[Figure: the corresponding recombination network with node and edge labels copied from the splits network]
Input: Presence (0) or absence (1) of a given restriction site in a 3.2kb region of variable chloroplast DNA in Pistacia [Parfitt & Badenes 1997]:
P. lentiscus       01110010100000000111000000010000
P. weinmannifolia  11001110100000010111000000010000
P. chinensis       01011000100000000111001100010000
P. integerrima     01011010100000000111001100010000
P. terebinthus     00011010000000001111101100010000
P. atlantica       01011011000000000111001100010000
P. mexicana        01011110100000010111010000010000
P. texana          01011110100000010111010000010000
P. khinjuk         01011010000000000111001100010000
P. vera            01011010000000000111001100010000
Schinus molle      01011010011111100000000011101111
Load this data into SplitsTree4 and select to obtain:
Combinatorially, this can be explained using only one single-crossover recombination:
However, recombination of chloroplast DNA is unlikely.
Input: Restriction maps of the rDNA cistron (length 10kb) of twelve species of mosquitoes using eight 6bp recognition restriction enzymes [Kumar et al., 1998]:
Aedes albopictus       11110101010100010101010010
Aedes aegypti          11110101000100010101000010
Aedes seatoi           11110101010100010101010000
Aedes flavopictus      11110101010100010101010010
Aedes alcasidi         11110101010100010101010000
Aedes katherinensis    11110101010100010101010000
Aedes polynesiensis    11110101000100010101010010
Aedes triseriatus      10110101000110010101000000
Aedes atropalpus       10110101000100010111000010
Aedes epactius         10110101000100010111000010
Haemagogus equinus     10110101000110010101010000
Armigeres subalbatus   10110101000100010101000000
Culex pipiens          11110111000100011101001011
Tripteroides bambusa   11110111000100010101000010
Sabethes cyaneus       11110101001100010101010000
Anopheles albimanus    11011101100101110101110100
This data set was analyzed using different tree-reconstruction methods with inconclusive results. The associated splits network (or median network [Bandelt et al., 1995] in this context), with edges labeled by the corresponding mutations:
[Figure: splits network of all sixteen taxa, rooted at Anopheles_albimanus, with edge labels 2; 7; 10; 11; 13; 19; 22; 25; 3,5,9,14-15,21,24; 17,23,26]
Recombination scenarios based on the complete data set look unconvincing. However, trial-and-error removal of two taxa, Aedes triseriatus and Armigeres subalbatus, gives rise to a simpler splits network:
[Figure: splits network of the remaining fourteen taxa, rooted at Anopheles albimanus, with edge labels 2; 7; 10; 11; 13; 19; 22; 25; 3,5,9,14-15,21,24; 17,23,26]
A possible recombination scenario is given by:
[Figure: recombination network of the fourteen taxa, rooted at Anopheles_albimanus, with edges labeled by the mutating positions]
Here, Haemagogus equinus appears to arise by a single-crossover recombination, and a second such recombination leads to A. albopictus and A. flavopictus.
19 restriction endonucleases were used to analyze patterns of cleavage site variation in the mtDNA of Zonotrichia. 7 taxa, 122 characters [Zink et al., 1991]
'Zonotrichia_querula'
11100011111110001111001111111110000111100011101010110001111100111111110001001111100111111011110011011111000111000
'Z._atricapilla'
11100011111100001100001111111100000111110011100010111001111100111111110001001111110111111011111011011011110011110
'Z._leucophrys'
11100011111100001100001111111100000111110011110010111001111100111111110001001111110111111011111011011011110011110
'Z._albicollis'
11100011111100001100001011111101000011101011100010111101111110111111111001001111100111110011110011011011010011010
'Z._capensis--Bolivia'
01110011100100001000001111111000110110000011101010111101111100110111100101101111100011111001110110111011000011000
'Z._capensis--Costa_rica'
11100111100101101000001111111000110110000011100010111101111100110111100101111111100011011001110110111011000011000
'J._hyemalis'
11101011100100011000110011111000001111000111101111110011100101010101110011001111000001100101110011011011001011001
Recombination of mtDNA???
The unrooted splits network for a dataset of restriction sites in the mtDNA of Zonotrichia:
The rooted splits network for the same data set, but suppressing all splits that are only supported by one site in the data:
Possible recombination scenario involving two non-independent reticulations:
Incompatible signals in gene trees can be usefully displayed using splits networks.
A reticulate network may be extracted by combinatorial analysis of individual components.
Implementations of many tree and network methods are available in