Methods to reconstruct phylogene1c networks accoun1ng for ILS Céline Scornavacca some slides have been kindly provided by Fabio Pardi ISE-M, Equipe Phylogénie & Evolu1on Moléculaires Montpellier, France February the 2nd, 2017
Phylogene1c networks Trees displayed by a network In a phylogene1c network, a re1culate event is represented as a re1cula1on, where branches converge to give rise to a new lineage: a b c d e f
Phylogene1c networks Trees displayed by a network In a phylogene1c network, a re1culate event is represented as a re1cula1on, where branches converge to give rise to a new lineage: The genome(s) at the start of the new lineage is (are) a composi1on of those of the parent lineages. a b c d e f
Phylogene1c networks Trees displayed by a network In a phylogene1c network, a re1culate event is represented as a re1cula1on, where branches converge to give rise to a new lineage: The genome(s) at the start of the new lineage is (are) a composi1on of those of the parent lineages. The evolu1on of each part independently inherited is described by a gene tree a b c d e f
Phylogene1c networks Trees displayed by a network In a phylogene1c network, a re1culate event is represented as a re1cula1on, where branches converge to give rise to a new lineage: The genome(s) at the start of the new lineage is (are) a composi1on of those of the parent lineages. The evolu1on of each part independently inherited is described by a gene tree a b c d e f a b c d e f
Phylogene1c networks Trees displayed by a network In a phylogene1c network, a re1culate event is represented as a re1cula1on, where branches converge to give rise to a new lineage: The genome(s) at the start of the new lineage is (are) a composi1on of those of the parent lineages. The evolu1on of each part independently inherited is described by a gene tree a b c d e f a b c d e f a b c d e f
Phylogene1c networks Trees displayed by a network In a phylogene1c network, a re1culate event is represented as a re1cula1on, where branches converge to give rise to a new lineage: The genome(s) at the start of the new lineage is (are) a composi1on of those of the parent lineages. The evolu1on of each part independently inherited is described by a gene tree a b c d e f In the absence of deep coalescence and allopolyploidy, the gene trees are a subset of the trees displayed by the network a b c d e f a b c d e f
Phylogene1c networks Phylogene1c network inference N a b c d e f?
Phylogene1c networks Phylogene1c network inference An op1miza1on problem where a candidate network is evaluated on the basis of how well the trees it displays fit the data: N a b c d e f Many possible formula4ons: a b c d e f a b c d e f Data: Trees with 3 taxa: (inferred from other data) a b c Goal: Find the network N with the lower hybridiza1on number such that the triplets are `consistent with one of the trees displayed by N subject to constraints on the complexity of N c f a d e f d f b a c d...
Phylogene1c networks Phylogene1c network inference An op1miza1on problem where a candidate network is evaluated on the basis of how well the trees it displays fit the data: N a b c d e f Many possible formula4ons: a b c d e f a b c d e f Data: Any trees on the same taxa: (inferred from other data) Goal: Find the network N with the lower hybridiza1on number such that the input trees are `consistent with one of the trees displayed by N subject to constraints on the complexity of N a c d e f c f a b d e f...
Phylogene1c networks Phylogene1c network inference An op1miza1on problem where a candidate network is evaluated on the basis of how well the trees it displays fit the data: N a b c d e f Many possible formula4ons: a b c d e f a b c d e f Data: Clusters of taxa: {a, b}, {d, e}, {d, e, f}, {a, b, c, d, e, f}, {e, f}, {c, d, e, f},... Goal: Find the network N with the lower hybridiza1on number such that the input clusters are `explained by one of the trees displayed by N subject to constraints on the complexity of N
Phylogene1c networks Phylogene1c network inference An op1miza1on problem where a candidate network is evaluated on the basis of how well the trees it displays fit the data: N a b c d e f Many possible formula4ons: a b c d e f a b c d e f Data: Sequence alignments: (typically given in blocks) T (N) : Goal: mx Find N that minimizes F (N A 1,A 2,...,A m )= min F (T A i) T 2T (N) i=1 subject to constraints on the complexity of N. F() is the parsimony score. Jin et al. Parsimony Score of Phylogene1c Networks: Hardness Results and a Linear-Time Heuris1c. TCCB. 2009....
Jin et al.maximum likelihood of phylogene1c networks. Bioinforma1cs 2006. Phylogene1c networks Phylogene1c network inference An op1miza1on problem where a candidate network is evaluated on the basis of how well the trees it displays fit the data: N p 1-p a b c d e f T (N) : Many possible formula4ons: a b c d e f a b c d e f Data: Sequence alignments: (typically given in blocks) Goal: Find N that maximises 0 1 my my Pr(A 1,A 2,...,A m N) = Pr(A i N) = @ X Pr(A i T )Pr(T N) A... i=1 i=1 T 2T (N)
Jin et al.maximum likelihood of phylogene1c networks. Bioinforma1cs 2006. Phylogene1c networks Phylogene1c network inference An op1miza1on problem where a candidate network is evaluated on the basis of how well the trees it displays fit the data: h^ps://phylomeme1c.wordpress.com/2015/04/17/phylodag/ h^p://old-bioinfo.cs.rice.edu/nepal/ T (N) : p 1-p N a b c d e f Many possible formula4ons: a b c d e f a b c d e f Data: Sequence alignments: (typically given in blocks) Goal: Find N that maximises 0 1 my my Pr(A 1,A 2,...,A m N) = Pr(A i N) = @ X Pr(A i T )Pr(T N) A... i=1 i=1 T 2T (N)
Problems Some issues Searching the space of phylogene1c networks The space of networks with k re1cula1ons is infinite. Controlling for Model Complexity Because any network with k re1cula1ons provides a more complex model than any network with (k-1) re1cula1ons, we must handle the model selec1on problem (AIC, BIC, K-fold cross-valida1on, ). Iden1fiability issues Pr(A 1,A 2,...,A m N) = my Pr(A i N) = i=1 0 my @ Not accoun1ng for ILS and allopolyploidy i=1 X T 2T (N) 1 Pr(A i T )Pr(T N) A
Iden1fiability problems Different networks can display the same trees Some networks display exactly the same trees: N 1 N 2 d d a b c a b c T 1 T 2 T 3 d d d a b c a b c a b c
Iden1fiability problems Different networks can display the same trees Some networks display exactly the same trees: N 1 N 2 Because N 1 and N 2 display the same trees, they are equally good to any of the inference methods we saw no ma9er the input data T 1 a d d b c a b T 2 T 3 d d d c a b c a b c a b c (Recall that a network is evaluated on the basis of how well the trees it displays fit the data) Data
Iden1fiability problems Different networks can display the same trees Some networks display exactly the same trees: N 1 N 2 Because N 1 and N 2 display the same trees, they are equally good to any of the inference methods we saw no ma9er the input data T 1 a d d d b c a b T 2 T 3 UNIDENTIFIABILITY d d c a b c a b c a b c
Iden1fiability problems Indis1nguishable networks Branch lengths do not eliminate noniden1fiability 5.2.05 5.05.05.05 N 1 N 2 5 2.07 5.05.03 5.05.05 a b c d e a b c d e.2 5.2.2.25.2.05.05 a b c d e.35.2.2 5.05.05 a b c d e.35.25.05 5.05 a d b c e N 1 and N 2 display the same trees (i.e. including branch lengths) and are thus indis4nguishable even to methods accoun1ng for lengths
Iden1fiability problems Indis1nguishable networks Branch lengths do not eliminate noniden1fiability The same hold for inheritance probabili1es 5 p 1-p.2.05 5.05.05.05 N 1 N 2 5 2.07 5.05.03 5.05.05 a b c d e a b c d e.2 5.2.2.25.2.05.05 a b c d e.35.2.2 5.05.05 a b c d e.35.25.05 5.05 a d b c e N 1 and N 2 display the same trees (i.e. including branch lengths) and are thus indis4nguishable even to methods accoun1ng for lengths
Canonical networks Unzipping a network Key observa1on: we can move re1cula1ons up or down (un1l they hit a specia1on node) and the trees displayed by a network remain the same:.2 5.05.05.05.05 5 a b c d e N 1 5.2.05 Δ 5.05.05.05 a b c d e
Canonical networks Unzipping a network Key observa1on: we can move re1cula1ons up or down (un1l they hit a specia1on node) and the trees displayed by a network remain the same:.2 5.05.05.05.05 5 a b c d e N 1 5 5 N*.2.05 Δ 5.05.05.05 a b c d e.2.2.2 5 5.05.05 a b c d e
Canonical networks Unzipping a network Key observa1on: we can move re1cula1ons up or down (un1l they hit a specia1on node) and the trees displayed by a network remain the same: N 1 5.2.05 5.05.05.05 a b c d e N 2 5 2.07.2 5.05.03 5.05.05 a b c d e 5 5 N*.2.05 Δ 5.05.05.05 a b c d e.2.2.2 5 5.05.05 a b c d e
Canonical networks Unzipping a network Key observa1on: we can move re1cula1ons up or down (un1l they hit a specia1on node) and the trees displayed by a network remain the same: N 1 5.2.05 5.05.05.05 a b c d e N 2 5 2.07.2 5.05.03 5.05.05 a b c d e By moving re1cula1ons always down, N 1 and N 2 both end up becoming the same network. 5 N* True in general: indis1nguishable networks always transform into the same network the canonical form of N 1 and N 2..2.2.2 5 5.05.05 a b c d e
Canonical networks What it means for the evolu1onary biologist If N is reconstructed by a "classic" inference method, then even assuming perfect and unlimited data, the best you can hope is that the true phylogene1c network is just one of the many that are indis1nguishable from N 5.2.05 5.05.05.05 a b c d e N 5.2.2.2 5 5.05.05 a b c d e N* The canonical form of N is a unique representa1ve of the networks indis1nguishable from N, that excludes their unrecoverable aspects
Canonical networks Take home message for the computational biologist If N is reconstructed by a "classic" inference method, then even assuming perfect and unlimited data, the best you can hope is that the true phylogene1c network is just one of the many that are indis1nguishable from N Canonical networks Network space Classes of indistinguishable networks "Classic" network inference methods should only a^empt to reconstruct canonical networks.
Dealing with deep coalescence Are gene trees always displayed by the network? The approaches above assume that deep coalescence cannot occur, so gene trees are necessarily displayed trees, but this is not always the case Currently accepted introgression scenario among modern humans, Neanderthals and Denisovans, based on sequenced (ancient) nuclear genomes: Phylogene1c tree for the mitochondrial genome (Krause et al Nature 2010): 1.04 Myr (779 kyr 1.3 Myr) 1 466 kyr (321 618 kyr) 1 1 1 1 0.99 1 1 1 1 1 Afr. Mbuti 2 2 Afr. Mbuti 3 Afr. Hausa 4 Afr. San 5 Afr. Ibo 2 6 Afr. San 2 7 Afr. Mbenzele 8 Afr. Biaka 2 9 Afr. Mbenzele 2 10 Afr. Biaka 11 Afr. Kikuyu 12 Afr. Ibo 13 Afr. Mandenka 14 Afr. Effik 15 Afr. Effik 2 16 Afr. Ewondo 17 Afr. Lisongo 18 Afr. Yoruba 19 Afr. Bamileke 20 Afr. Yoruba 21 Asia Evenki 22 Asia Khirgiz 23 Asia Buriat 24 Am. Warao 2 25 Am. Warao 26 New Guinea Coast 27 Aust. Aborigine 3 Modern 28 Asia Chinese 29 Asia Japanese humans 30 Asia Siberian 31 Asia Japanese 32 Am. Guarani 33 Asia Indian 34 Afr. Mkamba 35 Asia Chukchi 36 New Guinea Coast 2 37 New Guinea High 38 New Guinea High 2 13 39 Eur. Italian 40 EMH Kostenki 14 41 Asia Uzbek 42 Asia Korean 43 Polynesia Samoan 44 Am. Piman 45 Eur. German 46 Asia Georgian 47 Asia Chinese 48 Eur. English 49 Eur. Saami 50 Eur. French 51 Asia Crimean Tatar 52 Eur. Dutch 53 rcrs 54 Aust. Aborigine 55 Aust. Aborigine 2 56 Mezmaiskaya 57 Vi336 58 Feldhofer 2 59 Neanderthals Sidron 1253 60 Vi33.25 61 Feldhofer 1 62 Denisova Denisova
ILS An example of incomplete lineage sor1ng (ILS) (a) Population view (b) Reconciliation representation
ILS ILS and the probability of gene trees A B C A B C A B C a b c b a c Rannal & Yang 2003 (Genealogies) Degnan & Salter 2005 (Topologies) c a b
ILS ILS and the probability of gene trees A B C A B C A B C a b c b a c c a b
ILS ILS and the probability of gene trees a b c d a c d b b c a d b d c a a b c d a b d c a d b c b c d a c d a b a c b d a c b d a d c b b d a c c d b a a d b c
ILS Es1ma1ng the species tree under the coalescent model Since standard phylogene1c methods can be inconsistent under ILS (Degnan & Rosenberg 2006), new methods have been developed to cope for this bias, eg: STEM (Kubatko, Carstens & Knowles 2009) STAR (Liu, Yu, Pearl & Edwards 2009) MP-EST (Liu, Yu & Edwards 2010) They es1mate a phylogene1c species tree given a sample of gene trees (with branch lengths or not) under the coalescent model.
ILS Es1ma1ng the species tree under the coalescent model Since standard phylogene1c methods can be inconsistent under ILS (Degnan & Rosenberg 2006), new methods have been developed to cope for this bias, eg: STEM (Kubatko, Carstens & Knowles 2009) STAR (Liu, Yu, Pearl & Edwards 2009) MP-EST (Liu, Yu & Edwards 2010) The same can be done with networks! A B C Yu et al. PNAS 2014
Phylogene1c networks + ILS Phylogene1c network inference under ILS N a b c d e f Pr(A 1,A 2,...,A m N) = my my X Pr(A i N) = Pr(A i T )Pr(T N). i=1 i=1 T 2T (N) Y Z?
Phylogene1c networks + ILS Phylogene1c network inference under ILS N a b c d e f my my X Pr(A 1,A 2,...,A m N) = Pr(A i N) = Pr(A i T )Pr(T N). i=1 i=1 T 2T (N) my Z Pr(A 1,A 2,...,A m N) = i=1 Pr(A 1,A 2,...,A m N) = G Pr(A i )p(g N). Y m i=1 p(g i N).?
Phylogene1c networks + ILS Phylogene1c network inference under ILS Y Yu & L Nakhleh's group : Yu et al. BMC Bioinforma1cs 2013 (without branch lengths) Yu et al. PNAS 2014 (with branch lengths) Wen el al. PLOS Gene1cs 2016 (Bayesian method) PhyloNet h^p://bioinfo.cs.rice.edu/phylonet Y Y X But also other groups contributed significantly, e.g. L Kubabtko's group and C Ane s group. Y Z Pr(A 1,A 2,...,A m N) = my p(g i N). i=1
Dealing with deep coalescence Are gene trees always displayed by the network? The approaches above assume that deep coalescence cannot occur, so gene trees are necessarily displayed trees, but this is not always the case: Currently accepted introgression scenario among modern humans, Neanderthals and Denisovans, based on sequenced (ancient) nuclear genomes:
Dealing with deep coalescence Are gene trees always displayed by the network? The approaches above assume that deep coalescence cannot occur, so gene trees are necessarily displayed trees, but this is not always the case: Currently accepted introgression scenario among modern humans, Neanderthals and Denisovans, based on sequenced (ancient) nuclear genomes: Phylogene1c tree for the mitochondrial genome (Krause et al Nature 2010): 1.04 Myr (779 kyr 1.3 Myr) 1 466 kyr (321 618 kyr) 1 1 1 1 0.99 1 1 1 1 1 Afr. Mbuti 2 2 Afr. Mbuti 3 Afr. Hausa 4 Afr. San 5 Afr. Ibo 2 6 Afr. San 2 7 Afr. Mbenzele 8 Afr. Biaka 2 9 Afr. Mbenzele 2 10 Afr. Biaka 11 Afr. Kikuyu 12 Afr. Ibo 13 Afr. Mandenka 14 Afr. Effik 15 Afr. Effik 2 16 Afr. Ewondo 17 Afr. Lisongo 18 Afr. Yoruba 19 Afr. Bamileke 20 Afr. Yoruba 21 Asia Evenki 22 Asia Khirgiz 23 Asia Buriat 24 Am. Warao 2 25 Am. Warao 26 New Guinea Coast 27 Aust. Aborigine 3 Modern 28 Asia Chinese 29 Asia Japanese humans 30 Asia Siberian 31 Asia Japanese 32 Am. Guarani 33 Asia Indian 34 Afr. Mkamba 35 Asia Chukchi 36 New Guinea Coast 2 37 New Guinea High 38 New Guinea High 2 13 39 Eur. Italian 40 EMH Kostenki 14 41 Asia Uzbek 42 Asia Korean 43 Polynesia Samoan 44 Am. Piman 45 Eur. German 46 Asia Georgian 47 Asia Chinese 48 Eur. English 49 Eur. Saami 50 Eur. French 51 Asia Crimean Tatar 52 Eur. Dutch 53 rcrs 54 Aust. Aborigine 55 Aust. Aborigine 2 56 Mezmaiskaya 57 Vi336 58 Feldhofer 2 59 Neanderthals Sidron 1253 60 Vi33.25 61 Feldhofer 1 62 Denisova Denisova
Dealing with deep coalescence Deep coalescence (ILS) can solve indis1nguishability The following two scenarios would be indis1nguishable on the basis of the trees they display: A B C A B C * *
Dealing with deep coalescence Deep coalescence (ILS) can solve indis1nguishability The following two scenarios would be indis1nguishable on the basis of the trees they display: Imagine sampling two alleles from the hybrid species * In absence of ILS the only observable gene trees are: A B C A B C * *
Dealing with deep coalescence Deep coalescence (ILS) can solve indis1nguishability The following two scenarios would be indis1nguishable on the basis of the trees they display: Imagine sampling two alleles from the hybrid species * If the two alleles from * do not coalesce, other three gene trees can be observed: A B C A B C * *
Dealing with deep coalescence Deep coalescence (ILS) can solve indis1nguishability The following two scenarios would be indis1nguishable on the basis of the trees they display: Imagine sampling two alleles If this is the from the hybrid species * true network If the two alleles from * do not coalesce, other three gene trees can be observed: A B C A B C * * mostly these
Dealing with deep coalescence Deep coalescence (ILS) can solve indis1nguishability The following two scenarios would be indis1nguishable on the basis of the trees they display: Imagine sampling two alleles If this is the from the hybrid species * true network If the two alleles from * do not coalesce, other three gene trees can be observed: A B C A B C * * mostly these
Dealing with deep coalescence Network Mul1-Species Coalescent (NMSC) Given a network with branch lengths (in coalescent units) and inheritance probabili1es, we can calculate the probability of every possible gene tree: Pr = ¼ Pr = ⅛ Pr = ⅛ Pr = ¼ Pr = ¼ ½ ½ ½ ½ 0 A * B C Pr = 0
Dealing with deep coalescence Network Mul1-Species Coalescent (NMSC) Given a network with branch lengths (in coalescent units) and inheritance probabili1es, we can calculate the probability of every possible gene tree: Pr = ¼ Pr = ⅛ Pr = ⅛ Pr = ¼ Pr = ¼ ½ ½ ½ ½ 0 A * B C The probability of a gene tree T is a weighted average of the probabili1es of T under a number of "parental trees" (14 here)
Dealing with deep coalescence Network Mul1-Species Coalescent (NMSC) Given a network with branch lengths (in coalescent units) and inheritance probabili1es, we can calculate the probability of every possible gene tree: Pr = ¼ Pr = ⅛ Pr = ⅛ Pr = ¼ Pr = ¼ ½ ½ ½ ½ 0 A * B C This may solve the iden1fiability issues for several prac1cal cases. Note that gene trees can fail to be displayed by the network also because of allopolyploidy.
Problems Some issues Searching the space of phylogene1c networks The space of networks with k re1cula1ons is infinite. Controlling for Model Complexity Because any network with k re1cula1ons provides a more complex model than any network with (k-1) re1cula1ons, we must handle the model selec1on problem (AIC, BIC, K-fold cross-valida1on, ). Iden1fiability issues Not accoun1ng for ILS and allopolyploidy
h^p://phylnet.univ-mlv.fr/ h^p://phylonetworks.blogspot.fr THANKS FOR YOUR ATTENTION