Phylogenetic invariants versus classical phylogenetics

Size: px
Start display at page:

Download "Phylogenetic invariants versus classical phylogenetics"

Transcription

1 Phylogenetic invariants versus classical phylogenetics Marta Casanellas Rius (joint work with Jesús Fernández-Sánchez) Departament de Matemàtica Aplicada I Universitat Politècnica de Catalunya Algebraic Statistics 2014 IIT, Chicago May 2014 M. Casanellas (UPC) casanellas Chicago / 55

2 Outline 1 Evolutionary Markov models 2 Phylogenetic invariants 3 Eriksson s method revisited M. Casanellas (UPC) casanellas Chicago / 55

3 Outline 1 Evolutionary Markov models 2 Phylogenetic invariants 3 Eriksson s method revisited M. Casanellas (UPC) casanellas Chicago / 55

4 Outline 1 Evolutionary Markov models 2 Phylogenetic invariants 3 Eriksson s method revisited M. Casanellas (UPC) casanellas Chicago / 55

5 Outline 1 Evolutionary Markov models 2 Phylogenetic invariants 3 Eriksson s method revisited M. Casanellas (UPC) casanellas Chicago / 55

6 Phylogenetic trees M. Casanellas (UPC) casanellas Chicago / 55

7 Phylogenetic reconstruction Given n current species, e.g. n = 3, HUMAN, GORILLA and CHIMP, we want to reconstruct their ancestral relationships (phylogeny): Input: DNA sequences (part of their genome) and an alignment M. Casanellas (UPC) casanellas Chicago / 55

8 Phylogenetic reconstruction Given n current species, e.g. n = 3, HUMAN, GORILLA and CHIMP, we want to reconstruct their ancestral relationships (phylogeny): Input: DNA sequences (part of their genome) and an alignment M. Casanellas (UPC) casanellas Chicago / 55

9 Phylogenetic reconstruction Given n current species, e.g. n = 3, HUMAN, GORILLA and CHIMP, we want to reconstruct their ancestral relationships (phylogeny): Input: DNA sequences (part of their genome) and an alignment AACTTCGAGGCTTACCGCTG... AAGGTCGATGCTCACCGATG... AACGTCTATGCTCACCGATG... M. Casanellas (UPC) casanellas Chicago / 55

10 Phylogenetic reconstruction goals Goals: Reconstruct the correct tree topology (topology of the graph with species labels at the leaves). E.g. 3 different tree topologies of unrooted trees on four leaves : Estimate branch lengths: the amount of substitutions per site that have occurred between the two species at the ends of the branch M. Casanellas (UPC) casanellas Chicago / 55

11 Modelling evolution Assume that all sites of the alignment evolve independently and according to the same process (i.i.d.). Model the evolution of z single site on a tree T. At each node of the tree T one puts a random variable taking values in {A, C, G, T} Variables at the leaves are observed and variables at the interior nodes are hidden. At each branch one writes a substitution matrix with the conditional probabilities of a nucleotide at the parent node being substituted by another at its child. S i = A C G T A C G T P(A A, i) P(C A, i) P(G A, i) P(T A, i) P(A C, i) P(C C, i) P(G C, i) P(T C, i) P(A G, i) P(C G, i) P(G G, i) P(T G, i) P(A T, i) P(C T, i) P(G T, i) P(T T, i) M. Casanellas (UPC) casanellas Chicago / 55

12 Modelling evolution Assume that all sites of the alignment evolve independently and according to the same process (i.i.d.). Model the evolution of z single site on a tree T. At each node of the tree T one puts a random variable taking values in {A, C, G, T} Variables at the leaves are observed and variables at the interior nodes are hidden. At each branch one writes a substitution matrix with the conditional probabilities of a nucleotide at the parent node being substituted by another at its child. S i = A C G T A C G T P(A A, i) P(C A, i) P(G A, i) P(T A, i) P(A C, i) P(C C, i) P(G C, i) P(T C, i) P(A G, i) P(C G, i) P(G G, i) P(T G, i) P(A T, i) P(C T, i) P(G T, i) P(T T, i) M. Casanellas (UPC) casanellas Chicago / 55

13 Modelling evolution Assume that all sites of the alignment evolve independently and according to the same process (i.i.d.). Model the evolution of z single site on a tree T. At each node of the tree T one puts a random variable taking values in {A, C, G, T} Variables at the leaves are observed and variables at the interior nodes are hidden. At each branch one writes a substitution matrix with the conditional probabilities of a nucleotide at the parent node being substituted by another at its child. S i = A C G T A C G T P(A A, i) P(C A, i) P(G A, i) P(T A, i) P(A C, i) P(C C, i) P(G C, i) P(T C, i) P(A G, i) P(C G, i) P(G G, i) P(T G, i) P(A T, i) P(C T, i) P(G T, i) P(T T, i) M. Casanellas (UPC) casanellas Chicago / 55

14 Evolutionary Markov models The entries of S i and the root distribution are unknown parameters. According to the shape of (or group of symmetries in S 4 acting on) S i one has different evolutionary models: general Markov model, strand symmetric model, Kimura 3-parameter, Kimura 2-parameters, Jukes-Cantor model. M. Casanellas (UPC) casanellas Chicago / 55

15 Evolutionary Markov models Jukes-Cantor model Root node has uniform distribution: p A =... = p T = 0.25 S i = A C G T A C G T a i b i b i b i b i a i b i b i b i b i a i b i b i b i b i a i, a i + 3b i = 1 only 2 different parameters (one for mutation and one for non-mutation), only 1 free parameter. M. Casanellas (UPC) casanellas Chicago / 55

16 Evolutionary Markov models Kimura 3-parameter model Root node has uniform distribution: p A =... = p T = 0.25 S i = A C G T A C G T ( ai b i c i d i b i a i d i c i c i d i a i b i d i c i b i a i ), a i + b i + c i + d i = 1 Kimura 2-parameter: b i = d i M. Casanellas (UPC) casanellas Chicago / 55

17 Evolutionary Markov models Strand Symmetric model Root node distribution satisfies: p A = p T, p C = p G S i = A C G T A C G T a i b i c i d i e i f i g i h i h i g i f i e i d i c i b i a i Strand symmetry in a DNA molecule: A T C G M. Casanellas (UPC) casanellas Chicago / 55

18 Evolutionary Markov models General Markov model Root node distribution: arbitrary S i = A C G T A C G T a i b i c i d i e i f i g i h i j i k i l i m i n i o i p i q i All these models are known as equivariant models: they are invariant by permutation of rows and columns of a certain subgroup of G = S 4 (permutations of A,C,G,T). M. Casanellas (UPC) casanellas Chicago / 55

19 Hidden Markov process p x1 x 2 x 3 = Prob(X 1 = x 1, X 2 = x 2, X 3 = x 3 ) = joint probability of the observed variables X 1, X 2, X 3. Then, p x1 x 2 x 3 = p yr S 1 (y r, x 1 )S 4 (y r, y 4 )S 2 (y 4, x 2 )S 3 (y 4, x 3 ) y 4,y r {A,C,G,T} p x1 x 2 x 3 is a homogeneous polynomial on the parameters (entries of the matrices S i and root distribution). M. Casanellas (UPC) casanellas Chicago / 55

20 An evolutionary model M (with d free parameters) on a tree topology T on n leaves defines a map ϕ M T : e E(T ) d 4n 1 ϕ M T : e E(T ) Cd+1 C 4n Phylogenetic variety V M (T ) = Imϕ T, closure in Zariski topology. Given an alignment, we can estimate the joint probability p x1 x 2...x n the relative frequency of column x 1 x 2... x n in the alignment. by In the theoretical model, this would be a point ˆp = (f x1 x 2...x n ) x1 x 2...x n on the variety V M (T ). In the real world, it is only close to V M (T ) for some tree topology T. M. Casanellas (UPC) casanellas Chicago / 55

21 An evolutionary model M (with d free parameters) on a tree topology T on n leaves defines a map ϕ M T : e E(T ) d 4n 1 ϕ M T : e E(T ) Cd+1 C 4n Phylogenetic variety V M (T ) = Imϕ T, closure in Zariski topology. Given an alignment, we can estimate the joint probability p x1 x 2...x n the relative frequency of column x 1 x 2... x n in the alignment. by In the theoretical model, this would be a point ˆp = (f x1 x 2...x n ) x1 x 2...x n on the variety V M (T ). In the real world, it is only close to V M (T ) for some tree topology T. M. Casanellas (UPC) casanellas Chicago / 55

22 An evolutionary model M (with d free parameters) on a tree topology T on n leaves defines a map ϕ M T : e E(T ) d 4n 1 ϕ M T : e E(T ) Cd+1 C 4n Phylogenetic variety V M (T ) = Imϕ T, closure in Zariski topology. Given an alignment, we can estimate the joint probability p x1 x 2...x n the relative frequency of column x 1 x 2... x n in the alignment. by In the theoretical model, this would be a point ˆp = (f x1 x 2...x n ) x1 x 2...x n on the variety V M (T ). In the real world, it is only close to V M (T ) for some tree topology T. Image(ϕ M T ) M. Casanellas (UPC) casanellas Chicago / 55

23 Phylogenetic mixtures Change the i.i.d hypothesis by all positions evolve independently and following the same evolutionary model. Same model M, same tree topology T = T 1 but different substitution parameters. Then, p = (p AA...A,..., p TT...T ) = λϕ M T 1 (S 1,..., S 4 ) + (1 λ)ϕ M T 1 (S 1,..., S 4) That is, p SecV M T 1 M. Casanellas (UPC) casanellas Chicago / 55

24 Phylogenetic mixtures Change the i.i.d hypothesis by all positions evolve independently and following the same evolutionary model. Same model M, different tree topologies and different substitution parameters. Then, p = (p AA...A,..., p TT...T ) = λϕ M T 1 (S 1,..., S 4 ) + (1 λ)ϕ M T 2 (S 1,..., S 4) That is, p V M T 1 V M T 2 (join of two varieties). M. Casanellas (UPC) casanellas Chicago / 55

25 Outline 1 Evolutionary Markov models 2 Phylogenetic invariants 3 Eriksson s method revisited M. Casanellas (UPC) casanellas Chicago / 55

26 Which generators of I (V M (T )) are useful for recovering the tree topology? Some generators depend on the model chosen (not on the tree topology). E.g. for Jukes-Cantor, n = 3: p x1 x 2 x 3 = 1 and p AAA p CCC, p AAA p GGG, p AAA p TTT, p AAC p AAG, p AAC p AAT, p CCA p CCG,..., p ACA p AGA, p ACA p ATA, p CAC p CGC,... p CAA p GAA, p CAA p TAA, p ACC p GCC,... For this model, the polynomials that detect the topology of the three species have degree 3. Definition (Cavender Felsenstein 87) The elements of I (V M (T )) are called invariants. They are called phylogenetic invariants if they distinguish between different tree topologies. M. Casanellas (UPC) casanellas Chicago / 55

27 Which generators of I (V M (T )) are useful for recovering the tree topology? Some generators depend on the model chosen (not on the tree topology). E.g. for Jukes-Cantor, n = 3: p x1 x 2 x 3 = 1 and p AAA p CCC, p AAA p GGG, p AAA p TTT, p AAC p AAG, p AAC p AAT, p CCA p CCG,..., p ACA p AGA, p ACA p ATA, p CAC p CGC,... p CAA p GAA, p CAA p TAA, p ACC p GCC,... For this model, the polynomials that detect the topology of the three species have degree 3. Definition (Cavender Felsenstein 87) The elements of I (V M (T )) are called invariants. They are called phylogenetic invariants if they distinguish between different tree topologies. M. Casanellas (UPC) casanellas Chicago / 55

28 Which generators of I (V M (T )) are useful for recovering the tree topology? Some generators depend on the model chosen (not on the tree topology). E.g. for Jukes-Cantor, n = 3: p x1 x 2 x 3 = 1 and p AAA p CCC, p AAA p GGG, p AAA p TTT, p AAC p AAG, p AAC p AAT, p CCA p CCG,..., p ACA p AGA, p ACA p ATA, p CAC p CGC,... p CAA p GAA, p CAA p TAA, p ACC p GCC,... For this model, the polynomials that detect the topology of the three species have degree 3. Definition (Cavender Felsenstein 87) The elements of I (V M (T )) are called invariants. They are called phylogenetic invariants if they distinguish between different tree topologies. M. Casanellas (UPC) casanellas Chicago / 55

29 Description of the ideal Computational algebra software: fails for 5 leaves. M. Casanellas (UPC) casanellas Chicago / 55

30 Description of the ideal Computational algebra software: fails for 5 leaves. Model M fixed. Theorem (Allman-Rhodes, Draisma-Kuttler, Sturmfels, Sullivant, C) There is an algorithm for obtaining the generators of the ideal of an n-leaved tree T from those of an unrooted tree of 3 leaves and the edges of the tree. I (V (T )) = I v + v Int(T ) e E(T ) I e I v I e M. Casanellas (UPC) casanellas Chicago / 55

31 Invariants from edges Example: M = general Markov model matrix obtained by flattening p = (p AAAA, p AAAC,..., p TTTT ) in 4 C 4 = ( 2 C 4 ) ( 2 C 4 ) : flat e p = states leaves 1, 2 states in leaves 3,4 I e := ideal generated by all 5 5 minors. p AAAA p AAAC p AAAG... p AATT p ACAA p ACAC p ACAG... p ACTT p AGAA p AGAC p AGAG... p AGTT p TTAA p TTAC p TTAG... p TTTT (Allman-Rhodes, Eriksson) I e are phylogenetic invariants for the general Markov model. M. Casanellas (UPC) casanellas Chicago / 55

32 Invariants from edges Example: M = general Markov model matrix obtained by flattening p = (p AAAA, p AAAC,..., p TTTT ) in 4 C 4 = ( 2 C 4 ) ( 2 C 4 ) : flat e p = states leaves 1, 2 states in leaves 3,4 I e := ideal generated by all 5 5 minors. p AAAA p AAAC p AAAG... p AATT p ACAA p ACAC p ACAG... p ACTT p AGAA p AGAC p AGAG... p AGTT p TTAA p TTAC p TTAG... p TTTT (Allman-Rhodes, Eriksson) I e are phylogenetic invariants for the general Markov model. M. Casanellas (UPC) casanellas Chicago / 55

33 Phylogenetic invariants We are just interested in phylogenetic invariants, i.e, generators that vanish in some tree topologies but not all. Ingredients: 1 n given species 2 evolutionary model M is fixed 3 want tree topology relating them We call T = {trivalent tree topologies on n labelled leaves} Assuming that our data p comes from a distribution on some tree T T under the model M, find the correct tree T 0. In algebraic geometry this is: assume p T T V M (T ) and find T 0 such that p V M (T 0 ). M. Casanellas (UPC) casanellas Chicago / 55

34 Phylogenetic invariants We are just interested in phylogenetic invariants, i.e, generators that vanish in some tree topologies but not all. Ingredients: 1 n given species 2 evolutionary model M is fixed 3 want tree topology relating them We call T = {trivalent tree topologies on n labelled leaves} Assuming that our data p comes from a distribution on some tree T T under the model M, find the correct tree T 0. In algebraic geometry this is: assume p T T V M (T ) and find T 0 such that p V M (T 0 ). M. Casanellas (UPC) casanellas Chicago / 55

35 Phylogenetic invariants We are just interested in phylogenetic invariants, i.e, generators that vanish in some tree topologies but not all. Ingredients: 1 n given species 2 evolutionary model M is fixed 3 want tree topology relating them We call T = {trivalent tree topologies on n labelled leaves} Assuming that our data p comes from a distribution on some tree T T under the model M, find the correct tree T 0. In algebraic geometry this is: assume p T T V M (T ) and find T 0 such that p V M (T 0 ). M. Casanellas (UPC) casanellas Chicago / 55

36 Phylogenetic invariants We are just interested in phylogenetic invariants, i.e, generators that vanish in some tree topologies but not all. Ingredients: 1 n given species 2 evolutionary model M is fixed 3 want tree topology relating them We call T = {trivalent tree topologies on n labelled leaves} Assuming that our data p comes from a distribution on some tree T T under the model M, find the correct tree T 0. In algebraic geometry this is: assume p T T V M (T ) and find T 0 such that p V M (T 0 ). M. Casanellas (UPC) casanellas Chicago / 55

37 Phylogenetic invariants Goal: for any particular tree T 0 T, how is the variety V M (T 0 ) defined inside T T V M (T )? (at least in a biologically meaningful open subset). M. Casanellas (UPC) casanellas Chicago / 55

38 Main Theorem Theorem Let M = Jukes-Cantor, Kimura (2 or 3 parameters), Strand Symmetric, General Markov model (or any equivariant model). For topology reconstruction purposes, it is enough to consider invariants from edges. More precisely, for each tree topology T T there exists a dense open subset U T V (T ) such that if a point p belongs to U T, then T T p V (T 0 ) p Z( e E(T 0 ) I e). Theorem (Buneman s Splits equivalence Theorem) A tree can be recovered by knowing all bipartitions induced from its edges. M. Casanellas (UPC) casanellas Chicago / 55

39 Key: Representation theory to include all models Draisma-Kuttler framework: representation theory Example: Jukes-Cantor model. Substitution matrices: S = A C G T A C G T a b b b b a b b b b a b b b b a Any permutation of rows and columns leaves the matrix invariant. Consider G = S 4, permutations of A,C,G,T. Consider the representation of G, ρ : G GL(W ), W =< A, C, G, T > C = C 4, corresponding to the action of G on 4 4-matrices. For instance, (A, C) ( M. Casanellas (UPC) casanellas Chicago / 55 )

40 New description of the parametrization We associate the vector space W = A, C, G, T C to each node in the tree and then each substitution matrix belongs to Hom G (W, W ). Therefore the substitution matrices lie in par G (T ) := e E(T ) Hom G (W, W ) and the map ϕ T parameterizing V M (T ) can be seen as ϕ T : par G (T ) ( n W ) G M. Casanellas (UPC) casanellas Chicago / 55

41 Other models: subgroups of S 4 G = (A, C)(G, T), (A, G)(C, T) = Z 2 Z 2 for Kimura 3-parameter model. G = (AT)(CG) for strand symmetric model. G = id for general Markov model. and in all cases W = A, C, G, T C and ρ : G GL(W ) is the permutation representation. M. Casanellas (UPC) casanellas Chicago / 55

42 Jukes-Cantor If G = S 4 (Jukes-Cantor), then G has 5 irreducible representations M 0,..., M 4. (Maschke): Any G-representation E decomposes into isotypic components, E = 4 i=0 M i C m i, and we associate to it a 5-tuple of multiplicities m(e) = (m 0,..., m 4 ). Our particular representation ρ : G GL(W ) for Jukes-Cantor model decomposes as W = M 0 M 3 and it has m(w ) = (1, 0, 0, 1, 0). M. Casanellas (UPC) casanellas Chicago / 55

43 Jukes-Cantor If G = S 4 (Jukes-Cantor), then G has 5 irreducible representations M 0,..., M 4. (Maschke): Any G-representation E decomposes into isotypic components, E = 4 i=0 M i C m i, and we associate to it a 5-tuple of multiplicities m(e) = (m 0,..., m 4 ). Our particular representation ρ : G GL(W ) for Jukes-Cantor model decomposes as W = M 0 M 3 and it has m(w ) = (1, 0, 0, 1, 0). M. Casanellas (UPC) casanellas Chicago / 55

44 Thin flattening Let L 1 L 2 be a bipartition of the set of leaves {1,..., n}, and denote l 1 = L 1, l 2 = L 2 Let p ( n W ) G. The matrix of the thin flattening of p along L 1 L 2, Tflat L1 L 2 p = (P 0,..., P 4 ), is obtained from p via the isomorphism ( l 1 W l 2 W ) G = HomG ( l 1 W, l 2 W ) = 4 Hom C (C m(l 1 W )i, C m(l 2 W )i ) (last isomorphism is due to Schur s Lemma). Write rktflat L1,L 2 p := (rk P 0,..., rk P 4 ). Rank conditions: rktflat L1,L 2 p m(w ) = (1, 0, 0, 1, 0) i=0 M. Casanellas (UPC) casanellas Chicago / 55

45 Example (Jukes-Cantor) n = 4, L 1 = {1, 2}, L 2 = {3, 4}. Hom G ( W 2, W 2 ) = 4 Hom C (C m( W 2 ) i, C m( W 2 ) i ) i=0 W W = M0 2 M 2 M3 3 M 4, hence m( 2 W ) = (2, 0, 1, 3, 1) and Tflat L1,L 2 p can be written as a block matrix in a certain basis: Tflat L1,L 2 p = rktflat L1,L 2 p m(w ) = (1, 0, 0, 1, 0). M. Casanellas (UPC) casanellas Chicago / 55

46 Invariants for mixtures on the same tree If p is a k-mixture distribution on a tree topology T, p = k i=1 λ iϕ M T (θi ), and A B is a bipartition induced by and edge of T, then rk(flat A B (p)) 4k (for the general Markov model, analogous for other equivariant models) M. Casanellas (UPC) casanellas Chicago / 55

47 Some results on simulations Phylogenetic invariants have been tested on simulated data with not so bad results Have been also tested in quartet-based methods (Rusinko-Hipp, Sumner et al.) But... there is no canonical way of using them. Model invariants have been used for model selection with successful results (+Kedzierska, Guigó) Invariants have been also useful for proving identifiability of certain models (Allman-Rhodes et al.) M. Casanellas (UPC) casanellas Chicago / 55

48 Some results on simulations Phylogenetic invariants have been tested on simulated data with not so bad results Have been also tested in quartet-based methods (Rusinko-Hipp, Sumner et al.) But... there is no canonical way of using them. Model invariants have been used for model selection with successful results (+Kedzierska, Guigó) Invariants have been also useful for proving identifiability of certain models (Allman-Rhodes et al.) M. Casanellas (UPC) casanellas Chicago / 55

49 Some results on simulations Phylogenetic invariants have been tested on simulated data with not so bad results Have been also tested in quartet-based methods (Rusinko-Hipp, Sumner et al.) But... there is no canonical way of using them. Model invariants have been used for model selection with successful results (+Kedzierska, Guigó) Invariants have been also useful for proving identifiability of certain models (Allman-Rhodes et al.) M. Casanellas (UPC) casanellas Chicago / 55

50 Some results on simulations Phylogenetic invariants have been tested on simulated data with not so bad results Have been also tested in quartet-based methods (Rusinko-Hipp, Sumner et al.) But... there is no canonical way of using them. Model invariants have been used for model selection with successful results (+Kedzierska, Guigó) Invariants have been also useful for proving identifiability of certain models (Allman-Rhodes et al.) M. Casanellas (UPC) casanellas Chicago / 55

51 Some results on simulations Phylogenetic invariants have been tested on simulated data with not so bad results Have been also tested in quartet-based methods (Rusinko-Hipp, Sumner et al.) But... there is no canonical way of using them. Model invariants have been used for model selection with successful results (+Kedzierska, Guigó) Invariants have been also useful for proving identifiability of certain models (Allman-Rhodes et al.) M. Casanellas (UPC) casanellas Chicago / 55

52 Are we ready for application to real data??? M. Casanellas (UPC) casanellas Chicago / 55

53 Outline 1 Evolutionary Markov models 2 Phylogenetic invariants 3 Eriksson s method revisited M. Casanellas (UPC) casanellas Chicago / 55

54 Singular value decomposition (SVD) Distance (in Frobenius or 2-norm) of an m n matrix M to R 4 = { m n matrices of rank 4} can be computed easily: SVD: M = UDV t with U M m m, D M m n, V M n n, U, V orthogonal, D almost diagonal D = σ 1 σ σ 1 σ 2 σ m 0 σ m Then min X R4 M X = M X 4 where X 4 = U σ 1 σ 2 σ 3 σ 4 and d 2 (M, R 4 ) = σ 5, d F (M, R 4 ) = i 5 σ2 i V t... M. Casanellas (UPC) casanellas Chicago / 55

55 Singular value decomposition (SVD) Distance (in Frobenius or 2-norm) of an m n matrix M to R 4 = { m n matrices of rank 4} can be computed easily: SVD: M = UDV t with U M m m, D M m n, V M n n, U, V orthogonal, D almost diagonal D = σ 1 σ σ 1 σ 2 σ m 0 σ m Then min X R4 M X = M X 4 where X 4 = U σ 1 σ 2 σ 3 σ 4 and d 2 (M, R 4 ) = σ 5, d F (M, R 4 ) = i 5 σ2 i V t... M. Casanellas (UPC) casanellas Chicago / 55

56 Eriksson s SVD method Use SVD to evaluate how close a bipartition A B is from being an edge split: sc(a B) = d F (flatt A B ( p), R 4 ). Find the best cherry in the tree by looking at all bipartitions of 2 taxa against others. Trait this cherry as an indivisible pair of taxa, and proceed recursively. Example: (1, 4), 5, (2, 3) M. Casanellas (UPC) casanellas Chicago / 55

57 Problems of Eriksson s method: 1 Low performance in the presence of long branches 2 Is it fair to compare bipartitions of different size with this score? 3 What happens when we have small sample size? M. Casanellas (UPC) casanellas Chicago / 55

58 Long branch attraction correct tree usually inferred tree M. Casanellas (UPC) casanellas Chicago / 55

59 Flattening matrix for the wrong topology AA AC CA CC CG GC GG GT TG TT + AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT M. Casanellas (UPC) casanellas Chicago / 55

60 Eriksson s revisited: Erik+2 Given a bipartition A B, consider the transition matrices M A B and M B A instead of the flattening matrix: M ( p) = p AAAA p AA++ p ACAA p AC++ p AGAA p AG++ p AAAC p AA++... p ACAC p AC++... p AGAC p AATT p AA++ p ACTT p AC++ p AGTT p AG++ p AG M 34 12( p) = These matrices have the same rank as the flattening. We redefine the score as: Erik+2: sc(a B) = (d F (M A B, R 4 ) + d F (M B A, R 4 ))/2. ( column sums =1 The lower the score is, the more it is likely that the bipartition comes from an edge of T. M. Casanellas (UPC) casanellas Chicago / 55

61 Assessing performance M. Casanellas (UPC) casanellas Chicago / 55

62 For each (a, b), simulate 100 alignments under topology For each method, represent the percentage of success in a gray scale: white=0%, black=100% success. M. Casanellas (UPC) casanellas Chicago / 55 Assessing performance Huelsenbeck 95: assesses the performance of different phylogenetic reconstruction methods on the following tree space.

63 Performance of Eriksson and Erik+2 M. Casanellas (UPC) casanellas Chicago / 55

64 Performance against other methods M. Casanellas (UPC) casanellas Chicago / 55

65 old invariants M. Casanellas (UPC) casanellas Chicago / 55

66 Erik+2 needs more data but... works under the most general model (under i.i.d. assumptions). Models difference: Continuous-time models: S i = exp(q i t i ), Parameters: Q i rate matrix, t i branch length. They assume local homogeneity:s i (t) = Q is i (t). Usually, also global-homogeneity is assumed: same rate-matrix Q throughout the tree, different t i for each edge i. (Algebraic) Markov models: the entries of the matrices S i are the parameters of the model. Therefore they allow for non-homogeneous data, both local and global (different mutation rates at different lineages of the tree and on a single edge). The results we ve seen so far were on globally homogeneous data. Now... M. Casanellas (UPC) casanellas Chicago / 55

67 Erik+2 needs more data but... works under the most general model (under i.i.d. assumptions). Models difference: Continuous-time models: S i = exp(q i t i ), Parameters: Q i rate matrix, t i branch length. They assume local homogeneity:s i (t) = Q is i (t). Usually, also global-homogeneity is assumed: same rate-matrix Q throughout the tree, different t i for each edge i. (Algebraic) Markov models: the entries of the matrices S i are the parameters of the model. Therefore they allow for non-homogeneous data, both local and global (different mutation rates at different lineages of the tree and on a single edge). The results we ve seen so far were on globally homogeneous data. Now... M. Casanellas (UPC) casanellas Chicago / 55

68 95% 95% 95% 95% Erik+2 Length=1000 Erik+2 Length= % mean= ; sd= % 95% % % 95% parameter b % 95% 95% % % 95% 95% % 0.2 mean= ; sd= NJ Length=1000 NJ Length= % mean= ; sd= % parameter b % 95% 95% 95% 95% % % 95% 95% 95% 95% 95% % 95% mean= ; sd= ML Length=1000 ML Length= mean= ; sd= mean= ; sd= parameter b % 95% % % % 95% % % M. Casanellas (UPC) casanellas Chicago / 55

69 Erik+2 is more robust: rates across sites M. Casanellas (UPC) casanellas Chicago / 55

70 Erik+2 on 2- mixtures Kolaczkowski, Thornton. Nature 2004 M. Casanellas (UPC) casanellas Chicago / 55

71 How about larger trees? Scores are not normalized, we cannot compare bipartitions of different size Eriksson s method involved comparison of different bipartitions Also, some penalization should be given: n = 6, requiring a matrix to have rank 4 or requiring a matrix having rank 4 is not the same. Indeed, codimension of R 4, : (42 4)(4 4 4) = 3024, codimension of R 4,4 3 43: 3600 M. Casanellas (UPC) casanellas Chicago / 55

72 How about larger trees? Scores are not normalized, we cannot compare bipartitions of different size Eriksson s method involved comparison of different bipartitions Also, some penalization should be given: n = 6, requiring a matrix to have rank 4 or requiring a matrix having rank 4 is not the same. Indeed, codimension of R 4, : (42 4)(4 4 4) = 3024, codimension of R 4,4 3 43: 3600 M. Casanellas (UPC) casanellas Chicago / 55

73 Normalizing the score But in the meanwhile... M. Casanellas (UPC) casanellas Chicago / 55

74 Normalizing the score But in the meanwhile... M. Casanellas (UPC) casanellas Chicago / 55

75 Weighted quartet-based methods Weight each quartet tree T 1 = 12 34, T 2 = 13 24, T 3 = according to the reconstruction method (weight reliability): Erik+2: w(t i ) = sc(t i ) 1 /(sc(t 1 ) 1 + sc(t 2 ) 1 + sc(t 3 ) 1 ) ML: w(t i ) = logl(t i ) 1 /(logl(t 1 ) 1 + logl(t 2 ) 1 + logl(t 3 ) 1 ) Barycentric coordinates: (w(t 1 ), w(t 2 ), w(t 3 )) M. Casanellas (UPC) casanellas Chicago / 55

76 Barycentric plots (w(t 1 ), w(t 2 ), w(t 3 )) M. Casanellas (UPC) casanellas Chicago / 55

77 Barycentric plots (w(t 1 ), w(t 2 ), w(t 3 )) M. Casanellas (UPC) casanellas Chicago / 55

78 Quartet-based methods (Weight optimization) Where do we put leaf 5? Gascuel-Ranwez 2001 M. Casanellas (UPC) casanellas Chicago / 55

79 Topology evaluated; b: internal branch length, leaf edges proportional to it M. Casanellas (UPC) casanellas Chicago / 55

80 Method Quartets b = b = b = 0.05 b = 0.1 Average QP Erik (100) 0.07 (100) 1.69 (100) 4.57 (100) 1.6 ML 9 (100) 9.08 (93) 9.65 (93) - (0) - NJ 9.7 (100) 9.65 (100) 9.91 (100) 9.91 (100) 9.79 WIL Erik (100) 0.07 (100) 0.33 (100) 1.52 (100) 0.48 ML 9.97 (100) 9.28 (93) 9.27 (93) - (0) - NJ 7.66 (100) 7.41 (100) 8.25 (100) 9.61 (100) 8.23 WO Erik (100) 0.06 (100) 0.2 (100) 1.66 (100) 0.48 ML (100) (93) (93) - (0) - NJ 9.46 (100) 9.22 (100) 9.2 (100) 9.73 (100) 9.4 Global NJ Table : Average Robinson-Foulds distance to the correct tree, alignment length 5000, GMM data. M. Casanellas (UPC) casanellas Chicago / 55

81 Method Quartets b = b = b = 0.05 b = 0.1 Average QP Erik+2 0 (100) 0 (100) 0.44 (100) 2.92 (100) 0.84 ML 9 (100) 9.11 (96) 9.6 (91) 9 (1) 9.18 NJ 9.69 (100) 9.97 (100) 9.92 (100) 9.95 (100) 9.88 WIL Erik+2 0 (100) 0.02 (100) 0.09 (100) 0.44 (100) 0.14 ML 9.33 (100) 9 (96) 8.98 (91) 9 (1) 9.08 NJ 7.46 (100) 7.44 (100) 8.03 (100) 9.48 (100) 8.1 WO Erik+2 0 (100) 0 (100) 0.08 (100) 0.51 (100) 0.15 ML (100) (96) (91) 13 (1) NJ 9.41 (100) 9.18 (100) 9.33 (100) 9.62 (100) 9.38 Global NJ Table : Average Robinson-Foulds distance to the correct tree, alignment length 10000, GMM data M. Casanellas (UPC) casanellas Chicago / 55

82 Open problems Algorithm for larger trees Incorporate Allman-Rhodes-Taylor semialgebraic conditions Look for closest rank 4 nonnegative matrix Statistical analysis of the method M. Casanellas (UPC) casanellas Chicago / 55

83 Open problems Algorithm for larger trees Incorporate Allman-Rhodes-Taylor semialgebraic conditions Look for closest rank 4 nonnegative matrix Statistical analysis of the method M. Casanellas (UPC) casanellas Chicago / 55

84 Thanks! M. Casanellas (UPC) casanellas Chicago / 55

Using algebraic geometry for phylogenetic reconstruction

Using algebraic geometry for phylogenetic reconstruction Using algebraic geometry for phylogenetic reconstruction Marta Casanellas i Rius (joint work with Jesús Fernández-Sánchez) Departament de Matemàtica Aplicada I Universitat Politècnica de Catalunya IMA

More information

Algebraic Statistics Tutorial I

Algebraic Statistics Tutorial I Algebraic Statistics Tutorial I Seth Sullivant North Carolina State University June 9, 2012 Seth Sullivant (NCSU) Algebraic Statistics June 9, 2012 1 / 34 Introduction to Algebraic Geometry Let R[p] =

More information

Phylogenetic Algebraic Geometry

Phylogenetic Algebraic Geometry Phylogenetic Algebraic Geometry Seth Sullivant North Carolina State University January 4, 2012 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 1 / 28 Phylogenetics Problem Given a

More information

Introduction to Algebraic Statistics

Introduction to Algebraic Statistics Introduction to Algebraic Statistics Seth Sullivant North Carolina State University January 5, 2017 Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 1 / 28 What is Algebraic Statistics? Observation

More information

Example: Hardy-Weinberg Equilibrium. Algebraic Statistics Tutorial I. Phylogenetics. Main Point of This Tutorial. Model-Based Phylogenetics

Example: Hardy-Weinberg Equilibrium. Algebraic Statistics Tutorial I. Phylogenetics. Main Point of This Tutorial. Model-Based Phylogenetics Example: Hardy-Weinberg Equilibrium Algebraic Statistics Tutorial I Seth Sullivant North Carolina State University July 22, 2012 Suppose a gene has two alleles, a and A. If allele a occurs in the population

More information

PHYLOGENETIC ALGEBRAIC GEOMETRY

PHYLOGENETIC ALGEBRAIC GEOMETRY PHYLOGENETIC ALGEBRAIC GEOMETRY NICHOLAS ERIKSSON, KRISTIAN RANESTAD, BERND STURMFELS, AND SETH SULLIVANT Abstract. Phylogenetic algebraic geometry is concerned with certain complex projective algebraic

More information

arxiv: v1 [q-bio.pe] 27 Oct 2011

arxiv: v1 [q-bio.pe] 27 Oct 2011 INVARIANT BASED QUARTET PUZZLING JOE RUSINKO AND BRIAN HIPP arxiv:1110.6194v1 [q-bio.pe] 27 Oct 2011 Abstract. Traditional Quartet Puzzling algorithms use maximum likelihood methods to reconstruct quartet

More information

The Generalized Neighbor Joining method

The Generalized Neighbor Joining method The Generalized Neighbor Joining method Ruriko Yoshida Dept. of Mathematics Duke University Joint work with Dan Levy and Lior Pachter www.math.duke.edu/ ruriko data mining 1 Challenge We would like to

More information

Species Tree Inference using SVDquartets

Species Tree Inference using SVDquartets Species Tree Inference using SVDquartets Laura Kubatko and Dave Swofford May 19, 2015 Laura Kubatko SVDquartets May 19, 2015 1 / 11 SVDquartets In this tutorial, we ll discuss several different data types:

More information

19 Tree Construction using Singular Value Decomposition

19 Tree Construction using Singular Value Decomposition 19 Tree Construction using Singular Value Decomposition Nicholas Eriksson We present a new, statistically consistent algorithm for phylogenetic tree construction that uses the algeraic theory of statistical

More information

Phylogenetic Tree Reconstruction

Phylogenetic Tree Reconstruction I519 Introduction to Bioinformatics, 2011 Phylogenetic Tree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Evolution theory Speciation Evolution of new organisms is driven

More information

Metric learning for phylogenetic invariants

Metric learning for phylogenetic invariants Metric learning for phylogenetic invariants Nicholas Eriksson nke@stanford.edu Department of Statistics, Stanford University, Stanford, CA 94305-4065 Yuan Yao yuany@math.stanford.edu Department of Mathematics,

More information

Jed Chou. April 13, 2015

Jed Chou. April 13, 2015 of of CS598 AGB April 13, 2015 Overview of 1 2 3 4 5 Competing Approaches of Two competing approaches to species tree inference: Summary methods: estimate a tree on each gene alignment then combine gene

More information

Lie Markov models. Jeremy Sumner. School of Physical Sciences University of Tasmania, Australia

Lie Markov models. Jeremy Sumner. School of Physical Sciences University of Tasmania, Australia Lie Markov models Jeremy Sumner School of Physical Sciences University of Tasmania, Australia Stochastic Modelling Meets Phylogenetics, UTAS, November 2015 Jeremy Sumner Lie Markov models 1 / 23 The theory

More information

ELIZABETH S. ALLMAN and JOHN A. RHODES ABSTRACT 1. INTRODUCTION

ELIZABETH S. ALLMAN and JOHN A. RHODES ABSTRACT 1. INTRODUCTION JOURNAL OF COMPUTATIONAL BIOLOGY Volume 13, Number 5, 2006 Mary Ann Liebert, Inc. Pp. 1101 1113 The Identifiability of Tree Topology for Phylogenetic Models, Including Covarion and Mixture Models ELIZABETH

More information

EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

More information

Practical Bioinformatics

Practical Bioinformatics 5/2/2017 Dictionaries d i c t i o n a r y = { A : T, T : A, G : C, C : G } d i c t i o n a r y [ G ] d i c t i o n a r y [ N ] = N d i c t i o n a r y. h a s k e y ( C ) Dictionaries g e n e t i c C o

More information

Phylogenetic Geometry

Phylogenetic Geometry Phylogenetic Geometry Ruth Davidson University of Illinois Urbana-Champaign Department of Mathematics Mathematics and Statistics Seminar Washington State University-Vancouver September 26, 2016 Phylogenies

More information

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: istance-based methods Ultrametric Additive: UPGMA Transformed istance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). 1 Bioinformatics: In-depth PROBABILITY & STATISTICS Spring Semester 2011 University of Zürich and ETH Zürich Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). Dr. Stefanie Muff

More information

Regulatory Sequence Analysis. Sequence models (Bernoulli and Markov models)

Regulatory Sequence Analysis. Sequence models (Bernoulli and Markov models) Regulatory Sequence Analysis Sequence models (Bernoulli and Markov models) 1 Why do we need random models? Any pattern discovery relies on an underlying model to estimate the random expectation. This model

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive. Additive distances Let T be a tree on leaf set S and let w : E R + be an edge-weighting of T, and assume T has no nodes of degree two. Let D ij = e P ij w(e), where P ij is the path in T from i to j. Then

More information

Reconstruire le passé biologique modèles, méthodes, performances, limites

Reconstruire le passé biologique modèles, méthodes, performances, limites Reconstruire le passé biologique modèles, méthodes, performances, limites Olivier Gascuel Centre de Bioinformatique, Biostatistique et Biologie Intégrative C3BI USR 3756 Institut Pasteur & CNRS Reconstruire

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Lecture 3: Markov chains.

Lecture 3: Markov chains. 1 BIOINFORMATIK II PROBABILITY & STATISTICS Summer semester 2008 The University of Zürich and ETH Zürich Lecture 3: Markov chains. Prof. Andrew Barbour Dr. Nicolas Pétrélis Adapted from a course by Dr.

More information

Modelling and Analysis in Bioinformatics. Lecture 1: Genomic k-mer Statistics

Modelling and Analysis in Bioinformatics. Lecture 1: Genomic k-mer Statistics 582746 Modelling and Analysis in Bioinformatics Lecture 1: Genomic k-mer Statistics Juha Kärkkäinen 06.09.2016 Outline Course introduction Genomic k-mers 1-Mers 2-Mers 3-Mers k-mers for Larger k Outline

More information

Phylogenetics: Parsimony and Likelihood. COMP Spring 2016 Luay Nakhleh, Rice University

Phylogenetics: Parsimony and Likelihood. COMP Spring 2016 Luay Nakhleh, Rice University Phylogenetics: Parsimony and Likelihood COMP 571 - Spring 2016 Luay Nakhleh, Rice University The Problem Input: Multiple alignment of a set S of sequences Output: Tree T leaf-labeled with S Assumptions

More information

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: Distance-based methods Ultrametric Additive: UPGMA Transformed Distance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Phylogenetic Assumptions

Phylogenetic Assumptions Substitution Models and the Phylogenetic Assumptions Vivek Jayaswal Lars S. Jermiin COMMONWEALTH OF AUSTRALIA Copyright htregulation WARNING This material has been reproduced and communicated to you by

More information

Phylogenetic inference

Phylogenetic inference Phylogenetic inference Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 7 th 016 After this lecture, you can discuss (dis-) advantages of different information types

More information

Phylogeny of Mixture Models

Phylogeny of Mixture Models Phylogeny of Mixture Models Daniel Štefankovič Department of Computer Science University of Rochester joint work with Eric Vigoda College of Computing Georgia Institute of Technology Outline Introduction

More information

TheDisk-Covering MethodforTree Reconstruction

TheDisk-Covering MethodforTree Reconstruction TheDisk-Covering MethodforTree Reconstruction Daniel Huson PACM, Princeton University Bonn, 1998 1 Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document

More information

Open Problems in Algebraic Statistics

Open Problems in Algebraic Statistics Open Problems inalgebraic Statistics p. Open Problems in Algebraic Statistics BERND STURMFELS UNIVERSITY OF CALIFORNIA, BERKELEY and TECHNISCHE UNIVERSITÄT BERLIN Advertisement Oberwolfach Seminar Algebraic

More information

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Distance Methods COMP 571 - Spring 2015 Luay Nakhleh, Rice University Outline Evolutionary models and distance corrections Distance-based methods Evolutionary Models and Distance Correction

More information

Evolutionary Models. Evolutionary Models

Evolutionary Models. Evolutionary Models Edit Operators In standard pairwise alignment, what are the allowed edit operators that transform one sequence into the other? Describe how each of these edit operations are represented on a sequence alignment

More information

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree) I9 Introduction to Bioinformatics, 0 Phylogenetic ree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & omputing, IUB Evolution theory Speciation Evolution of new organisms is driven by

More information

Phylogenetics: Parsimony

Phylogenetics: Parsimony 1 Phylogenetics: Parsimony COMP 571 Luay Nakhleh, Rice University he Problem 2 Input: Multiple alignment of a set S of sequences Output: ree leaf-labeled with S Assumptions Characters are mutually independent

More information

Supplemental data. Pommerrenig et al. (2011). Plant Cell /tpc

Supplemental data. Pommerrenig et al. (2011). Plant Cell /tpc Supplemental Figure 1. Prediction of phloem-specific MTK1 expression in Arabidopsis shoots and roots. The images and the corresponding numbers showing absolute (A) or relative expression levels (B) of

More information

Phylogenetics. BIOL 7711 Computational Bioscience

Phylogenetics. BIOL 7711 Computational Bioscience Consortium for Comparative Genomics! University of Colorado School of Medicine Phylogenetics BIOL 7711 Computational Bioscience Biochemistry and Molecular Genetics Computational Bioscience Program Consortium

More information

Molecular Evolution and Phylogenetic Tree Reconstruction

Molecular Evolution and Phylogenetic Tree Reconstruction 1 4 Molecular Evolution and Phylogenetic Tree Reconstruction 3 2 5 1 4 2 3 5 Orthology, Paralogy, Inparalogs, Outparalogs Phylogenetic Trees Nodes: species Edges: time of independent evolution Edge length

More information

Phylogeny: traditional and Bayesian approaches

Phylogeny: traditional and Bayesian approaches Phylogeny: traditional and Bayesian approaches 5-Feb-2014 DEKM book Notes from Dr. B. John Holder and Lewis, Nature Reviews Genetics 4, 275-284, 2003 1 Phylogeny A graph depicting the ancestor-descendent

More information

A Phylogenetic Network Construction due to Constrained Recombination

A Phylogenetic Network Construction due to Constrained Recombination A Phylogenetic Network Construction due to Constrained Recombination Mohd. Abdul Hai Zahid Research Scholar Research Supervisors: Dr. R.C. Joshi Dr. Ankush Mittal Department of Electronics and Computer

More information

USING INVARIANTS FOR PHYLOGENETIC TREE CONSTRUCTION

USING INVARIANTS FOR PHYLOGENETIC TREE CONSTRUCTION USING INVARIANTS FOR PHYLOGENETIC TREE CONSTRUCTION NICHOLAS ERIKSSON Abstract. Phylogenetic invariants are certain polynomials in the joint probability distribution of a Markov model on a phylogenetic

More information

arxiv: v1 [q-bio.pe] 23 Nov 2017

arxiv: v1 [q-bio.pe] 23 Nov 2017 DIMENSIONS OF GROUP-BASED PHYLOGENETIC MIXTURES arxiv:1711.08686v1 [q-bio.pe] 23 Nov 2017 HECTOR BAÑOS, NATHANIEL BUSHEK, RUTH DAVIDSON, ELIZABETH GROSS, PAMELA E. HARRIS, ROBERT KRONE, COLBY LONG, ALLEN

More information

Phylogenetic Networks, Trees, and Clusters

Phylogenetic Networks, Trees, and Clusters Phylogenetic Networks, Trees, and Clusters Luay Nakhleh 1 and Li-San Wang 2 1 Department of Computer Science Rice University Houston, TX 77005, USA nakhleh@cs.rice.edu 2 Department of Biology University

More information

A (short) introduction to phylogenetics

A (short) introduction to phylogenetics A (short) introduction to phylogenetics Thibaut Jombart, Marie-Pauline Beugin MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis with PR Statistics, Millport Field

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky MOLECULAR PHYLOGENY "Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky EVOLUTION - theory that groups of organisms change over time so that descendeants differ structurally

More information

Geometry of Phylogenetic Inference

Geometry of Phylogenetic Inference Geometry of Phylogenetic Inference Matilde Marcolli CS101: Mathematical and Computational Linguistics Winter 2015 References N. Eriksson, K. Ranestad, B. Sturmfels, S. Sullivant, Phylogenetic algebraic

More information

A Minimum Spanning Tree Framework for Inferring Phylogenies

A Minimum Spanning Tree Framework for Inferring Phylogenies A Minimum Spanning Tree Framework for Inferring Phylogenies Daniel Giannico Adkins Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-010-157

More information

BMI/CS 776 Lecture 4. Colin Dewey

BMI/CS 776 Lecture 4. Colin Dewey BMI/CS 776 Lecture 4 Colin Dewey 2007.02.01 Outline Common nucleotide substitution models Directed graphical models Ancestral sequence inference Poisson process continuous Markov process X t0 X t1 X t2

More information

Inferring Phylogenetic Trees. Distance Approaches. Representing distances. in rooted and unrooted trees. The distance approach to phylogenies

Inferring Phylogenetic Trees. Distance Approaches. Representing distances. in rooted and unrooted trees. The distance approach to phylogenies Inferring Phylogenetic Trees Distance Approaches Representing distances in rooted and unrooted trees The distance approach to phylogenies given: an n n matrix M where M ij is the distance between taxa

More information

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Paul has many great tools for teaching phylogenetics at his web site: http://hydrodictyon.eeb.uconn.edu/people/plewis

More information

Reconstructing Trees from Subtree Weights

Reconstructing Trees from Subtree Weights Reconstructing Trees from Subtree Weights Lior Pachter David E Speyer October 7, 2003 Abstract The tree-metric theorem provides a necessary and sufficient condition for a dissimilarity matrix to be a tree

More information

1. Can we use the CFN model for morphological traits?

1. Can we use the CFN model for morphological traits? 1. Can we use the CFN model for morphological traits? 2. Can we use something like the GTR model for morphological traits? 3. Stochastic Dollo. 4. Continuous characters. Mk models k-state variants of the

More information

Quartet Inference from SNP Data Under the Coalescent Model

Quartet Inference from SNP Data Under the Coalescent Model Bioinformatics Advance Access published August 7, 2014 Quartet Inference from SNP Data Under the Coalescent Model Julia Chifman 1 and Laura Kubatko 2,3 1 Department of Cancer Biology, Wake Forest School

More information

Identifiability of the GTR+Γ substitution model (and other models) of DNA evolution

Identifiability of the GTR+Γ substitution model (and other models) of DNA evolution Identifiability of the GTR+Γ substitution model (and other models) of DNA evolution Elizabeth S. Allman Dept. of Mathematics and Statistics University of Alaska Fairbanks TM Current Challenges and Problems

More information

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington. Maximum Likelihood This presentation is based almost entirely on Peter G. Fosters - "The Idiot s Guide to the Zen of Likelihood in a Nutshell in Seven Days for Dummies, Unleashed. http://www.bioinf.org/molsys/data/idiots.pdf

More information

Phylogenetic inference: from sequences to trees

Phylogenetic inference: from sequences to trees W ESTFÄLISCHE W ESTFÄLISCHE W ILHELMS -U NIVERSITÄT NIVERSITÄT WILHELMS-U ÜNSTER MM ÜNSTER VOLUTIONARY FUNCTIONAL UNCTIONAL GENOMICS ENOMICS EVOLUTIONARY Bioinformatics 1 Phylogenetic inference: from sequences

More information

Math 671: Tensor Train decomposition methods

Math 671: Tensor Train decomposition methods Math 671: Eduardo Corona 1 1 University of Michigan at Ann Arbor December 8, 2016 Table of Contents 1 Preliminaries and goal 2 Unfolding matrices for tensorized arrays The Tensor Train decomposition 3

More information

Biology 644: Bioinformatics

Biology 644: Bioinformatics A stochastic (probabilistic) model that assumes the Markov property Markov property is satisfied when the conditional probability distribution of future states of the process (conditional on both past

More information

arxiv: v3 [math.ag] 21 Aug 2008

arxiv: v3 [math.ag] 21 Aug 2008 ON THE IDEALS OF EQUIVARIANT TREE MODELS arxiv:0712.3230v3 [math.ag] 21 Aug 2008 JAN DRAISMA AND JOCHEN KUTTLER Abstract. We introduce equivariant tree models in algebraic statistics, which unify and generalise

More information

Mathematical Biology. Phylogenetic mixtures and linear invariants for equal input models. B Mike Steel. Marta Casanellas 1 Mike Steel 2

Mathematical Biology. Phylogenetic mixtures and linear invariants for equal input models. B Mike Steel. Marta Casanellas 1 Mike Steel 2 J. Math. Biol. DOI 0.007/s00285-06-055-8 Mathematical Biology Phylogenetic mixtures and linear invariants for equal input models Marta Casanellas Mike Steel 2 Received: 5 February 206 / Revised: July 206

More information

arxiv: v1 [math.ra] 13 Jan 2009

arxiv: v1 [math.ra] 13 Jan 2009 A CONCISE PROOF OF KRUSKAL S THEOREM ON TENSOR DECOMPOSITION arxiv:0901.1796v1 [math.ra] 13 Jan 2009 JOHN A. RHODES Abstract. A theorem of J. Kruskal from 1977, motivated by a latent-class statistical

More information

What Is Conservation?

What Is Conservation? What Is Conservation? Lee A. Newberg February 22, 2005 A Central Dogma Junk DNA mutates at a background rate, but functional DNA exhibits conservation. Today s Question What is this conservation? Lee A.

More information

Advanced topics in bioinformatics

Advanced topics in bioinformatics Feinberg Graduate School of the Weizmann Institute of Science Advanced topics in bioinformatics Shmuel Pietrokovski & Eitan Rubin Spring 2003 Course WWW site: http://bioinformatics.weizmann.ac.il/courses/atib

More information

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057 Estimating Phylogenies (Evolutionary Trees) II Biol4230 Thurs, March 2, 2017 Bill Pearson wrp@virginia.edu 4-2818 Jordan 6-057 Tree estimation strategies: Parsimony?no model, simply count minimum number

More information

Stochastic Errors vs. Modeling Errors in Distance Based Phylogenetic Reconstructions

Stochastic Errors vs. Modeling Errors in Distance Based Phylogenetic Reconstructions PLGW05 Stochastic Errors vs. Modeling Errors in Distance Based Phylogenetic Reconstructions 1 joint work with Ilan Gronau 2, Shlomo Moran 3, and Irad Yavneh 3 1 2 Dept. of Biological Statistics and Computational

More information

Maximum Likelihood Estimates for Binary Random Variables on Trees via Phylogenetic Ideals

Maximum Likelihood Estimates for Binary Random Variables on Trees via Phylogenetic Ideals Maximum Likelihood Estimates for Binary Random Variables on Trees via Phylogenetic Ideals Robin Evans Abstract In their 2007 paper, E.S. Allman and J.A. Rhodes characterise the phylogenetic ideal of general

More information

Pitfalls of Heterogeneous Processes for Phylogenetic Reconstruction

Pitfalls of Heterogeneous Processes for Phylogenetic Reconstruction Pitfalls of Heterogeneous Processes for Phylogenetic Reconstruction Daniel Štefankovič Eric Vigoda June 30, 2006 Department of Computer Science, University of Rochester, Rochester, NY 14627, and Comenius

More information

Lecture 9 : Identifiability of Markov Models

Lecture 9 : Identifiability of Markov Models Lecture 9 : Identifiability of Markov Models MATH285K - Spring 2010 Lecturer: Sebastien Roch References: [SS03, Chapter 8]. Previous class THM 9.1 (Uniqueness of tree metric representation) Let δ be a

More information

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A GAGATC 3:G A 6:C T Common Ancestor ACGATC 1:A G 2:C A Substitution = Mutation followed 5:T C by Fixation GAAATT 4:A C 1:G A AAAATT GAAATT GAGCTC ACGACC Chimp Human Gorilla Gibbon AAAATT GAAATT GAGCTC ACGACC

More information

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary CSCI1950 Z Computa4onal Methods for Biology Lecture 4 Ben Raphael February 2, 2009 hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary Parsimony Probabilis4c Method Input Output Sankoff s & Fitch

More information

A new parametrization for binary hidden Markov modes

A new parametrization for binary hidden Markov modes A new parametrization for binary hidden Markov models Andrew Critch, UC Berkeley at Pennsylvania State University June 11, 2012 See Binary hidden Markov models and varieties [, 2012], arxiv:1206.0500,

More information

Weighted Quartets Phylogenetics

Weighted Quartets Phylogenetics Weighted Quartets Phylogenetics Yunan Luo E. Avni, R. Cohen, and S. Snir. Weighted quartets phylogenetics. Systematic Biology, 2014. syu087 Problem: quartet-based supertree Input Output A B C D A C D E

More information

Counting phylogenetic invariants in some simple cases. Joseph Felsenstein. Department of Genetics SK-50. University of Washington

Counting phylogenetic invariants in some simple cases. Joseph Felsenstein. Department of Genetics SK-50. University of Washington Counting phylogenetic invariants in some simple cases Joseph Felsenstein Department of Genetics SK-50 University of Washington Seattle, Washington 98195 Running Headline: Counting Phylogenetic Invariants

More information

进化树构建方法的概率方法 第 4 章 : 进化树构建的概率方法 问题介绍. 部分 lid 修改自 i i f l 的 ih l i

进化树构建方法的概率方法 第 4 章 : 进化树构建的概率方法 问题介绍. 部分 lid 修改自 i i f l 的 ih l i 第 4 章 : 进化树构建的概率方法 问题介绍 进化树构建方法的概率方法 部分 lid 修改自 i i f l 的 ih l i 部分 Slides 修改自 University of Basel 的 Michael Springmann 课程 CS302 Seminar Life Science Informatics 的讲义 Phylogenetic Tree branch internal node

More information

Lecture Notes: Markov chains

Lecture Notes: Markov chains Computational Genomics and Molecular Biology, Fall 5 Lecture Notes: Markov chains Dannie Durand At the beginning of the semester, we introduced two simple scoring functions for pairwise alignments: a similarity

More information

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees Erin Molloy and Tandy Warnow {emolloy2, warnow}@illinois.edu University of Illinois at Urbana

More information

Phylogenetics: Building Phylogenetic Trees

Phylogenetics: Building Phylogenetic Trees 1 Phylogenetics: Building Phylogenetic Trees COMP 571 Luay Nakhleh, Rice University 2 Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary model should

More information

Phylogenetics. Andreas Bernauer, March 28, Expected number of substitutions using matrix algebra 2

Phylogenetics. Andreas Bernauer, March 28, Expected number of substitutions using matrix algebra 2 Phylogenetics Andreas Bernauer, andreas@carrot.mcb.uconn.edu March 28, 2004 Contents 1 ts:tr rate ratio vs. ts:tr ratio 1 2 Expected number of substitutions using matrix algebra 2 3 Why the GTR model can

More information

Markov Models & DNA Sequence Evolution

Markov Models & DNA Sequence Evolution 7.91 / 7.36 / BE.490 Lecture #5 Mar. 9, 2004 Markov Models & DNA Sequence Evolution Chris Burge Review of Markov & HMM Models for DNA Markov Models for splice sites Hidden Markov Models - looking under

More information

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University Phylogenetics: Building Phylogenetic Trees COMP 571 - Fall 2010 Luay Nakhleh, Rice University Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary

More information

Introduction to Molecular Phylogeny

Introduction to Molecular Phylogeny Introduction to Molecular Phylogeny Starting point: a set of homologous, aligned DNA or protein sequences Result of the process: a tree describing evolutionary relationships between studied sequences =

More information

Phylogenetics: Likelihood

Phylogenetics: Likelihood 1 Phylogenetics: Likelihood COMP 571 Luay Nakhleh, Rice University The Problem 2 Input: Multiple alignment of a set S of sequences Output: Tree T leaf-labeled with S Assumptions 3 Characters are mutually

More information

Crick s early Hypothesis Revisited

Crick s early Hypothesis Revisited Crick s early Hypothesis Revisited Or The Existence of a Universal Coding Frame Ryan Rossi, Jean-Louis Lassez and Axel Bernal UPenn Center for Bioinformatics BIOINFORMATICS The application of computer

More information

Nature Structural & Molecular Biology: doi: /nsmb Supplementary Figure 1

Nature Structural & Molecular Biology: doi: /nsmb Supplementary Figure 1 Supplementary Figure 1 Zn 2+ -binding sites in USP18. (a) The two molecules of USP18 present in the asymmetric unit are shown. Chain A is shown in blue, chain B in green. Bound Zn 2+ ions are shown as

More information

Probabilistic modeling and molecular phylogeny

Probabilistic modeling and molecular phylogeny Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis Technical University of Denmark (DTU) What is a model? Mathematical

More information

Bioinformatics 1 -- lecture 9. Phylogenetic trees Distance-based tree building Parsimony

Bioinformatics 1 -- lecture 9. Phylogenetic trees Distance-based tree building Parsimony ioinformatics -- lecture 9 Phylogenetic trees istance-based tree building Parsimony (,(,(,))) rees can be represented in "parenthesis notation". Each set of parentheses represents a branch-point (bifurcation),

More information

Practical Linear Algebra: A Geometry Toolbox

Practical Linear Algebra: A Geometry Toolbox Practical Linear Algebra: A Geometry Toolbox Third edition Chapter 12: Gauss for Linear Systems Gerald Farin & Dianne Hansford CRC Press, Taylor & Francis Group, An A K Peters Book www.farinhansford.com/books/pla

More information

Isotropic models of evolution with symmetries

Isotropic models of evolution with symmetries Isotropic models of evolution with symmetries Weronika Buczyńska, Maria Donten, and Jaros law A. Wiśniewski Dedicated to Andrew Sommese for his 60th birthday. Abstract. We consider isotropic Markov models

More information

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline Page Evolutionary Trees Russ. ltman MI S 7 Outline. Why build evolutionary trees?. istance-based vs. character-based methods. istance-based: Ultrametric Trees dditive Trees. haracter-based: Perfect phylogeny

More information

Tree of Life iological Sequence nalysis Chapter http://tolweb.org/tree/ Phylogenetic Prediction ll organisms on Earth have a common ancestor. ll species are related. The relationship is called a phylogeny

More information

Evolutionary Tree Analysis. Overview

Evolutionary Tree Analysis. Overview CSI/BINF 5330 Evolutionary Tree Analysis Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Distance-Based Evolutionary Tree Reconstruction Character-Based

More information

High throughput near infrared screening discovers DNA-templated silver clusters with peak fluorescence beyond 950 nm

High throughput near infrared screening discovers DNA-templated silver clusters with peak fluorescence beyond 950 nm Electronic Supplementary Material (ESI) for Nanoscale. This journal is The Royal Society of Chemistry 2018 High throughput near infrared screening discovers DNA-templated silver clusters with peak fluorescence

More information

Quantifying sequence similarity

Quantifying sequence similarity Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity

More information

X X (2) X Pr(X = x θ) (3)

X X (2) X Pr(X = x θ) (3) Notes for 848 lecture 6: A ML basis for compatibility and parsimony Notation θ Θ (1) Θ is the space of all possible trees (and model parameters) θ is a point in the parameter space = a particular tree

More information