Phylogenetic Algebraic Geometry Seth Sullivant North Carolina State University January 4, 2012 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 1 / 28
Phylogenetics Problem Given a collection of species, find the tree that explains their history. Data consists of aligned DNA sequences from homologous genes Human: Chimp: Gorilla: Seth Sullivant (NCSU)...ACCGTGCAACGTGAACGA......ACCTTGGAAGGTAAACGA......ACCGTGCAACGTAAACTA... Phylogenetic Algebraic Geometry January 4, 2012 2 / 28
Model-Based Phylogenetics Use a probabilistic model of mutations Parameters for the model are the combinatorial tree T, and rate parameters for mutations on each edge Models give a probability for observing a particular aligned collection of DNA sequences Human: Chimp: Gorilla: ACCGTGCAACGTGAACGA ACGTTGCAAGGTAAACGA ACCGTGCAACGTAAACTA Assuming site independence, data is summarized by empirical distribution of columns in the alignment. e.g. ˆp(AAA) = 5 18, ˆp(CGC) = 2 18, etc. Use empirical distribution and test statistic to find tree best explaining data Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 3 / 28
Phylogenetic Models Assuming site independence: Phylogenetic Model is a latent class graphical model Vertex v T gives a random variable X v {A, C, G, T} All random variables corresponding to internal nodes are latent X 1 X 2 X 3 Y 2 Y 1 P(x 1, x 2, x 3 ) = y 1 y 2 P(y 1 )P(y 2 y 1 )P(x 1 y 1 )P(x 2 y 2 )P(x 3 y 2 ) Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 4 / 28
Phylogenetic Models Assuming site independence: Phylogenetic Model is a latent class graphical model Vertex v T gives a random variable X v {A, C, G, T} All random variables corresponding to internal nodes are latent X 1 X 2 X 3 Y 2 Y 1 p i1 i 2 i 3 = j 1 j 2 π j1 a j2,j 1 b i1,j 1 c i2,j 2 d i3,j 2 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 5 / 28
Algebraic Perspective on Phylogenetic Models Once we fix a tree T and model structure, we get a map φ T : Θ R 4n. Θ R d is a parameter space of numerical parameters (transition matrices associated to each edge). The map φ T is given by polynomial functions of the parameters. For each i 1 i n {A, C, G, T } n, φ T i 1 i n (θ) gives the probability of the column (i 1,..., i n ) in the alignment for the particular parameter choice θ. φ T i 1 i 2 i 3 (π, a, b, c, d) = π j1 a j2,j 1 b i1,j 1 c i2,j 2 d i3,j 2 j 1 j 2 The phylogenetic model is the set M T = φ T (Θ) R 4n. Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 6 / 28
Phylogenetic Varieties and Phylogenetic Invariants Let R[p] := R[p i1 i n : i 1 i n {A, C, G, T } n ] Definition Let I T := f R[p] : f (p) = 0 for all p M T R[p]. I T is the ideal of phylogenetic invariants of T. Let V T := {p R 4n : f (p) = 0 for all f I T }. V T is the phylogenetic variety of T. Note that M T V T. Since M T is image of a polynomial map dim M T = dim V T. Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 7 / 28
X 1 X X X 2 3 4 p lmno = 4 4 4 π i a ij b ik c jl d jm e kn f ko i=1 j=1 k=1 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 8 / 28
X 1 X X X 2 3 4 p lmno = 4 4 4 π i a ij b ik c jl d jm e kn f ko i=1 j=1 k=1 4 a ij c jl d jm j=1 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 8 / 28
X 1 X X X 2 3 4 p lmno = 4 4 4 π i a ij b ik c jl d jm e kn f ko i=1 j=1 k=1 4 4 a ij c jl d jm b ik e kn f ko j=1 k =1 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 8 / 28
X 1 X X X 2 3 4 p lmno = = 4 4 4 π i a ij b ik c jl d jm e kn f ko i=1 j=1 k=1 4 4 4 a ij c jl d jm b ik e kn f ko π i i=1 j=1 k =1 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 8 / 28
X 1 X X X 2 3 4 p lmno = = 4 4 4 π i a ij b ik c jl d jm e kn f ko i=1 j=1 k=1 4 4 4 a ij c jl d jm b ik e kn f ko π i i=1 j=1 k =1 p 1111 p 1112 p 1144 p = rank 1211 p 1212 p 1244...... 4 p 4411 p 4412 p 4444 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 8 / 28
Splits and Phylogenetic Invariants Definition A split of a set is a bipartition A B. A split A B of the leaves of a tree T is valid for T if the induced trees T A and T B do not intersect. X 1 X X X 2 3 4 Valid: 12 34 Not Valid: 13 24 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 9 / 28
2-way Flattenings and Matrix Ranks Proposition Let P M T. p ijkl = P(X 1 = i, X 2 = j, X 3 = k, X 4 = l) p AAAA p AAAC p AAAG p AATT p ACAA p ACAC p ACAG p ACTT Flat 12 34 (P) =....... p TTAA p TTAC p TTAG p TTTT If A B is a valid split for T, then rank(flat A B (P)) 4. Invariants in I T are subdeterminants of Flat A B (P). If C D is not a valid split for T, then generically rank(flat C D (P)) > 4. Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 10 / 28
Phylogenetic Algebraic Geometry Phylogenetic Algebraic Geometry is the study of the phylogenetic varieties and ideals V T and I T. Using Phylogenetic Invariants to Reconstruct Trees Identifiability of Phylogenetic Models Cool Math For Its Own Sake Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 11 / 28
Using Phylogenetic Invariants to Reconstruct Trees Definition A phylogenetic invariant f I T is phylogenetically informative if there is some other tree T such that f / I T. Idea of Cavender-Felsenstein (1987), Lake (1987): Evaluate phylogenetically informative phylogenetic invariants at empirical distribution ˆp to reconstruct phylogenetic trees Proposition For each n-leaf trivalent tree T, let F T I T be a set of phylogenetic invariants such that, for each T T, there is an f F T, such that f / I T. Let f T := f F T f. Then for generic p M T, f T (p) = 0 if and only if p M T. Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 12 / 28
Performance of Invariants Methods in Simulations Huelsenbeck (1995) did a systematic simulation comparison of 26 different methods of constructing a phylogenetic tree on 4 leaf trees. Invariant-based methods did poorly. HOWEVER... Huelsenbeck only used linear invariants. Casanellas, Fernandez-Sanchez (2006) redid these simulations using a generating set of the phylogenetic ideal I T. Phylogenetic invariants become comparable to other methods. For the particular model studied in Casanellas, Fernandez-Sanchez (2006) for a tree with 4 leaves, the ideal I T has 8002 generators. f T := f F T f is a sum of 8002 terms. Major work to overcome combinatorial explosion for larger trees. Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 13 / 28
Identifiability of Phylogenetic Models Definition A parametric statistical model is identifiable if it gives 1-to-1 map from parameters to probability distributions. Is it possible to infer the parameters of the model from data? Identifiability guarantees consistency of statistical methods (ML) Two types of parameters to consider for phylogenetic models: Numerical parameters (transition matrices) Tree parameter (combinatorial type of tree) Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 14 / 28
Geometric Perspective on Identifiability Definition The tree parameter T in a phylogenetic model is identifiable if for all p M T there does not exist another T T such that p M T. M 1 M 2 M1 M2 M 3 Identifiable Not Identifiable Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 15 / 28
Generic Identifiability Definition The tree parameter in a phylogenetic model is generically identifiable if for all n-leaf trees with T T, dim(m T M T ) < min(dim(m T ), dim(m T )). M1 M2 M3 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 16 / 28
Proving Identifiability with Algebraic Geometry Proposition Let M 0 and M 1 be two algebraic models. If there exist phylogenetic invariants f 0 and f 1 such that f i (p) = 0 for all p M i, and f i (q) 0 for some q M 1 i, then dim(m 0 M 1 ) < min(dim M 0, dim M 1 ). M0 f = 0 M 1 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 17 / 28
Phylogenetic Models are Identifiable Theorem The tree parameter of phylogenetic models is generically identifiable. Proof. Edge flattening invariants can detect which splits are implied by a specific distribution in M T. The splits in T uniquely determine T. Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 18 / 28
Phylogenetic Mixture Models Basic phylogenetic model assume site independence This assumption is not accurate within a single gene Some sites more important: unlikely to change Tree structure may vary across genes Leads to mixture models for different classes of sites M(T, r) denotes a same tree mixture model with underlying tree T and r classes of sites Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 19 / 28
Theorem (Rhodes-Sullivant 2011) The tree and numerical parameters in a r-class, same tree phylogenetic mixture model on n-leaf trivalent trees are generically identifiable, if r < 4 n/4. Proof Ideas. Phylogenetic invariants from flattenings Tensor rank (Kruskal s Theorem) [Allman-Matias-Rhodes 2009] Elementary tree combinatorics Solving tree and numerical parameter identifiability at the same time Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 20 / 28
Phylogenetic Varieties What are They? Question Phylogenetic algebraic geometry is filled with interesting varieties and ideals. Are they familiar objects from classical algebraic geometry? Definition The Cavender-Farris-Neyman model (CFN) is the phylogenetic model where each random variable has two states {R, Y }, or {0, 1}, and the Markov transition matrix is symmetric. ( ) 1 ae a e a e 1 a e Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 21 / 28
The Two-State Model is a Toric Variety Consider binary states in CFN as {0, 1} = Z/2Z. Transitions probabilities satisfy Prob(X = g Y = h) = f (g + h). This means that the formula for Prob(X 1 = g 1,..., X n = g n ) is a convolution (over (Z/2Z) n ). Apply discrete Fourier transform to turn convolution into a product. Theorem (Hendy-Penny 1993, Evans-Speed 1993) In the Fourier coordinates, the CFN model is parametrized by monomial functions in terms of the Fourier parameters. In particular, the CFN model is a toric variety. Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 22 / 28
Equations for the CFN Model Theorem (Sturmfels-Sullivant 2005) For any tree T, the toric ideal I T for the CFN model is generated by degree 2 determinantal equations. X 1 X 2 X3 X4 Fourier coordinates: q lmno = r,s,t,u {0,1} ( 1)rl+sm+tn+uo p rstu ( ) q0000 q 0001 q 0010 q 0011 q 1100 q 1101 q 1110 q 1111 ( q0100 q 0101 q 0110 ) q 0111 q 1000 q 1001 q 1010 q 1011 I T generated by 2 2 minors of: q 0000 q 0011 q 0001 q 0010 q 0100 q 0111 q 0101 q 0110 q 1000 q 1011 q 1001 q 1010 q 1100 q 1111 q 1101 q 1110 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 23 / 28
Buczynska-Wisniewski Theorem Definition For graded K-algebra R = t N R t, its Hilbert function is Hilb R (t) = dim K R t. For large t >> 0, Hilb R (t) equals a polynomial: the Hilbert polynomial. The Hilbert polynomial of a projective variety V is the Hilbert polynomial of the homogeneous coordinate ring K[p]/I(V ). Theorem (Buczynska-Wisniewski 2007) For any trivalent tree T with n leaves, under the CFN model the phylogenetic varieties V T have the same Hilbert polynomial. Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 24 / 28
The Hilbert Scheme Definition The Hilbert scheme is a quasi-projective scheme parametrizing all ideals in K[p] with a given fixed Hilbert polynomial. T T T The BW theorem was proven by combinatorially shifting T to another tree T, which preserves the Hilbert function. These are deformation/degeneration steps in the Hilbert scheme. Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 25 / 28
Cox Rings Theorem (Sturmfels-Xu 2010) Let T be any trivalent tree with n leaves. Then K[p]/I T is a flat degeneration of the Cox-Nagata ring of the blow-up of P n 3 in n points: Proof Sketch. Cox (Bl n P n 3 ) Castravet-Tevelev Theorem on finite generation of Cox (Bl n P n 3 ), and connection to Hilbert s 14th problem. Verlinde formula from mathematical physics (which is the multigraded Hilbert function common to all rings K[p]/I T ). SAGBI degeneration. Generalized to arbitrary genus (and arbitrary trivalent graphs) in [Manon 2009] Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 26 / 28
Summary Phylogenetic models are fundamentally algebraic-geometric objects. Algebraic perspective is useful for: Developing new construction algorithms Proving theorems about identifiability (currently best available for mixture models) Phylogenetics motivates lots of interesting new mathematics Long way to go: Your Help Needed! Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 27 / 28
References E. Allman, C. Matias, J. Rhodes. Identifiability of parameters in latent structure models with many observed variables. Annals of Statistics, 37 no.6a (2009) 3099-3132. M. Casanellas, J. Fernandez-Sanchez. Performance of a new invariants method on homogeneous and non-homogeneous quartet trees, Molecular Biology and Evolution, 24(1):288-293, 2006. J. Cavender, J. Felsenstein. Invariants of phylogenies: a simple case with discrete states. Journal of Classification 4 (1987) 57-71. S. Evans and T. Speed. Invariants of some probability models used in phylogenetic inference. Annals of Statistics 21 (1993) 355-377. M. Hendy, D. Penny. Spectral analysis of phylogenetic data. J. Classification 10 (1993) 5 24. J. Huelsenbeck. Performance of phylogenetic methods in simulation. Systematic Biology 42 no.1 (1995) 17 48. J. Lake. A rate-independent technique for analysis of nucleaic acid sequences: evolutionary parsimony. Molecular Biology and Evolution 4 (1987) 167-191. C. Manon. The algebra of conformal blocks, 2009. 0910.0577 J. Rhodes, S. Sullivant. Identifiability of large phylogenetic mixture models. To appear Bulletin of Mathematical Biology, 2011. 1011.4134 B. Sturmfels, S. Sullivant. Toric ideals of phylogenetic invariants. Journal of Computational Biology 12 (2005) 204-228. B. Sturmfels, Z. Xu. Sagbi bases of Cox-Nagata rings. Journal of the European Mathematical Society 12 (2010) 429-459. Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 28 / 28