Phylogenetic Algebraic Geometry

Similar documents
Algebraic Statistics Tutorial I

Introduction to Algebraic Statistics

Example: Hardy-Weinberg Equilibrium. Algebraic Statistics Tutorial I. Phylogenetics. Main Point of This Tutorial. Model-Based Phylogenetics

Phylogenetic invariants versus classical phylogenetics

Using algebraic geometry for phylogenetic reconstruction

Metric learning for phylogenetic invariants

PHYLOGENETIC ALGEBRAIC GEOMETRY

Open Problems in Algebraic Statistics

Species Tree Inference using SVDquartets

Geometry of Phylogenetic Inference

arxiv: v1 [q-bio.pe] 23 Nov 2017

Toric Fiber Products

Phylogeny of Mixture Models

ELIZABETH S. ALLMAN and JOHN A. RHODES ABSTRACT 1. INTRODUCTION

Total binomial decomposition (TBD) Thomas Kahle Otto-von-Guericke Universität Magdeburg

arxiv: v1 [q-bio.pe] 27 Oct 2011

Identifiability of the GTR+Γ substitution model (and other models) of DNA evolution

PRIMARY DECOMPOSITION FOR THE INTERSECTION AXIOM

USING INVARIANTS FOR PHYLOGENETIC TREE CONSTRUCTION

Maximum Agreement Subtrees

The Generalized Neighbor Joining method

Combinatorial Aspects of Tropical Geometry and its interactions with phylogenetics

mathematics Smoothness in Binomial Edge Ideals Article Hamid Damadi and Farhad Rahmati *

19 Tree Construction using Singular Value Decomposition

Maximum Likelihood Estimates for Binary Random Variables on Trees via Phylogenetic Ideals

Jed Chou. April 13, 2015

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Dissimilarity maps on trees and the representation theory of GL n (C)

Geometry of Gaussoids

Phylogenetic Geometry

arxiv: v3 [q-bio.pe] 3 Dec 2011

Tutorial: Gaussian conditional independence and graphical models. Thomas Kahle Otto-von-Guericke Universität Magdeburg

Algebraic Invariants of Phylogenetic Trees

CSCI1950 Z Computa4onal Methods for Biology Lecture 5

1. Can we use the CFN model for morphological traits?

Pitfalls of Heterogeneous Processes for Phylogenetic Reconstruction

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Reconstruire le passé biologique modèles, méthodes, performances, limites

Combinatorics and geometry of E 7

The Maximum Likelihood Threshold of a Graph

The binomial ideal of the intersection axiom for conditional probabilities

arxiv: v6 [math.st] 19 Oct 2011

Commuting birth-and-death processes

Algebraic Complexity in Statistics using Combinatorial and Tensor Methods

Is Every Secant Variety of a Segre Product Arithmetically Cohen Macaulay? Oeding (Auburn) acm Secants March 6, / 23

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30

INITIAL COMPLEX ASSOCIATED TO A JET SCHEME OF A DETERMINANTAL VARIETY. the affine space of dimension k over F. By a variety in A k F

Pitfalls of Heterogeneous Processes for Phylogenetic Reconstruction

LOVÁSZ-SAKS-SCHRIJVER IDEALS AND COORDINATE SECTIONS OF DETERMINANTAL VARIETIES

Lecture Notes: Markov chains

Stochastic Errors vs. Modeling Errors in Distance Based Phylogenetic Reconstructions

University of Warwick institutional repository:

EVOLUTIONARY DISTANCES

A few logs suce to build (almost) all trees: Part II

The statistical and informatics challenges posed by ascertainment biases in phylogenetic data collection

Isotropic models of evolution with symmetries

Toric Varieties in Statistics

Noetherianity up to symmetry

Evolutionary Tree Analysis. Overview

ALGEBRA: From Linear to Non-Linear. Bernd Sturmfels University of California at Berkeley

When Do Phylogenetic Mixture Models Mimic Other Phylogenetic Models?

Max-Planck-Institut für Mathematik in den Naturwissenschaften Leipzig

DIRAC S THEOREM ON CHORDAL GRAPHS

ON A CONJECTURE BY KALAI

Toric Deformation of the Hankel Variety

arxiv: v1 [q-bio.pe] 1 Jun 2014

A new parametrization for binary hidden Markov modes

The partial-fractions method for counting solutions to integral linear systems

arxiv: v1 [math.ra] 13 Jan 2009

PROBLEMS FOR VIASM MINICOURSE: GEOMETRY OF MODULI SPACES LAST UPDATED: DEC 25, 2013

On the extremal Betti numbers of binomial edge ideals of block graphs

Algebraic Classification of Small Bayesian Networks

GENERALIZED CAYLEY-CHOW COORDINATES AND COMPUTER VISION

A SURVEY OF CLUSTER ALGEBRAS

5. Grassmannians and the Space of Trees In this lecture we shall be interested in a very particular ideal. The ambient polynomial ring C[p] has ( n

Reconstructing Trees from Subtree Weights

The intersection axiom of

arxiv: v2 [math.ac] 17 Apr 2013

Multi-parameter persistent homology: applications and algorithms

Hypertoric varieties and hyperplane arrangements

An Introduction to Tropical Geometry

On the effective cone of P n blown-up at n + 3 points (and other positivity properties)

Binomial Exercises A = 1 1 and 1

Sheaf cohomology and non-normal varieties

BMI/CS 776 Lecture 4. Colin Dewey

Phylogenetic Networks, Trees, and Clusters

DISCRIMINANTS, SYMMETRIZED GRAPH MONOMIALS, AND SUMS OF SQUARES

Combinatorial semigroups and induced/deduced operators

1 Hilbert function. 1.1 Graded rings. 1.2 Graded modules. 1.3 Hilbert function

arxiv: v4 [math.ac] 31 May 2012

Some Algebraic and Combinatorial Properties of the Complete T -Partite Graphs

LINEAR RESOLUTIONS OF QUADRATIC MONOMIAL IDEALS. 1. Introduction. Let G be a simple graph (undirected with no loops or multiple edges) on the vertex

MOTIVES ASSOCIATED TO SUMS OF GRAPHS

9th and 10th Grade Math Proficiency Objectives Strand One: Number Sense and Operations

ALGEBRAIC AND COMBINATORIAL PROPERTIES OF CERTAIN TORIC IDEALS IN THEORY AND APPLICATIONS

CHARACTERIZATION OF GORENSTEIN STRONGLY KOSZUL HIBI RINGS BY F-INVARIANTS

Secant Varieties and Equations of Abo-Wan Hypersurfaces

DIANE MACLAGAN. Abstract. The main result of this paper is that all antichains are. One natural generalization to more abstract posets is shown to be

A Phylogenetic Network Construction due to Constrained Recombination

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary

Transcription:

Phylogenetic Algebraic Geometry Seth Sullivant North Carolina State University January 4, 2012 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 1 / 28

Phylogenetics Problem Given a collection of species, find the tree that explains their history. Data consists of aligned DNA sequences from homologous genes Human: Chimp: Gorilla: Seth Sullivant (NCSU)...ACCGTGCAACGTGAACGA......ACCTTGGAAGGTAAACGA......ACCGTGCAACGTAAACTA... Phylogenetic Algebraic Geometry January 4, 2012 2 / 28

Model-Based Phylogenetics Use a probabilistic model of mutations Parameters for the model are the combinatorial tree T, and rate parameters for mutations on each edge Models give a probability for observing a particular aligned collection of DNA sequences Human: Chimp: Gorilla: ACCGTGCAACGTGAACGA ACGTTGCAAGGTAAACGA ACCGTGCAACGTAAACTA Assuming site independence, data is summarized by empirical distribution of columns in the alignment. e.g. ˆp(AAA) = 5 18, ˆp(CGC) = 2 18, etc. Use empirical distribution and test statistic to find tree best explaining data Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 3 / 28

Phylogenetic Models Assuming site independence: Phylogenetic Model is a latent class graphical model Vertex v T gives a random variable X v {A, C, G, T} All random variables corresponding to internal nodes are latent X 1 X 2 X 3 Y 2 Y 1 P(x 1, x 2, x 3 ) = y 1 y 2 P(y 1 )P(y 2 y 1 )P(x 1 y 1 )P(x 2 y 2 )P(x 3 y 2 ) Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 4 / 28

Phylogenetic Models Assuming site independence: Phylogenetic Model is a latent class graphical model Vertex v T gives a random variable X v {A, C, G, T} All random variables corresponding to internal nodes are latent X 1 X 2 X 3 Y 2 Y 1 p i1 i 2 i 3 = j 1 j 2 π j1 a j2,j 1 b i1,j 1 c i2,j 2 d i3,j 2 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 5 / 28

Algebraic Perspective on Phylogenetic Models Once we fix a tree T and model structure, we get a map φ T : Θ R 4n. Θ R d is a parameter space of numerical parameters (transition matrices associated to each edge). The map φ T is given by polynomial functions of the parameters. For each i 1 i n {A, C, G, T } n, φ T i 1 i n (θ) gives the probability of the column (i 1,..., i n ) in the alignment for the particular parameter choice θ. φ T i 1 i 2 i 3 (π, a, b, c, d) = π j1 a j2,j 1 b i1,j 1 c i2,j 2 d i3,j 2 j 1 j 2 The phylogenetic model is the set M T = φ T (Θ) R 4n. Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 6 / 28

Phylogenetic Varieties and Phylogenetic Invariants Let R[p] := R[p i1 i n : i 1 i n {A, C, G, T } n ] Definition Let I T := f R[p] : f (p) = 0 for all p M T R[p]. I T is the ideal of phylogenetic invariants of T. Let V T := {p R 4n : f (p) = 0 for all f I T }. V T is the phylogenetic variety of T. Note that M T V T. Since M T is image of a polynomial map dim M T = dim V T. Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 7 / 28

X 1 X X X 2 3 4 p lmno = 4 4 4 π i a ij b ik c jl d jm e kn f ko i=1 j=1 k=1 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 8 / 28

X 1 X X X 2 3 4 p lmno = 4 4 4 π i a ij b ik c jl d jm e kn f ko i=1 j=1 k=1 4 a ij c jl d jm j=1 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 8 / 28

X 1 X X X 2 3 4 p lmno = 4 4 4 π i a ij b ik c jl d jm e kn f ko i=1 j=1 k=1 4 4 a ij c jl d jm b ik e kn f ko j=1 k =1 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 8 / 28

X 1 X X X 2 3 4 p lmno = = 4 4 4 π i a ij b ik c jl d jm e kn f ko i=1 j=1 k=1 4 4 4 a ij c jl d jm b ik e kn f ko π i i=1 j=1 k =1 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 8 / 28

X 1 X X X 2 3 4 p lmno = = 4 4 4 π i a ij b ik c jl d jm e kn f ko i=1 j=1 k=1 4 4 4 a ij c jl d jm b ik e kn f ko π i i=1 j=1 k =1 p 1111 p 1112 p 1144 p = rank 1211 p 1212 p 1244...... 4 p 4411 p 4412 p 4444 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 8 / 28

Splits and Phylogenetic Invariants Definition A split of a set is a bipartition A B. A split A B of the leaves of a tree T is valid for T if the induced trees T A and T B do not intersect. X 1 X X X 2 3 4 Valid: 12 34 Not Valid: 13 24 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 9 / 28

2-way Flattenings and Matrix Ranks Proposition Let P M T. p ijkl = P(X 1 = i, X 2 = j, X 3 = k, X 4 = l) p AAAA p AAAC p AAAG p AATT p ACAA p ACAC p ACAG p ACTT Flat 12 34 (P) =....... p TTAA p TTAC p TTAG p TTTT If A B is a valid split for T, then rank(flat A B (P)) 4. Invariants in I T are subdeterminants of Flat A B (P). If C D is not a valid split for T, then generically rank(flat C D (P)) > 4. Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 10 / 28

Phylogenetic Algebraic Geometry Phylogenetic Algebraic Geometry is the study of the phylogenetic varieties and ideals V T and I T. Using Phylogenetic Invariants to Reconstruct Trees Identifiability of Phylogenetic Models Cool Math For Its Own Sake Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 11 / 28

Using Phylogenetic Invariants to Reconstruct Trees Definition A phylogenetic invariant f I T is phylogenetically informative if there is some other tree T such that f / I T. Idea of Cavender-Felsenstein (1987), Lake (1987): Evaluate phylogenetically informative phylogenetic invariants at empirical distribution ˆp to reconstruct phylogenetic trees Proposition For each n-leaf trivalent tree T, let F T I T be a set of phylogenetic invariants such that, for each T T, there is an f F T, such that f / I T. Let f T := f F T f. Then for generic p M T, f T (p) = 0 if and only if p M T. Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 12 / 28

Performance of Invariants Methods in Simulations Huelsenbeck (1995) did a systematic simulation comparison of 26 different methods of constructing a phylogenetic tree on 4 leaf trees. Invariant-based methods did poorly. HOWEVER... Huelsenbeck only used linear invariants. Casanellas, Fernandez-Sanchez (2006) redid these simulations using a generating set of the phylogenetic ideal I T. Phylogenetic invariants become comparable to other methods. For the particular model studied in Casanellas, Fernandez-Sanchez (2006) for a tree with 4 leaves, the ideal I T has 8002 generators. f T := f F T f is a sum of 8002 terms. Major work to overcome combinatorial explosion for larger trees. Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 13 / 28

Identifiability of Phylogenetic Models Definition A parametric statistical model is identifiable if it gives 1-to-1 map from parameters to probability distributions. Is it possible to infer the parameters of the model from data? Identifiability guarantees consistency of statistical methods (ML) Two types of parameters to consider for phylogenetic models: Numerical parameters (transition matrices) Tree parameter (combinatorial type of tree) Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 14 / 28

Geometric Perspective on Identifiability Definition The tree parameter T in a phylogenetic model is identifiable if for all p M T there does not exist another T T such that p M T. M 1 M 2 M1 M2 M 3 Identifiable Not Identifiable Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 15 / 28

Generic Identifiability Definition The tree parameter in a phylogenetic model is generically identifiable if for all n-leaf trees with T T, dim(m T M T ) < min(dim(m T ), dim(m T )). M1 M2 M3 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 16 / 28

Proving Identifiability with Algebraic Geometry Proposition Let M 0 and M 1 be two algebraic models. If there exist phylogenetic invariants f 0 and f 1 such that f i (p) = 0 for all p M i, and f i (q) 0 for some q M 1 i, then dim(m 0 M 1 ) < min(dim M 0, dim M 1 ). M0 f = 0 M 1 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 17 / 28

Phylogenetic Models are Identifiable Theorem The tree parameter of phylogenetic models is generically identifiable. Proof. Edge flattening invariants can detect which splits are implied by a specific distribution in M T. The splits in T uniquely determine T. Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 18 / 28

Phylogenetic Mixture Models Basic phylogenetic model assume site independence This assumption is not accurate within a single gene Some sites more important: unlikely to change Tree structure may vary across genes Leads to mixture models for different classes of sites M(T, r) denotes a same tree mixture model with underlying tree T and r classes of sites Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 19 / 28

Theorem (Rhodes-Sullivant 2011) The tree and numerical parameters in a r-class, same tree phylogenetic mixture model on n-leaf trivalent trees are generically identifiable, if r < 4 n/4. Proof Ideas. Phylogenetic invariants from flattenings Tensor rank (Kruskal s Theorem) [Allman-Matias-Rhodes 2009] Elementary tree combinatorics Solving tree and numerical parameter identifiability at the same time Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 20 / 28

Phylogenetic Varieties What are They? Question Phylogenetic algebraic geometry is filled with interesting varieties and ideals. Are they familiar objects from classical algebraic geometry? Definition The Cavender-Farris-Neyman model (CFN) is the phylogenetic model where each random variable has two states {R, Y }, or {0, 1}, and the Markov transition matrix is symmetric. ( ) 1 ae a e a e 1 a e Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 21 / 28

The Two-State Model is a Toric Variety Consider binary states in CFN as {0, 1} = Z/2Z. Transitions probabilities satisfy Prob(X = g Y = h) = f (g + h). This means that the formula for Prob(X 1 = g 1,..., X n = g n ) is a convolution (over (Z/2Z) n ). Apply discrete Fourier transform to turn convolution into a product. Theorem (Hendy-Penny 1993, Evans-Speed 1993) In the Fourier coordinates, the CFN model is parametrized by monomial functions in terms of the Fourier parameters. In particular, the CFN model is a toric variety. Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 22 / 28

Equations for the CFN Model Theorem (Sturmfels-Sullivant 2005) For any tree T, the toric ideal I T for the CFN model is generated by degree 2 determinantal equations. X 1 X 2 X3 X4 Fourier coordinates: q lmno = r,s,t,u {0,1} ( 1)rl+sm+tn+uo p rstu ( ) q0000 q 0001 q 0010 q 0011 q 1100 q 1101 q 1110 q 1111 ( q0100 q 0101 q 0110 ) q 0111 q 1000 q 1001 q 1010 q 1011 I T generated by 2 2 minors of: q 0000 q 0011 q 0001 q 0010 q 0100 q 0111 q 0101 q 0110 q 1000 q 1011 q 1001 q 1010 q 1100 q 1111 q 1101 q 1110 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 23 / 28

Buczynska-Wisniewski Theorem Definition For graded K-algebra R = t N R t, its Hilbert function is Hilb R (t) = dim K R t. For large t >> 0, Hilb R (t) equals a polynomial: the Hilbert polynomial. The Hilbert polynomial of a projective variety V is the Hilbert polynomial of the homogeneous coordinate ring K[p]/I(V ). Theorem (Buczynska-Wisniewski 2007) For any trivalent tree T with n leaves, under the CFN model the phylogenetic varieties V T have the same Hilbert polynomial. Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 24 / 28

The Hilbert Scheme Definition The Hilbert scheme is a quasi-projective scheme parametrizing all ideals in K[p] with a given fixed Hilbert polynomial. T T T The BW theorem was proven by combinatorially shifting T to another tree T, which preserves the Hilbert function. These are deformation/degeneration steps in the Hilbert scheme. Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 25 / 28

Cox Rings Theorem (Sturmfels-Xu 2010) Let T be any trivalent tree with n leaves. Then K[p]/I T is a flat degeneration of the Cox-Nagata ring of the blow-up of P n 3 in n points: Proof Sketch. Cox (Bl n P n 3 ) Castravet-Tevelev Theorem on finite generation of Cox (Bl n P n 3 ), and connection to Hilbert s 14th problem. Verlinde formula from mathematical physics (which is the multigraded Hilbert function common to all rings K[p]/I T ). SAGBI degeneration. Generalized to arbitrary genus (and arbitrary trivalent graphs) in [Manon 2009] Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 26 / 28

Summary Phylogenetic models are fundamentally algebraic-geometric objects. Algebraic perspective is useful for: Developing new construction algorithms Proving theorems about identifiability (currently best available for mixture models) Phylogenetics motivates lots of interesting new mathematics Long way to go: Your Help Needed! Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 27 / 28

References E. Allman, C. Matias, J. Rhodes. Identifiability of parameters in latent structure models with many observed variables. Annals of Statistics, 37 no.6a (2009) 3099-3132. M. Casanellas, J. Fernandez-Sanchez. Performance of a new invariants method on homogeneous and non-homogeneous quartet trees, Molecular Biology and Evolution, 24(1):288-293, 2006. J. Cavender, J. Felsenstein. Invariants of phylogenies: a simple case with discrete states. Journal of Classification 4 (1987) 57-71. S. Evans and T. Speed. Invariants of some probability models used in phylogenetic inference. Annals of Statistics 21 (1993) 355-377. M. Hendy, D. Penny. Spectral analysis of phylogenetic data. J. Classification 10 (1993) 5 24. J. Huelsenbeck. Performance of phylogenetic methods in simulation. Systematic Biology 42 no.1 (1995) 17 48. J. Lake. A rate-independent technique for analysis of nucleaic acid sequences: evolutionary parsimony. Molecular Biology and Evolution 4 (1987) 167-191. C. Manon. The algebra of conformal blocks, 2009. 0910.0577 J. Rhodes, S. Sullivant. Identifiability of large phylogenetic mixture models. To appear Bulletin of Mathematical Biology, 2011. 1011.4134 B. Sturmfels, S. Sullivant. Toric ideals of phylogenetic invariants. Journal of Computational Biology 12 (2005) 204-228. B. Sturmfels, Z. Xu. Sagbi bases of Cox-Nagata rings. Journal of the European Mathematical Society 12 (2010) 429-459. Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 28 / 28