Geometry of Phylogenetic Inference

Geometry of Phylogenetic Inference Matilde Marcolli CS101: Mathematical and Computational Linguistics Winter 2015

References N. Eriksson, K. Ranestad, B. Sturmfels, S. Sullivant, Phylogenetic algebraic geometry, in Projective varieties with unexpected properties, pp. 237 255, Walter de Gruyter, 2005. L. Pacher, B. Sturmfels, The Mathematics of Phylogenomics, SIAM Rev. 49 (2007), no. 1, 3 31. L. Pacher, B. Sturmfels, Tropical geometry of statistical models, Proc. Natl. Acad. Sci. USA 101 (2004), no. 46, 16132 16137 M. Drton, B. Sturmfels, S. Sullivant, Lectures on Algebraic Statistics, Birkhäuser, 2009.

Hidden Markov Models n observed states Y 1,..., Y n, each taking l possible values n hidden states X 1,..., X n, each taking k possible values conditional independence P(X i X 1,..., X i 1 ) = P(X i X i 1 ) P(Y i X 1,..., X i, Y 1,..., Y i 1 ) = P(Y i X i ) special case: all transitions X i 1 X i same k k-stochastic matrix P = (p ij ); all transitions X i Y i same k l-stochastic matrix T = (t ij )

a HMM described by the image of a polynomial map Φ : R k(k+1) R ln of degree n 1 bi-homogeneous in the coordinates p ij and t ij plus added positivity and normalization conditions (stochastic matrices and probability distributions) Example with k = l = 2 and n = 3, Φ = (Φ ijk ) : R 8 R 8 Φ ijk = p 00 p 00 t 0i t 0j t 0k + p 00 p 01 t 0i t 0j t 1k + p 01 p 10 t 0i t 1j t 0k + p 01 p 11 t 0i t 1j t 1k + p 10 p 00 t 1i t 0j t 0k + p 10 p 01 t 1i t 0j t 1k + p 11 p 10 t 1i t 1j t 0k + p 11 p 11 t 1i t 1j t 1k

invariants of the HMM: polynomial functions on R ln that vanish on the image of Φ ideal I Φ generated by invariants? small k, l, n Gröbner bases; larger computationally hard Questions Viterbi sequence: find the most likely hidden data given observed data find all parameter values for a model that result in the same observed distribution find what parameter-independent relations hold between the observed probabilities P i1,...,i n = Φ i1,...,i n

Phylogenetic Algebraic Geometry T a rooted binary tree with n leaves (hence 2n 2 edges) At each vertex a binary random variable (e.g. one of the syntactic parameters) Probability distribution at the root vertex π = (p, 1 p) Along each edge e transition matrix: stochastic matrix P e = (p (e) ij ) with i p(e) ij = 1 these represent the probabilities that a mutation in the parameter happens along that edge

Model Parameters the random variables at the leaves of the tree are observed; the random variables at the interior nodes are hidden (assuming no direct knowledge of the ancient languages in the family) matrix entries of transition matrices P e and probability π at root vertex are model parameters number of parameters N = (2n 2)k 2 + k (binary variable k = 2)

Polynomial Map at the n leaves there are k n = 2 n possible observations the probability of an observation at the leaves is a polynomial function of the parameters can view this as a complex polynomial Φ : C N C 2n plus some (real) normalization conditions polytope R N + C N determined by the conditions π 1 + π 2 = 1 and i p(e) ij = 1 with π i 0 and p (e) ij 0 Φ should map to a cube I n in C 2n where [0, 1] I C 2 is I = {(p 1, p 2 ) p 1 + p 2 = 1, p i 0}

Example Φ ijk = π 0 a 0i b 00 c 0j d 0k +π 0 a 0i b 01 c 1j d 1k +π 1 a 1i b 10 c 0j d 0k +π 1 a 1i b 11 c 1j d 1k there are 8 such polynomials: i, j, k {0, 1}

polynomial Φ is homogeneous in the parameters can view Φ as a map of projective spaces in the previous example Φ : C 4 C 4 C 4 C 4 C 2 C 8 Φ : P 3 (C) P 3 (C) P 3 (C) P 3 (C) P 1 (C) P 7 (C) homogeneous with respect to each group of variables a, b, c, d, π the fibers of this morphism give all possible values of parameters (before imposing real normalization conditions) that give a certain probability at the leaves

Algebraic varieties occurring in these models Toric varieties (including Segre varieties and Veronese varieties) Determinantal varieties: the tree structure imposes rank constraints on matrices built starting from observed probabilities at the leaves Example: Segre embedding P 1 P 1 P 1 P 1 P 15 p ijkl = u i v j w k x l i, j, k, l {0, 1}

Prime ideal defining this variety: generated by 2 2 minors of 4 4-matrices corresponding to

Secant variety of the Segre variety X nine-dimensional subvariety of P 15 given by al 2 2 2 2-tensors of rank at most 2 p ijkl = π 0 u 0i v 0j w 0k x 0l + π 1 u 1i v 1j w 1k w 1l Ideal generated by all 3 3-minors of previous matrices X = X (12)(34) X (13)(24) X (14)(23)

Determinantal variety each determinantal variety corresponds to a Markov model on one of the binary trees: X (12)(34) is defined by p ijkl = π 0 (a 00 u 0i v 0j + a 01 u 1i v 1j )(b 00 w 0k x 0l + b 01 w 1k x 1l ) + π 1 (a 10 u 0i v 0j + a 10 u 1i v 1j )(b 10 w 0k x 0l + b 11 w 1k x 1l ) this corresponds to vanishing of all 3 3-minors in first matrix stratification of P 2n 1 by phylogenetic models X

Special case: Jukes-Cantor model special case where all the edge matrices P e have the form ( ) p0 p P e = 1 p 1 p 0 it is known that in this case an explicit change of coordinates describes it as a toric variety. General Idea of Phylogenetic Algebraic Geometry generators of the ideal defining the complex variety = phylogenetic invariants which phylogenetic invariants suffice to distinguish between different Markov models? parameter inference from tropicalization of the algebraic variety

Tropical Semiring min-plus (or tropical) semiring T = R { }, with operations and given by x y = min{x, y}, with the identity element for and with with 0 the identity element for x y = x + y, operations and satisfy associativity and commutativity and distributivity of the product over the sum addition is no longer invertible and is idemponent x x = min{x, x} = x

Tropical polynomials function φ : R n R of the form φ(x 1,..., x n ) = m j=1a j x k j1 1 x k jn n = min{ a 1 + k 11 x 1 + + k 1n x n, a 2 + k 21 x 1 + + k 2n x n, a m + k m1 x 1 + + k mn x n }. tropicalization: algebraic varieties become piecewise linear spaces can recover information about a variety from its tropicalization

in the previous HMM example with n = 3 and k = l = 2 the tropicalization of the polynomials Φ ijk Φ ijk = p 00 p 00 t 0i t 0j t 0k + p 00 p 01 t 0i t 0j t 1k + p 01 p 10 t 0i t 1j t 0k + p 01 p 11 t 0i t 1j t 1k + p 10 p 00 t 1i t 0j t 0k + p 10 p 01 t 1i t 0j t 1k + p 11 p 10 t 1i t 1j t 0k + p 11 p 11 t 1i t 1j t 1k is given by τ ijk = min{u h1 h 2 + u h2 h 3 + v h1 i + v h2 j + v h3 k (h 1, h 2, h 3 ) {0, 1} 3 } where u ab = log(p ab ) and v ab = log(t ab ) Viterbi sequence: (h 1, h 2, h 3 ) realizing mimimum, given observed (i, j, k) is the Viterbi sequence of hidden data

Newton polytope polynomial f = ω Z n a ωx ω with x ω = x ω 1 1 x n ωn Newton polytope N (f ) = Convex Hull{ω Z n a ω 0} R n N (f + g) = N (f ) N (g) and N (f g) = N (f ) + N (g) (Minkowski sum of polytopes P + Q = {x + y x P, y Q} normal fan C(N (f )): normal cones of all faces C F (N (f )) C F (N (f )) = {w R n F = F w (N (f ))} F w (N (f )) = {x N (f ) (x y) w 0 y N (f )}

the set of parameters U = (u ab ), V = (v ab ) in tropicalization τ ijk of Φ ijk that determine the Viterbi sequence (h 1, h 2, h 3 ) is the normal cone to a vertex of the Newton polygon N (Φ ijk ) given observed data (i, j, k) and hidden data (h 1, h 2, h 3 ) the normal cones of N (Φ ijk ) give all parameter values for which (h 1, h 2, h 3 ) is the most likely explanation for the observed (i, j, k) domains of linearity of the piecewise linear tropical τ ijk are the cones in the normal fan C F (N (Φ ijk )); each maximal cone corresponds to one set of hidden data (h 1, h 2, h 3 ) maximizing probability τ ijk = log P((X 1, X 2, X 3 ) = (h 1, h 2, h 3 ) (Y 1, Y 2, Y 3 ) = (i, j, k)) each vertex of the Newton polygon N (Φ ijk ) determines an inference function: (i, j, k) (h 1, h 2, h 3 ) that realize min τ ijk