Geometry of Phylogenetic Inference

Similar documents
PHYLOGENETIC ALGEBRAIC GEOMETRY

Open Problems in Algebraic Statistics

Algebraic Statistics Tutorial I

Asymptotic Approximation of Marginal Likelihood Integrals

Language Acquisition and Parameters: Part II

Secant varieties of toric varieties

Phylogenetic Algebraic Geometry

Algebraic Classification of Small Bayesian Networks

Combinatorics and geometry of E 7

Dissimilarity maps on trees and the representation theory of GL n (C)

Tropical Varieties. Jan Verschelde

Combinatorial Types of Tropical Eigenvector

Using algebraic geometry for phylogenetic reconstruction

Discriminants, resultants, and their tropicalization

Tropical Geometry Homework 3

Varieties of Signature Tensors Lecture #1

ALGEBRA: From Linear to Non-Linear. Bernd Sturmfels University of California at Berkeley

A new parametrization for binary hidden Markov modes

Toric Varieties in Statistics

Tropical Islands. Jan Verschelde

Lecture 1. Toric Varieties: Basics

ON THE RANK OF A TROPICAL MATRIX

Example: Hardy-Weinberg Equilibrium. Algebraic Statistics Tutorial I. Phylogenetics. Main Point of This Tutorial. Model-Based Phylogenetics

Is Every Secant Variety of a Segre Product Arithmetically Cohen Macaulay? Oeding (Auburn) acm Secants March 6, / 23

arxiv: v1 [cs.cl] 5 Dec 2017

PRIMARY DECOMPOSITION FOR THE INTERSECTION AXIOM

Toric Fiber Products

A tropical approach to secant dimensions

E. GORLA, J. C. MIGLIORE, AND U. NAGEL

Linear Dynamical Systems (Kalman filter)

TORIC REDUCTION AND TROPICAL GEOMETRY A.

Algebraic combinatorics for computational biology. Nicholas Karl Eriksson. B.S. (Massachusetts Institute of Technology) 2001

ELIZABETH S. ALLMAN and JOHN A. RHODES ABSTRACT 1. INTRODUCTION

Testing Algebraic Hypotheses

Tropicalizations of Positive Parts of Cluster Algebras The conjectures of Fock and Goncharov David Speyer

5. Grassmannians and the Space of Trees In this lecture we shall be interested in a very particular ideal. The ambient polynomial ring C[p] has ( n

Combinatorial Aspects of Tropical Geometry and its interactions with phylogenetics

BOUNDS ON THE NUMBER OF INFERENCE FUNCTIONS OF A GRAPHICAL MODEL

arxiv: v1 [q-bio.pe] 23 Nov 2017

Lecture 3: Tropicalizations of Cluster Algebras Examples David Speyer

Homogeneous Coordinate Ring

Toric Ideals, an Introduction

Multivariate Gaussians, semidefinite matrix completion, and convex algebraic geometry

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

COMBINATORIAL SYMBOLIC POWERS. 1. Introduction The r-th symbolic power of an ideal I in a Notherian ring R is the ideal I (r) := R 1

Tropical Elliptic Curves and their j-invariant

(1) is an invertible sheaf on X, which is generated by the global sections

Tropical decomposition of symmetric tensors

Nonlinear Discrete Optimization

The Generalized Neighbor Joining method

Markov Chains and Hidden Markov Models

7. Shortest Path Problems and Deterministic Finite State Systems

x y = x + y. For example, the tropical sum 4 9 = 4, and the tropical product 4 9 = 13.

Problems on Minkowski sums of convex lattice polytopes

The partial-fractions method for counting solutions to integral linear systems

Computing toric degenerations of flag varieties

arxiv: v1 [quant-ph] 19 Jan 2010

Dissimilarity Vectors of Trees and Their Tropical Linear Spaces

T -equivariant tensor rank varieties and their K-theory classes

on Newton polytopes, tropisms, and Puiseux series to solve polynomial systems

Course 495: Advanced Statistical Machine Learning/Pattern Recognition

Introduction to Algebraic Statistics

26 : Spectral GMs. Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G.

Tensor decomposition and moment matrices

MATH 323 Linear Algebra Lecture 12: Basis of a vector space (continued). Rank and nullity of a matrix.

When does a mixture of products contain a product of mixtures?

AN INTRODUCTION TO AFFINE TORIC VARIETIES: EMBEDDINGS AND IDEALS

Tropical Implicitization. Maria Angelica Cueto

Vector bundles in Algebraic Geometry Enrique Arrondo. 1. The notion of vector bundle

Why is Deep Learning so effective?

An introduction to tropical geometry

Tensor Networks in Algebraic Geometry and Statistics

AN INTRODUCTION TO TORIC SURFACES

INTRODUCTION TO TROPICAL ALGEBRAIC GEOMETRY

SECANT VARIETIES OF TORIC VARIETIES arxiv:math/ v4 [math.ag] 27 Apr 2006

SECANT DIMENSIONS OF LOW-DIMENSIONAL HOMOGENEOUS VARIETIES

The intersection axiom of

Algebraic Statistics progress report

Integer points enumerator of hypergraphic polytopes

Oeding (Auburn) tensors of rank 5 December 15, / 24

Tropical Algebraic Geometry

STA 414/2104: Machine Learning

Models of Language Evolution: Part II

HILBERT BASIS OF THE LIPMAN SEMIGROUP

arxiv:math/ v2 [math.ag] 17 May 2005

MATHEMATICAL ENGINEERING TECHNICAL REPORTS. Markov Bases for Two-way Subtable Sum Problems

Algebraic Complexity in Statistics using Combinatorial and Tensor Methods

Ranks of Real Symmetric Tensors

Counting matrices over finite fields

11 The Max-Product Algorithm

THE NEWTON POLYTOPE OF THE IMPLICIT EQUATION

The Geometry of Matrix Rigidity

Noetherianity up to symmetry

arxiv: v1 [math.ag] 19 May 2010

BMI/CS 776 Lecture 4. Colin Dewey

Triangulations and soliton graphs

The calibrated trifocal variety

MAXIMUM LIKELIHOOD ESTIMATION AND EM FIXED POINT IDEALS FOR BINARY TENSORS. Daniel Lemke

Identifiability of latent class models with many observed variables

STA 4273H: Statistical Machine Learning

Transcription:

Geometry of Phylogenetic Inference Matilde Marcolli CS101: Mathematical and Computational Linguistics Winter 2015

References N. Eriksson, K. Ranestad, B. Sturmfels, S. Sullivant, Phylogenetic algebraic geometry, in Projective varieties with unexpected properties, pp. 237 255, Walter de Gruyter, 2005. L. Pacher, B. Sturmfels, The Mathematics of Phylogenomics, SIAM Rev. 49 (2007), no. 1, 3 31. L. Pacher, B. Sturmfels, Tropical geometry of statistical models, Proc. Natl. Acad. Sci. USA 101 (2004), no. 46, 16132 16137 M. Drton, B. Sturmfels, S. Sullivant, Lectures on Algebraic Statistics, Birkhäuser, 2009.

Hidden Markov Models n observed states Y 1,..., Y n, each taking l possible values n hidden states X 1,..., X n, each taking k possible values conditional independence P(X i X 1,..., X i 1 ) = P(X i X i 1 ) P(Y i X 1,..., X i, Y 1,..., Y i 1 ) = P(Y i X i ) special case: all transitions X i 1 X i same k k-stochastic matrix P = (p ij ); all transitions X i Y i same k l-stochastic matrix T = (t ij )

a HMM described by the image of a polynomial map Φ : R k(k+1) R ln of degree n 1 bi-homogeneous in the coordinates p ij and t ij plus added positivity and normalization conditions (stochastic matrices and probability distributions) Example with k = l = 2 and n = 3, Φ = (Φ ijk ) : R 8 R 8 Φ ijk = p 00 p 00 t 0i t 0j t 0k + p 00 p 01 t 0i t 0j t 1k + p 01 p 10 t 0i t 1j t 0k + p 01 p 11 t 0i t 1j t 1k + p 10 p 00 t 1i t 0j t 0k + p 10 p 01 t 1i t 0j t 1k + p 11 p 10 t 1i t 1j t 0k + p 11 p 11 t 1i t 1j t 1k

invariants of the HMM: polynomial functions on R ln that vanish on the image of Φ ideal I Φ generated by invariants? small k, l, n Gröbner bases; larger computationally hard Questions Viterbi sequence: find the most likely hidden data given observed data find all parameter values for a model that result in the same observed distribution find what parameter-independent relations hold between the observed probabilities P i1,...,i n = Φ i1,...,i n

Phylogenetic Algebraic Geometry T a rooted binary tree with n leaves (hence 2n 2 edges) At each vertex a binary random variable (e.g. one of the syntactic parameters) Probability distribution at the root vertex π = (p, 1 p) Along each edge e transition matrix: stochastic matrix P e = (p (e) ij ) with i p(e) ij = 1 these represent the probabilities that a mutation in the parameter happens along that edge

Model Parameters the random variables at the leaves of the tree are observed; the random variables at the interior nodes are hidden (assuming no direct knowledge of the ancient languages in the family) matrix entries of transition matrices P e and probability π at root vertex are model parameters number of parameters N = (2n 2)k 2 + k (binary variable k = 2)

Polynomial Map at the n leaves there are k n = 2 n possible observations the probability of an observation at the leaves is a polynomial function of the parameters can view this as a complex polynomial Φ : C N C 2n plus some (real) normalization conditions polytope R N + C N determined by the conditions π 1 + π 2 = 1 and i p(e) ij = 1 with π i 0 and p (e) ij 0 Φ should map to a cube I n in C 2n where [0, 1] I C 2 is I = {(p 1, p 2 ) p 1 + p 2 = 1, p i 0}

Example Φ ijk = π 0 a 0i b 00 c 0j d 0k +π 0 a 0i b 01 c 1j d 1k +π 1 a 1i b 10 c 0j d 0k +π 1 a 1i b 11 c 1j d 1k there are 8 such polynomials: i, j, k {0, 1}

polynomial Φ is homogeneous in the parameters can view Φ as a map of projective spaces in the previous example Φ : C 4 C 4 C 4 C 4 C 2 C 8 Φ : P 3 (C) P 3 (C) P 3 (C) P 3 (C) P 1 (C) P 7 (C) homogeneous with respect to each group of variables a, b, c, d, π the fibers of this morphism give all possible values of parameters (before imposing real normalization conditions) that give a certain probability at the leaves

Algebraic varieties occurring in these models Toric varieties (including Segre varieties and Veronese varieties) Determinantal varieties: the tree structure imposes rank constraints on matrices built starting from observed probabilities at the leaves Example: Segre embedding P 1 P 1 P 1 P 1 P 15 p ijkl = u i v j w k x l i, j, k, l {0, 1}

Prime ideal defining this variety: generated by 2 2 minors of 4 4-matrices corresponding to

Secant variety of the Segre variety X nine-dimensional subvariety of P 15 given by al 2 2 2 2-tensors of rank at most 2 p ijkl = π 0 u 0i v 0j w 0k x 0l + π 1 u 1i v 1j w 1k w 1l Ideal generated by all 3 3-minors of previous matrices X = X (12)(34) X (13)(24) X (14)(23)

Determinantal variety each determinantal variety corresponds to a Markov model on one of the binary trees: X (12)(34) is defined by p ijkl = π 0 (a 00 u 0i v 0j + a 01 u 1i v 1j )(b 00 w 0k x 0l + b 01 w 1k x 1l ) + π 1 (a 10 u 0i v 0j + a 10 u 1i v 1j )(b 10 w 0k x 0l + b 11 w 1k x 1l ) this corresponds to vanishing of all 3 3-minors in first matrix stratification of P 2n 1 by phylogenetic models X

Special case: Jukes-Cantor model special case where all the edge matrices P e have the form ( ) p0 p P e = 1 p 1 p 0 it is known that in this case an explicit change of coordinates describes it as a toric variety. General Idea of Phylogenetic Algebraic Geometry generators of the ideal defining the complex variety = phylogenetic invariants which phylogenetic invariants suffice to distinguish between different Markov models? parameter inference from tropicalization of the algebraic variety

Tropical Semiring min-plus (or tropical) semiring T = R { }, with operations and given by x y = min{x, y}, with the identity element for and with with 0 the identity element for x y = x + y, operations and satisfy associativity and commutativity and distributivity of the product over the sum addition is no longer invertible and is idemponent x x = min{x, x} = x

Tropical polynomials function φ : R n R of the form φ(x 1,..., x n ) = m j=1a j x k j1 1 x k jn n = min{ a 1 + k 11 x 1 + + k 1n x n, a 2 + k 21 x 1 + + k 2n x n, a m + k m1 x 1 + + k mn x n }. tropicalization: algebraic varieties become piecewise linear spaces can recover information about a variety from its tropicalization

in the previous HMM example with n = 3 and k = l = 2 the tropicalization of the polynomials Φ ijk Φ ijk = p 00 p 00 t 0i t 0j t 0k + p 00 p 01 t 0i t 0j t 1k + p 01 p 10 t 0i t 1j t 0k + p 01 p 11 t 0i t 1j t 1k + p 10 p 00 t 1i t 0j t 0k + p 10 p 01 t 1i t 0j t 1k + p 11 p 10 t 1i t 1j t 0k + p 11 p 11 t 1i t 1j t 1k is given by τ ijk = min{u h1 h 2 + u h2 h 3 + v h1 i + v h2 j + v h3 k (h 1, h 2, h 3 ) {0, 1} 3 } where u ab = log(p ab ) and v ab = log(t ab ) Viterbi sequence: (h 1, h 2, h 3 ) realizing mimimum, given observed (i, j, k) is the Viterbi sequence of hidden data

Newton polytope polynomial f = ω Z n a ωx ω with x ω = x ω 1 1 x n ωn Newton polytope N (f ) = Convex Hull{ω Z n a ω 0} R n N (f + g) = N (f ) N (g) and N (f g) = N (f ) + N (g) (Minkowski sum of polytopes P + Q = {x + y x P, y Q} normal fan C(N (f )): normal cones of all faces C F (N (f )) C F (N (f )) = {w R n F = F w (N (f ))} F w (N (f )) = {x N (f ) (x y) w 0 y N (f )}

the set of parameters U = (u ab ), V = (v ab ) in tropicalization τ ijk of Φ ijk that determine the Viterbi sequence (h 1, h 2, h 3 ) is the normal cone to a vertex of the Newton polygon N (Φ ijk ) given observed data (i, j, k) and hidden data (h 1, h 2, h 3 ) the normal cones of N (Φ ijk ) give all parameter values for which (h 1, h 2, h 3 ) is the most likely explanation for the observed (i, j, k) domains of linearity of the piecewise linear tropical τ ijk are the cones in the normal fan C F (N (Φ ijk )); each maximal cone corresponds to one set of hidden data (h 1, h 2, h 3 ) maximizing probability τ ijk = log P((X 1, X 2, X 3 ) = (h 1, h 2, h 3 ) (Y 1, Y 2, Y 3 ) = (i, j, k)) each vertex of the Newton polygon N (Φ ijk ) determines an inference function: (i, j, k) (h 1, h 2, h 3 ) that realize min τ ijk