Introduction to Algebraic Statistics

Introduction to Algebraic Statistics Seth Sullivant North Carolina State University January 5, 2017 Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 1 / 28

What is Algebraic Statistics? Observation Statistical models are semialgebraic sets. Tools from algebraic geometry can uncover properties of statistical models. Algebro-geometric structure leads to new methods to analyze data. Algebraic properties of the model explain properties of inference procedures. Theoretical questions about statistical models lead to challenging problems in algebraic geometry. Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 2 / 28

Semialgebraic Sets R[p 1,..., p r ] = ring of polynomials in indeterminates p 1,..., p r with coefficients in R. Definition A basic semialgebraic set is the solution to finitely many polynomial equations and inequalities over R. That is, a set of the form: {a R r : f (a) = 0 for all f F, g(a) > 0 for all g G} where F, G R[p 1,..., p r ] are finite. A semialgebraic set is the union of finitely many basic semialgebraic sets. Semialgebraic sets are the basic objects of study in real algebraic geometry. Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 3 / 28

Semialgebraic Sets M 2 = {(p 0, p 1, p 2 ) R 3 : p 0 0, p 1 0, p 2 0, p 0 + p 1 + p 2 = 1, 4p 0 p 2 = p 2 1 } Theorem (Tarski-Seidenberg) The set of semialgebraic sets is closed under rational maps. That is, if S R r is a semialgebraic set, and φ : R r R s is a rational map then φ(s) R s is a semialgebraic set. Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 4 / 28

Statistical Models Definition A statistical model is a set of probability distributions. For this definition to be useful for us: Focus on a fixed state space. The set of distributions should be related to each other in a meaningful way. To keep things simple for this talk: State space will always be a finite set, like [r] := {1, 2,..., r}. In this case a probability distribution is just a point p R r : p = (p 1,..., p r ) = (P(X = 1),..., P(X = r)) R r satisfying r i=1 p i = 1 and p i 0 for all i [r]. Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 5 / 28

Example: Binomial Random Variable Model We have a biased coin with probability θ of getting a head. Flip the coin r times, and count the number of heads X. ( ) r P(X = i) = θ i (1 θ) r i i Binomial probability distribution is the point in R r+1. ((1 θ) r, rθ(1 θ) r 1,..., rθ r 1 (1 θ), θ r ) The binomial random variable model is the set {( ) } M r := (1 θ) r, rθ(1 θ) r 1,..., rθ r 1 (1 θ), θ r : θ [0, 1] M 2 = {(p 0, p 1, p 2 ) R 3 : p 0 0, p 1 0, p 2 0, p 0 + p 1 + p 2 = 1, 4p 0 p 2 = p1 2 } Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 6 / 28

Independence Model Definition Discrete random variables X and Y are independent if for all i and j Denote this X Y. P(X = i, Y = j) = P(X = i) P(Y = j) (1) Fix state space of X and Y as [r] and [s]. The model of independence M X Y Rr s is the set of all distribution satisfying (1). p 11 p 1s M X Y = p =..... : p satisfies (1) p r1 p rs Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 7 / 28

Independence Model Note that: means that p 11 p 1s p =..... p r1 p rs P(X = i, Y = j) = P(X = i) P(Y = j) = P(X = 1). P(X = r) ( P(Y = 1) P(Y = s) ) Proposition The independence model M X Y is the semialgebraic set {p R r s : p ij 0 for all i, j pij = 1, rank p = 1} Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 8 / 28

Statistical Models are Semialgebraic Sets Most statistical models are given parametrically as images of rational functions, where the domain is a polyhedron. φ : [0, 1] R r+1, θ ((1 ) θ) r, rθ(1 θ) r 1,..., rθ r 1 (1 θ), θ r Tarski-Seidenberg Theorem guarantees these models are semialgebraic sets. Problems to study with any family of models: Characterize the model implicitly. Characterize model properties. Use the algebraic structure to analyze procedures based on the model. Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 9 / 28

Phylogenetics Problem Given a collection of species, find the tree that explains their history. Data consists of aligned DNA sequences from homologous genes Human: Chimp: Gorilla: Seth Sullivant (NCSU)...ACCGTGCAACGTGAACGA......ACCTTGGAAGGTAAACGA......ACCGTGCAACGTAAACTA... Algebraic Statistics January 5, 2017 10 / 28

Model-Based Phylogenetics Use a probabilistic model of mutations Parameters for the model are the combinatorial tree T, and rate parameters for mutations on each edge Models give a probability for observing a particular aligned collection of DNA sequences Human: Chimp: Gorilla: ACCGTGCAACGTGAACGA ACGTTGCAAGGTAAACGA ACCGTGCAACGTAAACTA Assuming site independence, data is summarized by empirical distribution of columns in the alignment. e.g. ˆp(AAA) = 6 18, ˆp(CGC) = 2 18, etc. Use empirical distribution and test statistic to find tree best explaining data Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 11 / 28

Phylogenetic Models Assuming site independence: Phylogenetic Model is a latent class graphical model Leaf v T is random variable X v {A, C, G, T}. Internal nodes v T are latent random variables Y v X 1 X 2 X 3 Y 2 Y 1 P(x 1, x 2, x 3 ) = y 1 y 2 P 1 (y 1 )P 2 (y 2 y 1 )P 3 (x 1 y 1 )P 4 (x 2 y 2 )P 5 (x 3 y 2 ) Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 12 / 28

Phylogenetic Models Assuming site independence: Phylogenetic Model is a latent class graphical model Leaf v T is random variable X v {A, C, G, T}. Internal nodes v T are latent random variables Y v X 1 X 2 X 3 Y 2 Y 1 p i1 i 2 i 3 = j 1 j 2 π j1 a j2,j 1 b i1,j 1 c i2,j 2 d i3,j 2 Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 13 / 28

Algebraic Perspective on Phylogenetic Models Once we fix a tree T with n leaves and model structure, we get a map φ T : Θ R 4n. Θ R d is a parameter space of numerical parameters (transition matrices associated to each edge). The map φ T is given by polynomial functions of the parameters. For each i 1 i n {A, C, G, T } n, φ T i 1 i n (θ) gives the probability of the column (i 1,..., i n ) in the alignment for the particular parameter choice θ. φ T i 1 i 2 i 3 (π, a, b, c, d) = π j1 a j2,j 1 b i1,j 1 c i2,j 2 d i3,j 2 j 1 j 2 The phylogenetic model is the set M T = φ T (Θ) R 4n. Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 14 / 28

Phylogenetic Varieties and Phylogenetic Invariants Let R[p] := R[p i1 i n : i 1 i n {A, C, G, T } n ] Definition Let I T := f R[p] : f (p) = 0 for all p M T R[p]. I T is the ideal of phylogenetic invariants of T. Let V T := {p R 4n : f (p) = 0 for all f I T }. V T is the phylogenetic variety of T. Note that M T V T. Since M T is image of a polynomial map dim M T = dim V T. Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 15 / 28

X 1 X X X 2 3 4 p lmno = T T T π i a ij b ik c jl d jm e kn f ko i=a j=a k=a Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 16 / 28

X 1 X X X 2 3 4 p lmno = T T T π i a ij b ik c jl d jm e kn f ko i=a j=a k=a T a ij c jl d jm j=a Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 16 / 28

X 1 X X X 2 3 4 p lmno = T T T π i a ij b ik c jl d jm e kn f ko i=a j=a k=a T T a ij c jl d jm b ik e kn f ko j=a k =A Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 16 / 28

X 1 X X X 2 3 4 p lmno = = T T T π i a ij b ik c jl d jm e kn f ko i=a j=a k=a T T T a ij c jl d jm b ik e kn f ko π i i=a j=a k =A Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 16 / 28

X 1 X X X 2 3 4 p lmno = = T T T π i a ij b ik c jl d jm e kn f ko i=a j=a k=a T T T a ij c jl d jm b ik e kn f ko π i i=a j=a k =A p AAAA p AAAC p AATT p = rank ACAA p ACAC p ACTT...... 4 p TTAA p TTAC p TTTT Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 16 / 28

Splits and Phylogenetic Invariants Definition A split of a set is a bipartition A B. A split A B of the leaves of a tree T is valid for T if the induced trees T A and T B do not intersect. X 1 X X X 2 3 4 Valid: 12 34 Not Valid: 13 24 Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 17 / 28

2-way Flattenings and Matrix Ranks Proposition Let P M T. p ijkl = P(X 1 = i, X 2 = j, X 3 = k, X 4 = l) p AAAA p AAAC p AAAG p AATT p ACAA p ACAC p ACAG p ACTT Flat 12 34 (P) =....... p TTAA p TTAC p TTAG p TTTT If A B is a valid split for T, then rank(flat A B (P)) 4. Invariants in I T are subdeterminants of Flat A B (P). If C D is not a valid split for T, then generically rank(flat C D (P)) > 4. Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 18 / 28

Using Matrix Flattenings Proposition The set of valid splits for a given tree uniquely determine the tree. Proposition Let P M T. If A B is a valid split for T, then rank(flat A B (P)) 4. Invariants in I T are subdeterminants of Flat A B (P). If C D is not a valid split for T, then generically rank(flat C D (P)) > 4. Can use flattening invariants to reconstruct phylogenetic trees (Eriksson 2005, Chifman-Kubatko 2014, Allman-Kubatko-Rhodes 2016) via the SVD. Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 19 / 28

Identifiability of Phylogenetic Models Definition A parametric statistical model is identifiable if it gives 1-to-1 map from parameters to probability distributions. Is it possible to infer the parameters of the model from data? Identifiability guarantees consistency of statistical methods (ML) Two types of parameters to consider for phylogenetic models: Numerical parameters (transition matrices) Tree parameter (combinatorial type of tree) Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 20 / 28

Generic Identifiability Definition The tree parameter in a phylogenetic model is generically identifiable if for all n-leaf trees with T T, dim(m T M T ) < min(dim(m T ), dim(m T )). M1 M2 M3 Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 21 / 28

Proving Identifiability with Algebraic Geometry Proposition Let M 0 and M 1 be two algebraic models. If there exist phylogenetically informative invariants f 0 and f 1 such that f i (p) = 0 for all p M i, and f i (q) 0 for some q M 1 i, then dim(m 0 M 1 ) < min(dim M 0, dim M 1 ). M0 f = 0 q M 1 Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 22 / 28

Phylogenetic Models are Identifiable Theorem The unrooted tree parameter of phylogenetic models is generically identifiable. Proof. Edge flattening invariants can detect which splits are implied by a specific distribution in M T. The splits in T uniquely determine T. These methods can be generalized to much more complicated settings: Phylogenetic mixture models (Allman-Petrovic-Rhodes-Sullivant 2009, Rhodes-Sullivant 2012, Long-Sullivant 2015) Species tree problems (Allman-Degnan-Rhodes 2011, Chifman-Kubatko 2014) K-mer methods (Allman-Rhodes-Sullivant 2016, Durden-Sullivant 201?) Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 23 / 28

Maximum Likelihood Estimation Let M R r a statistical model for discrete random variables. We collect i.i.d. data, X (1),..., X (n), we believe to be generated from a distribution in M. Let u = (u 1,..., u r ) be the vector of counts: u i = #{j : X (j) = i} Given the data u, the likelihood function is the monomial p u = p u 1 1 pu 2 2 pur r which is the probability of observing the data when P(X = i) = p i. The maximum likelihood estimate (MLE) of p given u is: argmax p u such that p M Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 24 / 28

Maximum Likelihood Degree Computing the maximum likelihood estimate argmax p u such that p M is an optimization problem of a polynomial constrained to a semialgebraic set. We can extended to the Zariski closure V = M, in which case the optimum should be a critical point of the likelihood function (for generic u). Theorem (Catanese-Hoşten-Khetan-Sturmfels 2006) The number of complex critical points of the likelihood function on V sm = V \ V sing is constant for generic u. This number if called the maximum likelihood degree of the variety V (or model M). Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 25 / 28

Independence Model Consider M X Y. Maximum likelihood estimation is: argmax i,j p u ij ij such that p M X Y But p M X Y iff there are α r, β s with p ij = α i β j. Becomes: argmax α u i+ i β u +j i such that α r, β s i j Maximum likelihood estimate is ˆp ij = u i+u +j u 2 ++ = u i+ u ++ u+j u ++ M X Y has ML-degree 1 (= MLE is rational function of the data). Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 26 / 28

ML-degree and Algebraic Geometry Theorem (Huh 2014) A variety V has maximum likelihood degree 1 if and only if it is a reduced A-discriminant. Theorem (Huh 2012) Suppose that V (C ) r is a smooth variety. Then the ML-degree of V is the reduced Euler characteristic of V (C ) r. When is a reduced A-discriminant a statistical model? Does large ML-degree affect algorithms for computing MLEs? Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 27 / 28

Further Applications of Algebraic Statistics Fisher s Exact Test (Diaconis-Sturmfels 1998) (Non)existence of maximum likelihood estimates (Fienberg) Bayesian integrals and information criteria (Watanabe 2009) Conditional independence structures (Studeny 2004) Graphical models (Lauritzen 1996) Parametric inference (Pachter-Sturmfels 2004) Learn More: Algebraic Statistics the book Draft version available at my website. Comments welcome! Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 28 / 28