Example: Hardy-Weinberg Equilibrium. Algebraic Statistics Tutorial I. Phylogenetics. Main Point of This Tutorial. Model-Based Phylogenetics


Algebraic Statistics Tutorial I
Seth Sullivant
North Carolina State University
July 22, 2012

Example: Hardy-Weinberg Equilibrium
Suppose a gene has two alleles, a and A. If allele a occurs in the population with frequency $\theta$ (and A with frequency $1-\theta$) and these alleles are in Hardy-Weinberg equilibrium, the genotype frequencies are
$P(X = aa) = \theta^2$, $P(X = aA) = 2\theta(1-\theta)$, $P(X = AA) = (1-\theta)^2$.
The model of Hardy-Weinberg equilibrium is the set
$M = \{ (\theta^2,\ 2\theta(1-\theta),\ (1-\theta)^2) \mid \theta \in [0,1] \} \subset \Delta_3$.
$I(M) = \langle p_{aa} + p_{aA} + p_{AA} - 1,\ p_{aA}^2 - 4 p_{aa} p_{AA} \rangle$

Main Point of This Tutorial
Many statistical models are described by (semi)-algebraic constraints on a natural parameter space.
Generators of the vanishing ideal can be useful for constructing algorithms or analyzing properties of a statistical model.
Two Examples:
Phylogenetic Algebraic Geometry
Sampling Contingency Tables

Phylogenetics
Given a collection of species, find the tree that explains their history.
Data consists of aligned DNA sequences from homologous genes.
Human: ...ACCGTGCAACGTGAACGA...
Chimp: ...ACCTTGGAAGGTAAACGA...
Gorilla: ...ACCGTGCAACGTAAACTA...
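Returning to the Hardy-Weinberg example: the two generators of $I(M)$ can be checked numerically. The following is a minimal sketch, not part of the original slides; the function names `hardy_weinberg` and `ideal_generators` are made up for illustration, and exact rational arithmetic is used so that "equals zero" is not blurred by rounding.

```python
from fractions import Fraction

def hardy_weinberg(theta):
    """Genotype frequencies (p_aa, p_aA, p_AA) for allele frequency theta."""
    return (theta**2, 2 * theta * (1 - theta), (1 - theta)**2)

def ideal_generators(p_aa, p_aA, p_AA):
    """Evaluate the two generators of I(M) at a point of probability space."""
    return (p_aa + p_aA + p_AA - 1, p_aA**2 - 4 * p_aa * p_AA)

# Points on the model: both generators vanish for every theta.
points = [hardy_weinberg(Fraction(k, 10)) for k in range(11)]
residuals = [ideal_generators(*p) for p in points]

# A distribution that sums to 1 but is not in Hardy-Weinberg equilibrium:
# the linear generator vanishes, the quadratic one does not.
off_model = ideal_generators(Fraction(1, 2), Fraction(1, 4), Fraction(1, 4))
```

The second generator is what separates the model from the rest of the simplex; the first only records that the coordinates sum to one.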

Model-Based Phylogenetics
Use a probabilistic model of mutations.
Parameters for the model are the combinatorial tree T, and rate parameters for mutations on each edge.
Models give a probability for observing a particular aligned collection of DNA sequences.
Human: ACCGTGCAACGTGAACGA
Chimp: ACGTTGCAAGGTAAACGA
Gorilla: ACCGTGCAACGTAAACTA
Assuming site independence, data is summarized by the empirical distribution of columns in the alignment, e.g. $\hat{p}(AAA) = 6/18$, $\hat{p}(CGC) = 2/18$, etc.
Use the empirical distribution and a test statistic to find the tree best explaining the data.

Phylogenetic Models
Assuming site independence:
Phylogenetic Model is a latent class graphical model.
Vertex $v \in T$ gives a random variable $X_v \in \{A, C, G, T\}$.
All random variables corresponding to internal nodes are latent.
[Figure: tree with leaves $X_1$, $X_2$, $X_3$ and internal nodes $Y_1$, $Y_2$]
$P(x_1, x_2, x_3) = \sum_{y_1} \sum_{y_2} P(y_1) P(y_2 \mid y_1) P(x_1 \mid y_1) P(x_2 \mid y_2) P(x_3 \mid y_2)$
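The empirical distribution of alignment columns mentioned on the Model-Based Phylogenetics slide can be computed directly from the three sequences shown there. This is a small illustrative sketch (the helper name `empirical_distribution` is an assumption, not something from the slides); it reads the alignment column by column and counts each pattern.

```python
from collections import Counter
from fractions import Fraction

# The 18-site alignment from the Model-Based Phylogenetics slide.
alignment = {
    "Human":   "ACCGTGCAACGTGAACGA",
    "Chimp":   "ACGTTGCAAGGTAAACGA",
    "Gorilla": "ACCGTGCAACGTAAACTA",
}

def empirical_distribution(sequences):
    """Map each column pattern (e.g. 'CGC') to its relative frequency."""
    columns = ["".join(col) for col in zip(*sequences)]
    counts = Counter(columns)
    n_sites = len(columns)
    return {pattern: Fraction(c, n_sites) for pattern, c in counts.items()}

p_hat = empirical_distribution(list(alignment.values()))
```

With this data, `p_hat["AAA"]` equals 6/18 and `p_hat["CGC"]` equals 2/18, matching the values quoted on the slide.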

Phylogenetic Models
[Figure: tree with leaves $X_1$, $X_2$, $X_3$ and internal nodes $Y_1$, $Y_2$]
$p_{i_1 i_2 i_3} = \sum_{j_1} \sum_{j_2} \pi_{j_1} a_{j_2,j_1} b_{i_1,j_1} c_{i_2,j_2} d_{i_3,j_2}$

Algebraic Perspective on Phylogenetic Models
Once we fix a tree T and model structure, we get a map $\phi^T : \Theta \to \mathbb{R}^{4^n}$.
$\Theta \subseteq \mathbb{R}^d$ is a parameter space of numerical parameters (transition matrices associated to each edge).
The map $\phi^T$ is given by polynomial functions of the parameters.
For each $i_1 \cdots i_n \in \{A, C, G, T\}^n$, $\phi^T_{i_1 \cdots i_n}(\theta)$ gives the probability of the column $(i_1, \ldots, i_n)$ in the alignment for the particular parameter choice $\theta$.
$\phi^T_{i_1 i_2 i_3}(\pi, a, b, c, d) = \sum_{j_1} \sum_{j_2} \pi_{j_1} a_{j_2,j_1} b_{i_1,j_1} c_{i_2,j_2} d_{i_3,j_2}$
The phylogenetic model is the set $M_T = \phi^T(\Theta) \subseteq \mathbb{R}^{4^n}$.

Phylogenetic Varieties and Phylogenetic Invariants
Let $\mathbb{R}[p] := \mathbb{R}[p_{i_1 \cdots i_n} : i_1 \cdots i_n \in \{A, C, G, T\}^n]$.
Let $I_T := \langle f \in \mathbb{R}[p] : f(p) = 0 \text{ for all } p \in M_T \rangle \subseteq \mathbb{R}[p]$.
$I_T$ is the ideal of phylogenetic invariants of T.
Let $V_T := \{ p \in \mathbb{R}^{4^n} : f(p) = 0 \text{ for all } f \in I_T \}$.
$V_T$ is the phylogenetic variety of T.
Note that $M_T \subseteq V_T$.
Since $M_T$ is the image of a polynomial map, $\dim M_T = \dim V_T$.

[Figure: tree with leaves $X_1$, $X_2$, $X_3$, $X_4$]
$p_{lmno} = \sum_{i=1}^{4} \sum_{j=1}^{4} \sum_{k=1}^{4} \pi_i a_{ij} b_{ik} c_{jl} d_{jm} e_{kn} f_{ko}$
$\phantom{p_{lmno}} = \sum_{i=1}^{4} \pi_i \Big( \sum_{j=1}^{4} a_{ij} c_{jl} d_{jm} \Big) \Big( \sum_{k=1}^{4} b_{ik} e_{kn} f_{ko} \Big)$
$\Longrightarrow \operatorname{rank} \begin{pmatrix} p_{1111} & p_{1112} & \cdots & p_{1144} \\ p_{1211} & p_{1212} & \cdots & p_{1244} \\ \vdots & \vdots & \ddots & \vdots \\ p_{4411} & p_{4412} & \cdots & p_{4444} \end{pmatrix} \le 4$

Splits and Phylogenetic Invariants
A split of a set is a bipartition $A|B$. A split $A|B$ of the leaves of a tree T is valid for T if the induced trees $T_A$ and $T_B$ do not intersect.
[Figure: tree with leaves $X_1$, $X_2$, $X_3$, $X_4$]
Valid: $12|34$
Not Valid: $13|24$

2-way Flattenings and Matrix Ranks
$p_{ijkl} = P(X_1 = i, X_2 = j, X_3 = k, X_4 = l)$
$\operatorname{Flat}_{12|34}(P) = \begin{pmatrix} p_{AAAA} & p_{AAAC} & p_{AAAG} & \cdots & p_{AATT} \\ p_{ACAA} & p_{ACAC} & p_{ACAG} & \cdots & p_{ACTT} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ p_{TTAA} & p_{TTAC} & p_{TTAG} & \cdots & p_{TTTT} \end{pmatrix}$
Proposition
Let $P \in M_T$.
If $A|B$ is a valid split for T, then $\operatorname{rank}(\operatorname{Flat}_{A|B}(P)) \le 4$. Invariants in $I_T$ are subdeterminants of $\operatorname{Flat}_{A|B}(P)$.
If $C|D$ is not a valid split for T, then generically $\operatorname{rank}(\operatorname{Flat}_{C|D}(P)) > 4$.
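The rank statement in the flattening proposition can be illustrated on a small case. The sketch below is an assumption-laden toy, not the setting of the slides: it uses 2 states instead of 4 (so the bound becomes rank at most 2 rather than 4), arbitrary rational transition matrices, and the 4-leaf parametrization $p_{lmno} = \sum \pi_i a_{ij} b_{ik} c_{jl} d_{jm} e_{kn} f_{ko}$. All helper names are made up for illustration.

```python
from fractions import Fraction as Fr
from itertools import product

STATES = range(2)  # assumption: binary states; the slides use {A, C, G, T}

def markov(x, y):
    """2x2 row-stochastic matrix with off-diagonal entries x and y."""
    return [[1 - x, x], [y, 1 - y]]

# Arbitrary illustrative parameters (not from the slides).
pi = [Fr(1, 3), Fr(2, 3)]
a, b = markov(Fr(1, 5), Fr(1, 7)), markov(Fr(1, 4), Fr(1, 9))
c, d = markov(Fr(1, 6), Fr(2, 7)), markov(Fr(1, 8), Fr(1, 3))
e, f = markov(Fr(2, 9), Fr(1, 10)), markov(Fr(1, 11), Fr(3, 10))

def joint(l, m, n, o):
    """p_lmno for the 4-leaf tree with valid split 12|34."""
    return sum(pi[i] * a[i][j] * b[i][k] * c[j][l] * d[j][m] * e[k][n] * f[k][o]
               for i, j, k in product(STATES, repeat=3))

P = {idx: joint(*idx) for idx in product(STATES, repeat=4)}

def flattening(P, rows, cols):
    """Matrix Flat_{rows|cols}(P); rows and cols are pairs of leaf positions."""
    labels = list(product(STATES, repeat=2))
    mat = []
    for r in labels:
        row = []
        for s in labels:
            idx = [None] * 4
            idx[rows[0]], idx[rows[1]] = r
            idx[cols[0]], idx[cols[1]] = s
            row.append(P[tuple(idx)])
        mat.append(row)
    return mat

def rank(mat):
    """Exact rank over the rationals by Gaussian elimination."""
    m = [row[:] for row in mat]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [x - factor * y for x, y in zip(m[i], m[r])]
        r += 1
    return r

rank_valid = rank(flattening(P, (0, 1), (2, 3)))    # split 12|34
rank_invalid = rank(flattening(P, (0, 2), (1, 3)))  # split 13|24
```

For these parameters the valid split gives rank 2 (the number of states) while the invalid split gives full rank 4, which is the binary analogue of the rank-4 bound stated above.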

Phylogenetic Algebraic Geometry
Phylogenetic Algebraic Geometry is the study of the phylogenetic varieties and ideals $V_T$ and $I_T$.
Using Phylogenetic Invariants to Reconstruct Trees
Identifiability of Phylogenetic Models
Interesting Math Useful in Other Problems

Using Phylogenetic Invariants to Reconstruct Trees
A phylogenetic invariant $f \in I_T$ is phylogenetically informative if there is some other tree $T'$ such that $f \notin I_{T'}$.
Idea of Cavender-Felsenstein (1987), Lake (1987): Evaluate phylogenetically informative phylogenetic invariants at the empirical distribution $\hat{p}$ to reconstruct phylogenetic trees.
Proposition
For each n-leaf trivalent tree T, let $F_T \subseteq I_T$ be a set of phylogenetic invariants such that, for each $T' \neq T$, there is an $f \in F_T$ such that $f \notin I_{T'}$. Let $f_T := \sum_{f \in F_T} |f|$. Then for generic $p \in \bigcup_{T'} M_{T'}$, $f_T(p) = 0$ if and only if $p \in M_T$.

Performance of Invariants Methods in Simulations
Huelsenbeck (1995) did a systematic simulation comparison of 26 different methods of constructing a phylogenetic tree on 4 leaf trees.
Invariant-based methods did poorly.
HOWEVER... Huelsenbeck only used linear invariants.
Casanellas, Fernandez-Sanchez (2006) redid these simulations using a generating set of the phylogenetic ideal $I_T$.
Phylogenetic invariants become comparable to other methods.
For the particular model studied in Casanellas, Fernandez-Sanchez (2006) for a tree with 4 leaves, the ideal $I_T$ has 8002 generators.
$f_T := \sum_{f \in F_T} |f|$ is a sum of 8002 terms.
Major work to overcome combinatorial explosion for larger trees.

Identifiability of Phylogenetic Models
A parametric statistical model is identifiable if it gives a 1-to-1 map from parameters to probability distributions.
Is it possible to infer the parameters of the model from data?
Identifiability guarantees consistency of statistical methods (ML).
Two types of parameters to consider for phylogenetic models:
Numerical parameters (transition matrices)
Tree parameter (combinatorial type of tree)

Geometric Perspective on Identifiability
The unrooted tree parameter T in a phylogenetic model is identifiable if for all $p \in M_T$ there does not exist another $T' \neq T$ such that $p \in M_{T'}$.

Generic Identifiability
The tree parameter in a phylogenetic model is generically identifiable if for all n-leaf trees with $T' \neq T$, $\dim(M_T \cap M_{T'}) < \min(\dim(M_T), \dim(M_{T'}))$.

[Figures on these two slides: models $M_1$, $M_2$, $M_3$, with panels labeled "Identifiable" and "Not Identifiable"]

Proving Identifiability with Algebraic Geometry
Proposition
Let $M_0$ and $M_1$ be two algebraic models. If there exist phylogenetically informative invariants $f_0$ and $f_1$ such that $f_i(p) = 0$ for all $p \in M_i$, and $f_i(q) \neq 0$ for some $q \in M_{1-i}$, then $\dim(M_0 \cap M_1) < \min(\dim M_0, \dim M_1)$.
[Figure: $M_0$, $M_1$ and the hypersurface $f = 0$]

Phylogenetic Models are Identifiable
Theorem
The unrooted tree parameter of phylogenetic models is generically identifiable.
Proof.
Edge flattening invariants can detect which splits are implied by a specific distribution in $M_T$.
The splits in T uniquely determine T.

Phylogenetic Mixture Models
The basic phylogenetic model assumes the same parameters at every site.
This assumption is not accurate within a single gene.
Some sites are more important: unlikely to change.
Tree structure may vary across genes.
Leads to mixture models for different classes of sites.
$M(T, r)$ denotes a same tree mixture model with underlying tree T and r classes of sites.

Identifiability Questions for Mixture Models
Question
For a fixed number of trees r, are the tree parameters $T_1, \ldots, T_r$, and rate parameters of each tree (generically) identified in phylogenetic mixture models?
$r = 1$ (Ordinary phylogenetic models): Most models are identifiable on 2, 3, 4 leaves. (Rogers, Chang, Steel, Hendy, Penny, Székely, Allman, Rhodes, Housworth, ...)
$r > 1$, $T_1 = T_2 = \cdots = T_r$ but no restriction on number of trees: Not identifiable (Matsen-Steel, Stefankovic-Vigoda)
$r > 1$, $T_i$ arbitrary: Not identifiable (Mossel-Vigoda)

Theorem (Rhodes-Sullivant 2011)
The unrooted tree and numerical parameters in an r-class, same tree phylogenetic mixture model on n-leaf trivalent trees are generically identifiable, if $r < 4^{n/4}$.
Proof Ideas.
Phylogenetic invariants from flattenings
Tensor rank (Kruskal's Theorem) [Allman-Matias-Rhodes 2009]
Elementary tree combinatorics
Solving tree and numerical parameter identifiability at the same time

How to Construct Phylogenetic Invariants?
Theorem (Sturmfels-S, Allman-Rhodes, Casanellas-S, Draisma-Kuttler)
Consider a "nice" algebraic phylogenetic model. The problem of computing phylogenetic invariants for any tree T can be reduced to the same problem for star trees $K_{1,k}$.
The ideal $I_T$ is generated by local contributions from each $K_{1,k}$, plus flattening invariants from edges.
The varieties $V_{K_{1,k}}$ are interesting classical algebraic varieties:
toric varieties
secant varieties $\operatorname{Sec}^4(\mathbb{P}^3 \times \mathbb{P}^3 \times \mathbb{P}^3)$

Group-based models
[Transition matrices shown on the slide:]
$\begin{pmatrix} \alpha & \beta \\ \beta & \alpha \end{pmatrix}$, $\begin{pmatrix} \alpha & \beta & \beta & \beta \\ \beta & \alpha & \beta & \beta \\ \beta & \beta & \alpha & \beta \\ \beta & \beta & \beta & \alpha \end{pmatrix}$, $\begin{pmatrix} \alpha & \beta & \gamma & \gamma \\ \beta & \alpha & \gamma & \gamma \\ \gamma & \gamma & \alpha & \beta \\ \gamma & \gamma & \beta & \alpha \end{pmatrix}$, $\begin{pmatrix} \alpha & \beta & \gamma & \delta \\ \beta & \alpha & \delta & \gamma \\ \gamma & \delta & \alpha & \beta \\ \delta & \gamma & \beta & \alpha \end{pmatrix}$
Random variables take values in a finite abelian group G.
Transition probabilities satisfy $\operatorname{Prob}(X = g \mid Y = h) = f(g + h)$.
This means that the formula for $\operatorname{Prob}(X_1 = g_1, \ldots, X_n = g_n)$ is a convolution (over $G^n$).
Apply the discrete Fourier transform to turn the convolution into a product.
Theorem (Hendy-Penny 1993, Evans-Speed 1993)
In the Fourier coordinates, a group-based model is parametrized by monomial functions in terms of the Fourier parameters. In particular, the CFN model is a toric variety.

Equations for the CFN Model
Theorem (Sturmfels-S 2005)
For any tree T, the toric ideal $I_T$ for the CFN model is generated by degree 2 determinantal equations.
[Figure: tree with leaves $X_1$, $X_2$, $X_3$, $X_4$]
Fourier coordinates: $q_{lmno} = \sum_{r,s,t,u \in \{0,1\}} (-1)^{rl + sm + tn + uo} p_{rstu}$
$I_T$ is generated by the $2 \times 2$ minors of:
$\begin{pmatrix} q_{0000} & q_{0001} & q_{0010} & q_{0011} \\ q_{1100} & q_{1101} & q_{1110} & q_{1111} \end{pmatrix}$, $\begin{pmatrix} q_{0100} & q_{0101} & q_{0110} & q_{0111} \\ q_{1000} & q_{1001} & q_{1010} & q_{1011} \end{pmatrix}$,
$\begin{pmatrix} q_{0000} & q_{0011} \\ q_{0100} & q_{0111} \\ q_{1000} & q_{1011} \\ q_{1100} & q_{1111} \end{pmatrix}$, $\begin{pmatrix} q_{0001} & q_{0010} \\ q_{0101} & q_{0110} \\ q_{1001} & q_{1010} \\ q_{1101} & q_{1110} \end{pmatrix}$

Gluing Two Trees at a Leaf
[Figure: two trees joined at a leaf, "+" and "="]
Let $T = T_1 \# T_2$, the tree obtained by joining two trees at a leaf.
Each ring $\mathbb{C}[p]/I_{T_1}$, $\mathbb{C}[p]/I_{T_2}$ is invariant under the action of the group $G = \mathrm{Gl}_r(\mathbb{C})^k$ acting on the glue leaves.
Theorem (Draisma-Kuttler)
$\mathbb{C}[p]/I_T = (\mathbb{C}[p]/I_{T_1} \otimes_{\mathbb{C}} \mathbb{C}[p]/I_{T_2})^G$
$V_T = (V_{T_1} \times V_{T_2}) /\!/ G$ (GIT quotient)
Actions of individual factors ($\mathrm{Gl}_r(\mathbb{C})$) do not interact.
Use the Reynolds operator and the first fundamental theorem of CIT.

Gluing more complex graphs
[Figure: two graphs glued together, "+" and "="]
Still a group action ($\mathrm{Gl}_r(\mathbb{C})^k$). But factors are not acting independently.
$\mathbb{C}[p]/I_G = (\mathbb{C}[p]/I_{G_1} \otimes_{\mathbb{C}} \mathbb{C}[p]/I_{G_2})^G$
$\mathbb{C}[p]/I_G$ is generated by the degree 1 part of $(\mathbb{C}[p]/I_{G_1} \otimes_{\mathbb{C}} \mathbb{C}[p]/I_{G_2})^G$ (toric fiber product if $r = 1$).
Theorem (Engström-Kahle-S)
Can determine generators of $I_G$ from $I_{G_1}$ and $I_{G_2}$ if the TFP has low codimension.
Useful for other problems in algebraic statistics.

Summary: Phylogenetic Algebraic Geometry
Phylogenetic models are fundamentally algebraic-geometric objects.
Algebraic perspective is useful for:
Developing new construction algorithms
Proving theorems about identifiability (currently best available for mixture models)
Leads to interesting new mathematics, useful for other problems.
Long way to go: Your Help Needed!

Problems
Theorem (Allman-Rhodes 2006)
Let T be a trivalent tree with n leaves, and consider the general Markov model on binary characters. The phylogenetic ideal $I_T$ has generating set
$\bigcup_{A|B \in \Sigma(T)} \{ 3 \times 3 \text{ minors of } \operatorname{Flat}_{A|B}(P) \}$
where $\Sigma(T)$ is the set of all valid splits on T. Note that P is a $2 \times 2 \times \cdots \times 2$, n-way tensor.
For the 5 leaf tree at the right, write down all the matrices $\operatorname{Flat}_{A|B}(P)$ that are needed in the previous theorem.
[Figure: 5-leaf tree with numbered leaves]

References
E. Allman, C. Matias, J. Rhodes. Identifiability of parameters in latent structure models with many observed variables. Annals of Statistics 37 no. 6A (2009) 3099-3132.
M. Casanellas, J. Fernandez-Sanchez. Performance of a new invariants method on homogeneous and non-homogeneous quartet trees. Molecular Biology and Evolution 24(1):288-293, 2006.
J. Cavender, J. Felsenstein. Invariants of phylogenies: a simple case with discrete states. Journal of Classification 4 (1987) 57-71.
J. Draisma, J. Kuttler. On the ideals of equivariant tree models. Mathematische Annalen 344(3):619-644, 2009.
A. Engström, T. Kahle, S. Sullivant. Multigraded commutative algebra of graph decompositions. arXiv:1102.2601
S. Evans and T. Speed. Invariants of some probability models used in phylogenetic inference. Annals of Statistics 21 (1993) 355-377.
M. Hendy, D. Penny. Spectral analysis of phylogenetic data. J. Classification 10 (1993) 5-24.
J. Huelsenbeck. Performance of phylogenetic methods in simulation. Systematic Biology 42 no. 1 (1995) 17-48.
J. Lake. A rate-independent technique for analysis of nucleic acid sequences: evolutionary parsimony. Molecular Biology and Evolution 4 (1987) 167-191.
F.A. Matsen and M. Steel. Phylogenetic mixtures on a single tree can mimic a tree of another topology. Systematic Biology, 2007.
E. Mossel and E. Vigoda. Phylogenetic MCMC Are Misleading on Mixtures of Trees. Science 309, 2207-2209 (2005).
J. Rhodes, S. Sullivant. Identifiability of large phylogenetic mixture models. To appear, Bulletin of Mathematical Biology, 2011. arXiv:1011.4134
D. Stefankovic and E. Vigoda. Pitfalls of Heterogeneous Processes for Phylogenetic Reconstruction. Systematic Biology 56(1):113-124, 2007.
B. Sturmfels, S. Sullivant. Toric ideals of phylogenetic invariants. Journal of Computational Biology 12 (2005) 204-228.

Algebraic Statistics Tutorial II
Seth Sullivant
North Carolina State University
June 10, 2012

Generating Random Tables
Generate a random table from the set of all nonnegative $k_1 \times k_2$ integer tables with given row and column sums.
[Table: a $3 \times 4$ array with row sums $r_1, r_2, r_3$ and column sums $c_1, c_2, c_3, c_4$]
Fisher's Exact Test, Missing Data Problems

Random Walk
[Example on the slide: adding a move to a table with fixed margins gives another table with the same row and column sums.]
The moves
$\begin{pmatrix} 1 & -1 & 0 \\ -1 & 1 & 0 \end{pmatrix}$, $\begin{pmatrix} 1 & 0 & -1 \\ -1 & 0 & 1 \end{pmatrix}$, $\begin{pmatrix} 0 & 1 & -1 \\ 0 & -1 & 1 \end{pmatrix}$
allow for a connected random walk over these contingency tables.

Connecting Lattice Points in Polytopes
Let $A : \mathbb{Z}^n \to \mathbb{Z}^d$ be a linear transformation, $b \in \mathbb{Z}^d$.
$A^{-1}[b] := \{ x \in \mathbb{N}^n : Ax = b \}$ (fiber)
$B \subseteq \ker_{\mathbb{Z}} A$
Let $A^{-1}[b]_B$ be the graph with vertex set $A^{-1}[b]$ and $u \sim v$ an edge if and only if $u - v \in \pm B$.
Given A and b, find a finite $B \subseteq \ker_{\mathbb{Z}} A$ such that $A^{-1}[b]_B$ is connected.
If $B \subseteq \ker_{\mathbb{Z}} A$ is a set such that $A^{-1}[b]_B$ is connected for all b, then B is a Markov basis for A.

Example: 2-way tables
Let $A : \mathbb{Z}^{k_1 \times k_2} \to \mathbb{Z}^{k_1 + k_2}$ such that
$A(u) = \Big( \sum_{j=1}^{k_2} u_{1j}, \ldots, \sum_{j=1}^{k_2} u_{k_1 j};\ \sum_{i=1}^{k_1} u_{i1}, \ldots, \sum_{i=1}^{k_1} u_{i k_2} \Big)$
= vector of row and column sums of u.
$\ker_{\mathbb{Z}}(A) = \{ u \in \mathbb{Z}^{k_1 \times k_2} : \text{row and column sums of } u \text{ are } 0 \}$
Markov basis consists of the $2 \binom{k_1}{2} \binom{k_2}{2}$ moves like:
$\begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & -1 \\ -1 & 0 & 1 \end{pmatrix}$

3-way tables
Let $A : \mathbb{Z}^{k_1 \times k_2 \times k_3} \to \mathbb{Z}^{k_1 k_2 + k_1 k_3 + k_2 k_3}$ be the linear transformation such that
$A(u) = \Big( \big( \sum_{i_3} u_{i_1 i_2 i_3} \big)_{i_1, i_2};\ \big( \sum_{i_2} u_{i_1 i_2 i_3} \big)_{i_1, i_3};\ \big( \sum_{i_1} u_{i_1 i_2 i_3} \big)_{i_2, i_3} \Big)$
= all 2-way margins of the 3-way table u
= all line sums of u.
Markov basis depends on $k_1, k_2, k_3$, and contains moves like:
[3-way table move shown on the slide]
but also non-obvious moves like:
[3-way table move shown on the slide]
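The random walk over 2-way tables with fixed margins can be sketched in a few lines. This is an illustrative sketch only: the starting table, the function names and the fixed seed are assumptions, and the walk simply stays put whenever a proposed move would create a negative entry. A Fisher's exact test would additionally need an acceptance step targeting the appropriate conditional distribution, which is omitted here.

```python
import random

def random_walk(table, steps, seed=0):
    """Walk on nonnegative integer tables using the basic 2x2 swap moves."""
    rng = random.Random(seed)
    t = [row[:] for row in table]
    k1, k2 = len(t), len(t[0])
    for _ in range(steps):
        i1, i2 = rng.sample(range(k1), 2)   # two distinct rows
        j1, j2 = rng.sample(range(k2), 2)   # two distinct columns
        # Move: +1 at (i1,j1),(i2,j2) and -1 at (i1,j2),(i2,j1).
        # Reject it if an entry would become negative.
        if t[i1][j2] > 0 and t[i2][j1] > 0:
            t[i1][j1] += 1
            t[i2][j2] += 1
            t[i1][j2] -= 1
            t[i2][j1] -= 1
    return t

def margins(t):
    """Row sums and column sums of a table."""
    return [sum(row) for row in t], [sum(col) for col in zip(*t)]

start = [[2, 2, 2], [1, 0, 2]]   # hypothetical starting table
end = random_walk(start, 1000)
```

Every move lies in $\ker_{\mathbb{Z}} A$, so `margins(end)` always equals `margins(start)`; the Markov basis property is what guarantees that every table in the fiber can be reached.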

Fundamental Theorem of Markov Bases
Let $A : \mathbb{Z}^n \to \mathbb{Z}^d$. The toric ideal $I_A$ is the ideal
$\langle p^u - p^v : u, v \in \mathbb{N}^n,\ Au = Av \rangle \subseteq K[p_1, \ldots, p_n]$,
where $p^u = p_1^{u_1} p_2^{u_2} \cdots p_n^{u_n}$.
Theorem (Diaconis-Sturmfels 1998)
The set of moves $B \subseteq \ker_{\mathbb{Z}} A$ is a Markov basis for A if and only if the set of binomials $\{ p^{b^+} - p^{b^-} : b \in B \}$ generates $I_A$.
$\begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & -1 \\ -1 & 0 & 1 \end{pmatrix} \longrightarrow p_{21} p_{33} - p_{23} p_{31}$

Toric Varieties = Log-linear Models
The variety $V_A = V(I_A)$ is a toric variety.
The statistical model $M_A = V(I_A) \cap \Delta_m$ is a log-linear model.
$M_A = \{ p \in \Delta_m : \log p \in \operatorname{rowspan} A \}$.
Fisher's exact test: Does the data u fit the model $M_A$? ($u / |u|$)

2-way tables: Independence
$\begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & -1 \\ -1 & 0 & 1 \end{pmatrix} \longrightarrow p_{21} p_{33} - p_{23} p_{31} = \begin{vmatrix} p_{21} & p_{23} \\ p_{31} & p_{33} \end{vmatrix}$
$I_A = \Big\langle 2 \times 2 \text{ minors of } \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1 k_2} \\ p_{21} & p_{22} & \cdots & p_{2 k_2} \\ \vdots & \vdots & \ddots & \vdots \\ p_{k_1 1} & p_{k_1 2} & \cdots & p_{k_1 k_2} \end{pmatrix} \Big\rangle$
$V_A = V(I_A) = \{ P \in \mathbb{R}^{k_1 \times k_2} : \operatorname{rank} P \le 1 \}$
$M_A = V_A \cap \Delta_{k_1 k_2} = M_{X_1 \perp\!\!\!\perp X_2}$

Computing Markov Bases
Software
4ti2 www.4ti2.de
Macaulay2 (4ti2 interface) http://www.math.uiuc.edu/macaulay2/
Singular (toric package) http://www.singular.uni-kl.de/
Theory
Gluing Results
Finiteness Theorems
Special Configurations

"No Hope" Theorem
Theorem (De Loera-Onn (2006))
Every integer vector appears as part of a minimal Markov basis element for $3 \times k_2 \times k_3$ tables (with fixed 2-way margins). In particular, minimal Markov basis elements for 3-way tables can have arbitrarily large entries and arbitrarily large 1-norm.
Example ($3 \times 4 \times 6$ tables)
For $3 \times 4 \times 6$ tables, the minimal Markov basis has 355950 elements. The largest element has 1-norm 28.

Which Fibers are Connected?
Let $B \subseteq \ker_{\mathbb{Z}} A$. For which b is $A^{-1}[b]_B$ connected?
When do $u, v \in A^{-1}[b]$ belong to the same component of $A^{-1}[b]_B$?
Example ($2 \times 3$)
$B = \left\{ \begin{pmatrix} 1 & -1 & 0 \\ -1 & 1 & 0 \end{pmatrix}, \begin{pmatrix} 0 & 1 & -1 \\ 0 & -1 & 1 \end{pmatrix} \right\}$
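The correspondence between moves and binomials used in the Fundamental Theorem of Markov Bases is mechanical: positive entries of a move give the exponents of $p^{b^+}$, negative entries those of $p^{b^-}$. A minimal sketch of that translation for 2-way tables follows; the function names and the string format are illustrative choices, not notation from the slides.

```python
def move_to_binomial(move):
    """Split a move into its positive and negative parts.

    Returns two lists of (variable, exponent) pairs for p^{b+} and p^{b-},
    with variables named p<row><col> using 1-based indices.
    """
    plus, minus = [], []
    for i, row in enumerate(move, start=1):
        for j, entry in enumerate(row, start=1):
            if entry > 0:
                plus.append((f"p{i}{j}", entry))
            elif entry < 0:
                minus.append((f"p{i}{j}", -entry))
    return plus, minus

def format_binomial(move):
    """Render the binomial p^{b+} - p^{b-} as a plain string."""
    def monomial(factors):
        return "*".join(v if e == 1 else f"{v}^{e}" for v, e in factors) or "1"
    plus, minus = move_to_binomial(move)
    return f"{monomial(plus)} - {monomial(minus)}"

# The 3x3 move shown next to the Fundamental Theorem.
move = [[0, 0, 0],
        [1, 0, -1],
        [-1, 0, 1]]
binomial = format_binomial(move)
```

For this move the result is `p21*p33 - p23*p31`, the same binomial displayed on the slide.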

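Enter Commutative Algebra
Let $K[p] := K[p_1, \ldots, p_n]$.
To each $m \in B$ associate a binomial $p^{m^+} - p^{m^-} \in K[p]$ where $m = m^+ - m^-$, $p^m = p_1^{m_1} \cdots p_n^{m_n}$.
Proposition
Let $B \subseteq \ker_{\mathbb{Z}} A$. Then $u, v \in A^{-1}[b]$ are in the same component of $A^{-1}[b]_B$ if and only if $p^u - p^v \in I_B := \langle p^{m^+} - p^{m^-} : m \in B \rangle$.
Theorem (Diaconis-Sturmfels (1998))
A set of moves $B \subseteq \ker_{\mathbb{Z}} A$ is a Markov basis if and only if
$I_B = I_A := \langle p^u - p^v : u, v \in \mathbb{N}^n,\ Au = Av \rangle$.

Lattice Walks and Primary Decomposition
(Diaconis-Eisenbud-Sturmfels 1998)
Decompose the ideal $I_B = \bigcap_i I_i$.
$p^u - p^v \in I_B \iff p^u - p^v \in I_i$ for all i.
Hope that the ideals $I_i$ are easier to analyze.
Theorem (Eisenbud-Sturmfels 1996)
Every binomial ideal has a binomial primary decomposition.
Dickenstein-Matusevich-Miller, Kahle-Miller (Mesoprimary decomposition)
Algorithms implemented in binomials.m2 (Kahle 2010)

$2 \times 3$ tables
$B = \left\{ \begin{pmatrix} 1 & -1 & 0 \\ -1 & 1 & 0 \end{pmatrix}, \begin{pmatrix} 0 & 1 & -1 \\ 0 & -1 & 1 \end{pmatrix} \right\}$
$I_B = \left\langle \begin{vmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{vmatrix}, \begin{vmatrix} p_{12} & p_{13} \\ p_{22} & p_{23} \end{vmatrix} \right\rangle$
$\phantom{I_B} = \left\langle \begin{vmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{vmatrix}, \begin{vmatrix} p_{12} & p_{13} \\ p_{22} & p_{23} \end{vmatrix}, \begin{vmatrix} p_{11} & p_{13} \\ p_{21} & p_{23} \end{vmatrix} \right\rangle \cap \langle p_{12}, p_{22} \rangle = I_A \cap \langle p_{12}, p_{22} \rangle$
$\begin{pmatrix} u_{11} & u_{12} & u_{13} \\ u_{21} & u_{22} & u_{23} \end{pmatrix}$ and $\begin{pmatrix} v_{11} & v_{12} & v_{13} \\ v_{21} & v_{22} & v_{23} \end{pmatrix}$ are connected by B if and only if they have the same row and column sums and $u_{12} + u_{22} = v_{12} + v_{22} > 0$.

Graphical Models
G a graph with N vertices.
$d \in \mathbb{Z}^N$, $d_i \ge 2$.
Gives a set of margins of a $d_1 \times d_2 \times \cdots \times d_N$ array.
$C(G)$ = set of maximal cliques in G.
Let $A_{G,d} : \mathbb{Z}^{d_1 \times \cdots \times d_N} \to \mathbb{Z}^k$ be the linear map that computes the margins associated to all $C \in C(G)$ of a $d_1 \times \cdots \times d_N$ array.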
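The connectivity claim for $2 \times 3$ tables can be checked by brute force on small fibers. The sketch below is an illustration under assumed margins, with made-up helper names: it enumerates all nonnegative $2 \times 3$ tables with given row and column sums and counts connected components of the fiber graph when only the two adjacent-column moves are allowed.

```python
from itertools import product

# The two adjacent-column moves of the 2x3 example, as (top row, bottom row).
MOVES = [
    ((1, -1, 0), (-1, 1, 0)),
    ((0, 1, -1), (0, -1, 1)),
]

def fiber(row_sums, col_sums):
    """All nonnegative integer 2x3 tables with the given margins."""
    tables = []
    for top in product(*(range(c + 1) for c in col_sums)):
        bottom = tuple(c - t for c, t in zip(col_sums, top))
        if sum(top) == row_sums[0] and sum(bottom) == row_sums[1]:
            tables.append((top, bottom))
    return tables

def components(tables, moves):
    """Connected components of the graph with edges u ~ v iff u - v in +-moves."""
    unseen, comps = set(tables), []
    while unseen:
        stack = [unseen.pop()]
        comp = set(stack)
        while stack:
            u = stack.pop()
            for mv in moves:
                for sign in (1, -1):
                    v = tuple(tuple(x + sign * y for x, y in zip(u_row, m_row))
                              for u_row, m_row in zip(u, mv))
                    if v in unseen:          # v is in the fiber and not yet visited
                        unseen.remove(v)
                        comp.add(v)
                        stack.append(v)
        comps.append(comp)
    return comps

# Positive margins: the fiber is connected.
n_positive = len(components(fiber((2, 1), (1, 1, 1)), MOVES))
# Middle column sum zero: the two tables in the fiber cannot reach each other.
n_zero_column = len(components(fiber((1, 1), (1, 0, 1)), MOVES))
```

The first fiber has a single component, while the second splits into two, in line with the condition $u_{12} + u_{22} > 0$.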

Example (Row and Column Sums)
$A_{G,d} : \mathbb{Z}^{d_1 \times d_2} \to \mathbb{Z}^{d_1 + d_2}$
$(u_{ij})_{i,j} \mapsto \Big( \big( \sum_j u_{ij} \big)_i,\ \big( \sum_i u_{ij} \big)_j \Big)$

Example (Path)
$A_{G,d} : \mathbb{Z}^{d_1 \times d_2 \times d_3} \to \mathbb{Z}^{d_1 d_2 + d_1 d_3}$
$(u_{ijk})_{i,j,k} \mapsto \Big( \big( \sum_k u_{ijk} \big)_{i,j},\ \big( \sum_j u_{ijk} \big)_{i,k} \Big)$
$d = (2, 2, 3)$
$A_{G,d} = \begin{pmatrix}
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 1
\end{pmatrix}$
$u = (u_{111}, u_{112}, u_{113}, u_{121}, u_{122}, u_{123}, u_{211}, u_{212}, u_{213}, u_{221}, u_{222}, u_{223})$

Example (4-cycle)
$A_{G,d} : \mathbb{Z}^{d_1 \times d_2 \times d_3 \times d_4} \to \mathbb{Z}^{d_1 d_2 + d_1 d_3 + d_2 d_4 + d_3 d_4}$
[Figure: 4-cycle on vertices 1, 2, 3, 4]
$C(G) = \{ \{1, 2\}, \{1, 3\}, \{2, 4\}, \{3, 4\} \}$
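The margin map $A_{G,d}$ in these examples can be written down as a 0/1 matrix once an ordering of cells and margins is fixed. The sketch below assumes lexicographic ordering of cells (as in the vector u above) and lists margins clique by clique; `margin_matrix` is a hypothetical helper name, not something defined in the slides.

```python
from itertools import product

def margin_matrix(d, cliques):
    """0/1 matrix of A_{G,d}.

    Columns: cells of the d_1 x ... x d_N array in lexicographic order.
    Rows: for each clique (1-based vertex labels), one row per assignment
    of indices to the clique's vertices.
    """
    cells = list(product(*(range(1, di + 1) for di in d)))
    rows = []
    for clique in cliques:
        positions = [v - 1 for v in clique]
        for values in product(*(range(1, d[p] + 1) for p in positions)):
            rows.append([
                int(all(cell[p] == val for p, val in zip(positions, values)))
                for cell in cells
            ])
    return rows

# Path example with d = (2, 2, 3): margins over (i, j) and over (i, k).
A_path = margin_matrix((2, 2, 3), [(1, 2), (1, 3)])

# 4-cycle with binary variables, cliques as listed in the example.
A_c4 = margin_matrix((2, 2, 2, 2), [(1, 2), (1, 3), (2, 4), (3, 4)])
```

Under these ordering assumptions `A_path` is the $10 \times 12$ matrix of the path example, and `A_c4` has 16 rows and 16 columns.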

Separating Moves (Conditional Independence)
Let A, B, C partition $V(G)$ such that C separates A and B in G.
Get moves
$e_{i_A i_B i_C} + e_{j_A j_B i_C} - e_{i_A j_B i_C} - e_{j_A i_B i_C}$
in $\ker_{\mathbb{Z}} A_{G,d}$, where $i_A, j_A \in \prod_{t \in A} [d_t]$, $i_B, j_B \in \prod_{t \in B} [d_t]$, $i_C \in \prod_{t \in C} [d_t]$.
These moves naturally generalize $\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$ for 2-way tables.
$CI(G)$ is the set of all separating moves.
Example (4-cycle)
[Figure: 4-cycle on vertices 1, 2, 3, 4]
$e_{i_1 i_2 i_3 i_4} + e_{j_1 i_2 i_3 j_4} - e_{i_1 i_2 i_3 j_4} - e_{j_1 i_2 i_3 i_4}$
$e_{i_1 i_2 i_3 i_4} + e_{i_1 j_2 j_3 i_4} - e_{i_1 i_2 j_3 i_4} - e_{i_1 j_2 i_3 i_4}$

Which Fibers Do CI(G) Moves Connect?
Proposition (Hammersley-Clifford, Besag (1974))
$CI(G)$ spans $\ker_{\mathbb{Z}} A_{G,d}$ for all G.
Theorem (Dobra (2002), Geiger, Meek, Sturmfels (2006))
Separating moves $CI(G)$ are a Markov basis for $A_{G,d}$ if and only if G is a chordal graph.
1. Which fibers $A_{G,d}^{-1}[b]$ are connected by $CI(G)$ for other graphs?
2. What is the primary decomposition of $I_{CI(G)}$?

Computational Results
Theorem (Kahle-Rauh-S (2012))
Let $\#V(G) = n \le 5$, $d_i = 2$ for all i. Then
$I_{CI(G)}$ is radical.
$A_{G,d}^{-1}[b]_{CI(G)}$ is connected if b is in the interior of the marginal cone.
$A_{G,d}^{-1}[b]_{CI(G)}$ is connected if b is positive (except for $G = K_{2,3}$).
Every prime component of $I_B$ is of the form $P_S = \langle p_i : i \in S \rangle + I_{A_S}$.
Form the vector $u_S := \sum_{i \notin S} e_i$.
Check if $A u_S$ is on the boundary of the marginal cone for all prime components.
If so, B has the interior point property.

$2 \times 3$ tables
$B = \left\{ \begin{pmatrix} 1 & -1 & 0 \\ -1 & 1 & 0 \end{pmatrix}, \begin{pmatrix} 0 & 1 & -1 \\ 0 & -1 & 1 \end{pmatrix} \right\}$
$I_B = \left\langle \begin{vmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{vmatrix}, \begin{vmatrix} p_{12} & p_{13} \\ p_{22} & p_{23} \end{vmatrix} \right\rangle = I_A \cap \langle p_{12}, p_{22} \rangle$
Analyze the monomial ideal $P_S = \langle p_{12}, p_{22} \rangle$.
$u_S = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \end{pmatrix}$
$u_S$ has a zero column sum $\Rightarrow$ all fibers with positive margins (row and column sums) are connected.

Theoretical Results
Proposition (Kahle-Rauh-S (2012))
If $G = G_1 \# G_2$ is a clique sum, then
If $I_{CI(G_1)}$ and $I_{CI(G_2)}$ are radical, so is $I_{CI(G)}$.
If $G_1$ and $G_2$ satisfy the interior point property, so does G.
If $G_1$ and $G_2$ satisfy the positive margins property, so does G.
[Figure: clique sum of two graphs, "+" and "="]
Theorem (Kahle-Rauh-S (2012))
1. For the cycle $C_n$, $I_{CI(C_n)}$ is radical, when $d_i = 2$ for all i.
2. For $K_{2,n}$ with $d_1 = d_2 = 2$, $I_{CI(K_{2,n})}$ is radical.
3. The interior point property holds in both situations.

Proof Ideas
Find minimal primes for $I_{CI(G)}$. All are binomial ideals.
Let $J = \sqrt{I_{CI(G)}} = I_{A_{G,d}} \cap \bigcap_{i=1}^{k} P_i$.
Let u, v be such that $A_{G,d} u = A_{G,d} v$, so $p^u - p^v \in I_A$.
Connect u and v using Markov basis moves of $A_{G,d}$.
Show that $p^u - p^v \in P_i$ for all i implies we can shortcut moves with $CI(G)$ moves.
Deduce that $J = I_{CI(G)}$.
Depends on having a Markov basis of $A_{G,d}$, which is obtained in these cases via the toric fiber product. (Engström, Kahle, S 2011)

Questions
Question
Is $I_{CI(G)}$ radical for all G, d?
Does the interior point property hold for all G, d?
Theorem
If there are $n - 2$ mutually orthogonal $d \times d$ latin squares, then for any 2-connected, triangle free graph G on n nodes, and $d_i = d$ for all i, the interior point property does not hold for $(G, d)$.
For $C_4$ and $d = (3, 3, 3, 3)$ this gives failure of the interior point property.
Radicality fails for $K_{3,3}$ and $d = (2, 2, 2, 2, 2, 2)$.

Summary
Many statistical problems require the construction of random walks over the lattice points in a polytope.
A Markov basis provides connectivity for all b.
If a Markov basis is too hard to compute, one can ask: Which fibers are connected by a natural set of moves?
Binomial primary decomposition gives information about connectivity of fibers with a subset of a Markov basis.
Computational and theoretical advances allow us to make progress on graphical models.

Problems
[Figure: 4-cycle on vertices 1, 2, 3, 4]
1. Let $d = (2, 2, 2, 2)$. Construct the $16 \times 16$ matrix $A_{C_4,d}$.
2. List the elements of $CI(C_4)$.
3. Use 4ti2, Macaulay2, or Singular to compute the Markov basis of $C_4$.

References
J. Besag. Spatial Interaction and the Statistical Analysis of Lattice Systems. Journal of the Royal Statistical Society, Series B, 36 (2) (1974) 192-236.
J. De Loera, S. Onn. Markov bases of three-way tables are arbitrarily complicated. Journal of Symbolic Computation 41:173-181, 2006.
P. Diaconis, D. Eisenbud, B. Sturmfels. Lattice walks and primary decomposition. Mathematical Essays in Honor of Gian-Carlo Rota, eds. B. Sagan and R. Stanley, Progress in Mathematics, Vol. 161, Birkhauser, Boston, 1998, pp. 173-193.
P. Diaconis and B. Sturmfels. Algebraic algorithms for sampling from conditional distributions. Annals of Statistics 26 (1998) 363-397.
A. Dickenstein, L. Matusevich, E. Miller. Combinatorics of binomial primary decomposition. arXiv:0803.3846
A. Dobra. Markov bases for decomposable graphical models. Bernoulli 9 No. 6 (2003) 1-16.
D. Eisenbud, B. Sturmfels. Binomial ideals. Duke Mathematical Journal 84 (1996) 1-45.
A. Engström, T. Kahle, S. Sullivant. Multigraded commutative algebra of graph decompositions. (2011) arXiv:1102.2601
D. Geiger, C. Meek, B. Sturmfels. On the toric algebra of graphical models. Annals of Statistics 34 (2006) 1463-1492.
T. Kahle, E. Miller. Decompositions of commutative monoid congruences and binomial ideals. arXiv:1107.4699
T. Kahle, J. Rauh, S. Sullivant. Positive margins and primary decomposition. (2012) arXiv:1201.2591