Introduction to Algebraic Statistics

Similar documents
Phylogenetic Algebraic Geometry

Algebraic Statistics Tutorial I

Example: Hardy-Weinberg Equilibrium. Algebraic Statistics Tutorial I. Phylogenetics. Main Point of This Tutorial. Model-Based Phylogenetics

Phylogenetic invariants versus classical phylogenetics

Open Problems in Algebraic Statistics

Species Tree Inference using SVDquartets

The Generalized Neighbor Joining method

Using algebraic geometry for phylogenetic reconstruction

Jed Chou. April 13, 2015

ALGEBRAIC STATISTICAL MODELS

mathematics Smoothness in Binomial Edge Ideals Article Hamid Damadi and Farhad Rahmati *

Maximum Likelihood Estimates for Binary Random Variables on Trees via Phylogenetic Ideals

PHYLOGENETIC ALGEBRAIC GEOMETRY

The Maximum Likelihood Threshold of a Graph

Naïve Bayes classification

19 Tree Construction using Singular Value Decomposition

arxiv: v1 [math.st] 22 Jun 2018

The maximum likelihood degree of rank 2 matrices via Euler characteristics

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Readings: K&F: 16.3, 16.4, Graphical Models Carlos Guestrin Carnegie Mellon University October 6 th, 2008

Toric Fiber Products

Metric learning for phylogenetic invariants

Phylogenetic Geometry

ALGEBRAIC DEGREE OF POLYNOMIAL OPTIMIZATION. 1. Introduction. f 0 (x)

Identifiability of the GTR+Γ substitution model (and other models) of DNA evolution

Lecture 4 September 15

Geometry of Phylogenetic Inference

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Algebraic Invariants of Phylogenetic Trees

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf

Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010

Introductory Notes to Algebraic Statistics

Algebraic Complexity in Statistics using Combinatorial and Tensor Methods

CPSC 340: Machine Learning and Data Mining

Diffusion Models in Population Genetics

Maximum likelihood for dual varieties

Phylogenetics: Likelihood

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

Naïve Bayes. Jia-Bin Huang. Virginia Tech Spring 2019 ECE-5424G / CS-5824

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

The Geometry of Semidefinite Programming. Bernd Sturmfels UC Berkeley

The intersection axiom of

What is Singular Learning Theory?

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Bayesian Learning (II)

Solving the 100 Swiss Francs Problem

Quartet Inference from SNP Data Under the Coalescent Model

Modeling Environment

A Potpourri of Nonlinear Algebra

Is Every Secant Variety of a Segre Product Arithmetically Cohen Macaulay? Oeding (Auburn) acm Secants March 6, / 23

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

When does a mixture of products contain a product of mixtures?

Likelihood, MLE & EM for Gaussian Mixture Clustering. Nick Duffield Texas A&M University

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Introduction to Probability and Statistics (Continued)

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.

A new parametrization for binary hidden Markov modes

PROBLEMS FOR VIASM MINICOURSE: GEOMETRY OF MODULI SPACES LAST UPDATED: DEC 25, 2013

Bayesian Methods. David S. Rosenberg. New York University. March 20, 2018

Lecture 2: Simple Classifiers

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Semidefinite Programming

ADVANCED TOPICS IN ALGEBRAIC GEOMETRY

Testing Algebraic Hypotheses

Phylogeny of Mixture Models

Bayesian Methods: Naïve Bayes

COMP90051 Statistical Machine Learning

Phylogenetics: Parsimony and Likelihood. COMP Spring 2016 Luay Nakhleh, Rice University

Clustering K-means. Machine Learning CSE546. Sham Kakade University of Washington. November 15, Review: PCA Start: unsupervised learning

Total binomial decomposition (TBD) Thomas Kahle Otto-von-Guericke Universität Magdeburg

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf

Review: Probability. BM1: Advanced Natural Language Processing. University of Potsdam. Tatjana Scheffler

Probability and Estimation. Alan Moses

Geometry of Gaussoids

Algebraic Statistics. Seth Sullivant. North Carolina State University address:

Bayesian Analysis for Natural Language Processing Lecture 2

Parametric Models: from data to models

Maximum likelihood estimation

Graphical Models for Collaborative Filtering

DS-GA 1002 Lecture notes 11 Fall Bayesian statistics

BMI/CS 776 Lecture 4. Colin Dewey

Lecture 21: Spectral Learning for Graphical Models

Sheaf cohomology and non-normal varieties

Probabilistic Graphical Models

ELIZABETH S. ALLMAN and JOHN A. RHODES ABSTRACT 1. INTRODUCTION

Overview of Course. Nevin L. Zhang (HKUST) Bayesian Networks Fall / 58

EVOLUTIONARY DISTANCES

Probabilistic Machine Learning

Learning Bayesian network : Given structure and completely observed data

Likelihood Geometry. June Huh and Bernd Sturmfels

Binomial Exercises A = 1 1 and 1

STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Probabilistic Systems Analysis Spring 2018 Lecture 6. Random Variables: Probability Mass Function and Expectation

Maximum Likelihood Estimation in Latent Class Models for Contingency Table Data

Outline. Limits of Bayesian classification Bayesian concept learning Probabilistic models for unsupervised and semi-supervised category learning

ALGEBRA: From Linear to Non-Linear. Bernd Sturmfels University of California at Berkeley

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Density Estimation. Seungjin Choi

Transcription:

Introduction to Algebraic Statistics Seth Sullivant North Carolina State University January 5, 2017 Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 1 / 28

What is Algebraic Statistics? Observation Statistical models are semialgebraic sets. Tools from algebraic geometry can uncover properties of statistical models. Algebro-geometric structure leads to new methods to analyze data. Algebraic properties of the model explain properties of inference procedures. Theoretical questions about statistical models lead to challenging problems in algebraic geometry. Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 2 / 28

Semialgebraic Sets R[p 1,..., p r ] = ring of polynomials in indeterminates p 1,..., p r with coefficients in R. Definition A basic semialgebraic set is the solution to finitely many polynomial equations and inequalities over R. That is, a set of the form: {a R r : f (a) = 0 for all f F, g(a) > 0 for all g G} where F, G R[p 1,..., p r ] are finite. A semialgebraic set is the union of finitely many basic semialgebraic sets. Semialgebraic sets are the basic objects of study in real algebraic geometry. Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 3 / 28

Semialgebraic Sets M 2 = {(p 0, p 1, p 2 ) R 3 : p 0 0, p 1 0, p 2 0, p 0 + p 1 + p 2 = 1, 4p 0 p 2 = p 2 1 } Theorem (Tarski-Seidenberg) The set of semialgebraic sets is closed under rational maps. That is, if S R r is a semialgebraic set, and φ : R r R s is a rational map then φ(s) R s is a semialgebraic set. Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 4 / 28

Statistical Models Definition A statistical model is a set of probability distributions. For this definition to be useful for us: Focus on a fixed state space. The set of distributions should be related to each other in a meaningful way. To keep things simple for this talk: State space will always be a finite set, like [r] := {1, 2,..., r}. In this case a probability distribution is just a point p R r : p = (p 1,..., p r ) = (P(X = 1),..., P(X = r)) R r satisfying r i=1 p i = 1 and p i 0 for all i [r]. Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 5 / 28

Example: Binomial Random Variable Model We have a biased coin with probability θ of getting a head. Flip the coin r times, and count the number of heads X. ( ) r P(X = i) = θ i (1 θ) r i i Binomial probability distribution is the point in R r+1. ((1 θ) r, rθ(1 θ) r 1,..., rθ r 1 (1 θ), θ r ) The binomial random variable model is the set {( ) } M r := (1 θ) r, rθ(1 θ) r 1,..., rθ r 1 (1 θ), θ r : θ [0, 1] M 2 = {(p 0, p 1, p 2 ) R 3 : p 0 0, p 1 0, p 2 0, p 0 + p 1 + p 2 = 1, 4p 0 p 2 = p1 2 } Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 6 / 28

Independence Model Definition Discrete random variables X and Y are independent if for all i and j Denote this X Y. P(X = i, Y = j) = P(X = i) P(Y = j) (1) Fix state space of X and Y as [r] and [s]. The model of independence M X Y Rr s is the set of all distribution satisfying (1). p 11 p 1s M X Y = p =..... : p satisfies (1) p r1 p rs Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 7 / 28

Independence Model Note that: means that p 11 p 1s p =..... p r1 p rs P(X = i, Y = j) = P(X = i) P(Y = j) = P(X = 1). P(X = r) ( P(Y = 1) P(Y = s) ) Proposition The independence model M X Y is the semialgebraic set {p R r s : p ij 0 for all i, j pij = 1, rank p = 1} Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 8 / 28

Statistical Models are Semialgebraic Sets Most statistical models are given parametrically as images of rational functions, where the domain is a polyhedron. φ : [0, 1] R r+1, θ ((1 ) θ) r, rθ(1 θ) r 1,..., rθ r 1 (1 θ), θ r Tarski-Seidenberg Theorem guarantees these models are semialgebraic sets. Problems to study with any family of models: Characterize the model implicitly. Characterize model properties. Use the algebraic structure to analyze procedures based on the model. Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 9 / 28

Phylogenetics Problem Given a collection of species, find the tree that explains their history. Data consists of aligned DNA sequences from homologous genes Human: Chimp: Gorilla: Seth Sullivant (NCSU)...ACCGTGCAACGTGAACGA......ACCTTGGAAGGTAAACGA......ACCGTGCAACGTAAACTA... Algebraic Statistics January 5, 2017 10 / 28

Model-Based Phylogenetics Use a probabilistic model of mutations Parameters for the model are the combinatorial tree T, and rate parameters for mutations on each edge Models give a probability for observing a particular aligned collection of DNA sequences Human: Chimp: Gorilla: ACCGTGCAACGTGAACGA ACGTTGCAAGGTAAACGA ACCGTGCAACGTAAACTA Assuming site independence, data is summarized by empirical distribution of columns in the alignment. e.g. ˆp(AAA) = 6 18, ˆp(CGC) = 2 18, etc. Use empirical distribution and test statistic to find tree best explaining data Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 11 / 28

Phylogenetic Models Assuming site independence: Phylogenetic Model is a latent class graphical model Leaf v T is random variable X v {A, C, G, T}. Internal nodes v T are latent random variables Y v X 1 X 2 X 3 Y 2 Y 1 P(x 1, x 2, x 3 ) = y 1 y 2 P 1 (y 1 )P 2 (y 2 y 1 )P 3 (x 1 y 1 )P 4 (x 2 y 2 )P 5 (x 3 y 2 ) Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 12 / 28

Phylogenetic Models Assuming site independence: Phylogenetic Model is a latent class graphical model Leaf v T is random variable X v {A, C, G, T}. Internal nodes v T are latent random variables Y v X 1 X 2 X 3 Y 2 Y 1 p i1 i 2 i 3 = j 1 j 2 π j1 a j2,j 1 b i1,j 1 c i2,j 2 d i3,j 2 Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 13 / 28

Algebraic Perspective on Phylogenetic Models Once we fix a tree T with n leaves and model structure, we get a map φ T : Θ R 4n. Θ R d is a parameter space of numerical parameters (transition matrices associated to each edge). The map φ T is given by polynomial functions of the parameters. For each i 1 i n {A, C, G, T } n, φ T i 1 i n (θ) gives the probability of the column (i 1,..., i n ) in the alignment for the particular parameter choice θ. φ T i 1 i 2 i 3 (π, a, b, c, d) = π j1 a j2,j 1 b i1,j 1 c i2,j 2 d i3,j 2 j 1 j 2 The phylogenetic model is the set M T = φ T (Θ) R 4n. Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 14 / 28

Phylogenetic Varieties and Phylogenetic Invariants Let R[p] := R[p i1 i n : i 1 i n {A, C, G, T } n ] Definition Let I T := f R[p] : f (p) = 0 for all p M T R[p]. I T is the ideal of phylogenetic invariants of T. Let V T := {p R 4n : f (p) = 0 for all f I T }. V T is the phylogenetic variety of T. Note that M T V T. Since M T is image of a polynomial map dim M T = dim V T. Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 15 / 28

X 1 X X X 2 3 4 p lmno = T T T π i a ij b ik c jl d jm e kn f ko i=a j=a k=a Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 16 / 28

X 1 X X X 2 3 4 p lmno = T T T π i a ij b ik c jl d jm e kn f ko i=a j=a k=a T a ij c jl d jm j=a Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 16 / 28

X 1 X X X 2 3 4 p lmno = T T T π i a ij b ik c jl d jm e kn f ko i=a j=a k=a T T a ij c jl d jm b ik e kn f ko j=a k =A Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 16 / 28

X 1 X X X 2 3 4 p lmno = = T T T π i a ij b ik c jl d jm e kn f ko i=a j=a k=a T T T a ij c jl d jm b ik e kn f ko π i i=a j=a k =A Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 16 / 28

X 1 X X X 2 3 4 p lmno = = T T T π i a ij b ik c jl d jm e kn f ko i=a j=a k=a T T T a ij c jl d jm b ik e kn f ko π i i=a j=a k =A p AAAA p AAAC p AATT p = rank ACAA p ACAC p ACTT...... 4 p TTAA p TTAC p TTTT Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 16 / 28

Splits and Phylogenetic Invariants Definition A split of a set is a bipartition A B. A split A B of the leaves of a tree T is valid for T if the induced trees T A and T B do not intersect. X 1 X X X 2 3 4 Valid: 12 34 Not Valid: 13 24 Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 17 / 28

2-way Flattenings and Matrix Ranks Proposition Let P M T. p ijkl = P(X 1 = i, X 2 = j, X 3 = k, X 4 = l) p AAAA p AAAC p AAAG p AATT p ACAA p ACAC p ACAG p ACTT Flat 12 34 (P) =....... p TTAA p TTAC p TTAG p TTTT If A B is a valid split for T, then rank(flat A B (P)) 4. Invariants in I T are subdeterminants of Flat A B (P). If C D is not a valid split for T, then generically rank(flat C D (P)) > 4. Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 18 / 28

Using Matrix Flattenings Proposition The set of valid splits for a given tree uniquely determine the tree. Proposition Let P M T. If A B is a valid split for T, then rank(flat A B (P)) 4. Invariants in I T are subdeterminants of Flat A B (P). If C D is not a valid split for T, then generically rank(flat C D (P)) > 4. Can use flattening invariants to reconstruct phylogenetic trees (Eriksson 2005, Chifman-Kubatko 2014, Allman-Kubatko-Rhodes 2016) via the SVD. Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 19 / 28

Identifiability of Phylogenetic Models Definition A parametric statistical model is identifiable if it gives 1-to-1 map from parameters to probability distributions. Is it possible to infer the parameters of the model from data? Identifiability guarantees consistency of statistical methods (ML) Two types of parameters to consider for phylogenetic models: Numerical parameters (transition matrices) Tree parameter (combinatorial type of tree) Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 20 / 28

Generic Identifiability Definition The tree parameter in a phylogenetic model is generically identifiable if for all n-leaf trees with T T, dim(m T M T ) < min(dim(m T ), dim(m T )). M1 M2 M3 Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 21 / 28

Proving Identifiability with Algebraic Geometry Proposition Let M 0 and M 1 be two algebraic models. If there exist phylogenetically informative invariants f 0 and f 1 such that f i (p) = 0 for all p M i, and f i (q) 0 for some q M 1 i, then dim(m 0 M 1 ) < min(dim M 0, dim M 1 ). M0 f = 0 q M 1 Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 22 / 28

Phylogenetic Models are Identifiable Theorem The unrooted tree parameter of phylogenetic models is generically identifiable. Proof. Edge flattening invariants can detect which splits are implied by a specific distribution in M T. The splits in T uniquely determine T. These methods can be generalized to much more complicated settings: Phylogenetic mixture models (Allman-Petrovic-Rhodes-Sullivant 2009, Rhodes-Sullivant 2012, Long-Sullivant 2015) Species tree problems (Allman-Degnan-Rhodes 2011, Chifman-Kubatko 2014) K-mer methods (Allman-Rhodes-Sullivant 2016, Durden-Sullivant 201?) Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 23 / 28

Maximum Likelihood Estimation Let M R r a statistical model for discrete random variables. We collect i.i.d. data, X (1),..., X (n), we believe to be generated from a distribution in M. Let u = (u 1,..., u r ) be the vector of counts: u i = #{j : X (j) = i} Given the data u, the likelihood function is the monomial p u = p u 1 1 pu 2 2 pur r which is the probability of observing the data when P(X = i) = p i. The maximum likelihood estimate (MLE) of p given u is: argmax p u such that p M Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 24 / 28

Maximum Likelihood Degree Computing the maximum likelihood estimate argmax p u such that p M is an optimization problem of a polynomial constrained to a semialgebraic set. We can extended to the Zariski closure V = M, in which case the optimum should be a critical point of the likelihood function (for generic u). Theorem (Catanese-Hoşten-Khetan-Sturmfels 2006) The number of complex critical points of the likelihood function on V sm = V \ V sing is constant for generic u. This number if called the maximum likelihood degree of the variety V (or model M). Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 25 / 28

Independence Model Consider M X Y. Maximum likelihood estimation is: argmax i,j p u ij ij such that p M X Y But p M X Y iff there are α r, β s with p ij = α i β j. Becomes: argmax α u i+ i β u +j i such that α r, β s i j Maximum likelihood estimate is ˆp ij = u i+u +j u 2 ++ = u i+ u ++ u+j u ++ M X Y has ML-degree 1 (= MLE is rational function of the data). Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 26 / 28

ML-degree and Algebraic Geometry Theorem (Huh 2014) A variety V has maximum likelihood degree 1 if and only if it is a reduced A-discriminant. Theorem (Huh 2012) Suppose that V (C ) r is a smooth variety. Then the ML-degree of V is the reduced Euler characteristic of V (C ) r. When is a reduced A-discriminant a statistical model? Does large ML-degree affect algorithms for computing MLEs? Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 27 / 28

Further Applications of Algebraic Statistics Fisher s Exact Test (Diaconis-Sturmfels 1998) (Non)existence of maximum likelihood estimates (Fienberg) Bayesian integrals and information criteria (Watanabe 2009) Conditional independence structures (Studeny 2004) Graphical models (Lauritzen 1996) Parametric inference (Pachter-Sturmfels 2004) Learn More: Algebraic Statistics the book Draft version available at my website. Comments welcome! Seth Sullivant (NCSU) Algebraic Statistics January 5, 2017 28 / 28