Using algebraic geometry for phylogenetic reconstruction

Similar documents
Phylogenetic invariants versus classical phylogenetics

Practical Bioinformatics

PHYLOGENETIC ALGEBRAIC GEOMETRY

The Generalized Neighbor Joining method

Algebraic Statistics Tutorial I

Phylogenetic Algebraic Geometry

Advanced topics in bioinformatics

SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA

SUPPORTING INFORMATION FOR. SEquence-Enabled Reassembly of β-lactamase (SEER-LAC): a Sensitive Method for the Detection of Double-Stranded DNA

Characterization of Pathogenic Genes through Condensed Matrix Method, Case Study through Bacterial Zeta Toxin

Regulatory Sequence Analysis. Sequence models (Bernoulli and Markov models)

High throughput near infrared screening discovers DNA-templated silver clusters with peak fluorescence beyond 950 nm

Crick s early Hypothesis Revisited

Supplemental data. Pommerrenig et al. (2011). Plant Cell /tpc

Clay Carter. Department of Biology. QuickTime and a TIFF (Uncompressed) decompressor are needed to see this picture.

Introduction to Molecular Phylogeny

SSR ( ) Vol. 48 No ( Microsatellite marker) ( Simple sequence repeat,ssr),

GEOMETRY OF THE KIMURA 3-PARAMETER MODEL arxiv:math/ v1 [math.ag] 27 Feb 2007

Supplementary Information for

arxiv: v1 [q-bio.pe] 27 Oct 2011

Why do more divergent sequences produce smaller nonsynonymous/synonymous

SUPPLEMENTARY DATA - 1 -

Evolvable Neural Networks for Time Series Prediction with Adaptive Learning Interval

Phylogenetic Tree Reconstruction

Evolutionary Analysis of Viral Genomes

Nature Structural & Molecular Biology: doi: /nsmb Supplementary Figure 1

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Metric learning for phylogenetic invariants

Introduction to Algebraic Statistics

Phylogenetics: Parsimony and Likelihood. COMP Spring 2016 Luay Nakhleh, Rice University

Number-controlled spatial arrangement of gold nanoparticles with

NSCI Basic Properties of Life and The Biochemistry of Life on Earth

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

TM1 TM2 TM3 TM4 TM5 TM6 TM bp

3. Evolution makes sense of homologies. 3. Evolution makes sense of homologies. 3. Evolution makes sense of homologies

Electronic supplementary material

Table S1. Primers and PCR conditions used in this paper Primers Sequence (5 3 ) Thermal conditions Reference Rhizobacteria 27F 1492R

Algebraic Invariants of Phylogenetic Trees

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

The Mathematics of Phylogenomics

Open Problems in Algebraic Statistics

Supporting Information

SUPPLEMENTARY INFORMATION

Example: Hardy-Weinberg Equilibrium. Algebraic Statistics Tutorial I. Phylogenetics. Main Point of This Tutorial. Model-Based Phylogenetics

Phylogeny: traditional and Bayesian approaches

Geometry of Phylogenetic Inference

Supporting Information for. Initial Biochemical and Functional Evaluation of Murine Calprotectin Reveals Ca(II)-

Building a Multifunctional Aptamer-Based DNA Nanoassembly for Targeted Cancer Therapy

SUPPLEMENTARY INFORMATION

Evolutionary Models. Evolutionary Models

Modelling and Analysis in Bioinformatics. Lecture 1: Genomic k-mer Statistics

The Trigram and other Fundamental Philosophies

Supplemental Table 1. Primers used for cloning and PCR amplification in this study

Molecular Evolution and Phylogenetic Tree Reconstruction

Protein Threading. Combinatorial optimization approach. Stefan Balev.

EVOLUTIONARY DISTANCES

Probabilistic modeling and molecular phylogeny

19 Tree Construction using Singular Value Decomposition

Phylogeny of Mixture Models

Supplemental Figure 1.

ELIZABETH S. ALLMAN and JOHN A. RHODES ABSTRACT 1. INTRODUCTION

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

arxiv: v1 [q-bio.pe] 23 Nov 2017

Phylogenetics: Parsimony

Sex-Linked Inheritance in Macaque Monkeys: Implications for Effective Population Size and Dispersal to Sulawesi

Counting phylogenetic invariants in some simple cases. Joseph Felsenstein. Department of Genetics SK-50. University of Washington

Reconstruire le passé biologique modèles, méthodes, performances, limites

Reading for Lecture 13 Release v10

Supplementary Information

Evidence for Evolution: Change Over Time (Make Up Assignment)

Codon Distribution in Error-Detecting Circular Codes

The role of the FliD C-terminal domain in pentamer formation and

A new parametrization for binary hidden Markov modes

part 4: phenomenological load and biological inference. phenomenological load review types of models. Gαβ = 8π Tαβ. Newton.

The 3 Genomic Numbers Discovery: How Our Genome Single-Stranded DNA Sequence Is Self-Designed as a Numerical Whole

USING INVARIANTS FOR PHYLOGENETIC TREE CONSTRUCTION

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

part 3: analysis of natural selection pressure

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Phylogenetic inference

BMI/CS 776 Lecture 4. Colin Dewey

Molecular phylogeny - Using molecular sequences to infer evolutionary relationships. Tore Samuelsson Feb 2016

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Phylogenetic Networks, Trees, and Clusters

Dr. Amira A. AL-Hosary

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

evoglow - express N kit distributed by Cat.#: FP product information broad host range vectors - gram negative bacteria

The Algebraic Degree of Semidefinite Programming

Evolutionary dynamics of abundant stop codon readthrough in Anopheles and Drosophila

HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

Molecular Evolution, course # Final Exam, May 3, 2006

evoglow - express N kit Cat. No.: product information broad host range vectors - gram negative bacteria

Commuting birth-and-death processes

Mathematical Biology. Phylogenetic mixtures and linear invariants for equal input models. B Mike Steel. Marta Casanellas 1 Mike Steel 2

arxiv: v5 [q-bio.pe] 24 Oct 2016

3/1/17. Content. TWINSCAN model. Example. TWINSCAN algorithm. HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

ChemiScreen CaS Calcium Sensor Receptor Stable Cell Line

Re- engineering cellular physiology by rewiring high- level global regulatory genes

Transcription:

Using algebraic geometry for phylogenetic reconstruction Marta Casanellas i Rius (joint work with Jesús Fernández-Sánchez) Departament de Matemàtica Aplicada I Universitat Politècnica de Catalunya IMA Workshop Applications in Biology, Dynamics, and Statistics March 8, 2007 M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 1 / 36

Outline 1 Algebraic evolutionary models 2 Phylogenetic inference using algebraic geometry 3 The geometry of the Kimura variety 4 New results on simulated data M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 2 / 36

Outline 1 Algebraic evolutionary models 2 Phylogenetic inference using algebraic geometry 3 The geometry of the Kimura variety 4 New results on simulated data M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 2 / 36

Outline 1 Algebraic evolutionary models 2 Phylogenetic inference using algebraic geometry 3 The geometry of the Kimura variety 4 New results on simulated data M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 2 / 36

Outline 1 Algebraic evolutionary models 2 Phylogenetic inference using algebraic geometry 3 The geometry of the Kimura variety 4 New results on simulated data M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 2 / 36

Outline 1 Algebraic evolutionary models 2 Phylogenetic inference using algebraic geometry 3 The geometry of the Kimura variety 4 New results on simulated data M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 3 / 36

Phylogenetic reconstruction Given n current species, e.g. HUMAN, GORILLA, CHIMP, given part of their genome and an alignment alignment Goal: to reconstruct their ancestral relationships (phylogeny): M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 4 / 36

Phylogenetic reconstruction Given n current species, e.g. HUMAN, GORILLA, CHIMP, given part of their genome and an alignment alignment Goal: to reconstruct their ancestral relationships (phylogeny): M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 4 / 36

Alignment Mutations, deletions and insertions of nucleotides occur along the speciation process. Given two DNA sequences, an alignment is a correspondence between them that accounts for their differences. The optimal alignment is the one that minimizes the number of mutations, deletions and insertions. seq 1 : ACGTAGCTAAGTTA... seq 2 : ACCGAGACCCAGTA... A possible alignment is: seq 1 A C G T A G C T A A G T T A seq 2 A C C G A G A C C C A G T A M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 5 / 36

Alignment Mutations, deletions and insertions of nucleotides occur along the speciation process. Given two DNA sequences, an alignment is a correspondence between them that accounts for their differences. The optimal alignment is the one that minimizes the number of mutations, deletions and insertions. seq 1 : ACGTAGCTAAGTTA... seq 2 : ACCGAGACCCAGTA... A possible alignment is: seq 1 A C G T A G C T A A G T T A seq 2 A C C G A G A C C C A G T A M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 5 / 36

Phylogenetic reconstruction Given n current species, e.g. HUMAN, GORILLA, CHIMP, given part of their genome and an alignment alignment AACTTCGAGGCTTACCGCTG AAGGTCGATGCTCACCGATG AACGTCTATGCTCACCGATG Goal: to reconstruct their ancestral relationships (phylogeny): M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 6 / 36

Phylogenetic reconstruction Given n current species, e.g. HUMAN, GORILLA, CHIMP, given part of their genome and an alignment alignment AACTTCGAGGCTTACCGCTG AAGGTCGATGCTCACCGATG AACGTCTATGCTCACCGATG Goal: to reconstruct their ancestral relationships (phylogeny): M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 6 / 36

Algebraic evolutionary models Assume that all sites of the alignment evolve equally and independently. At each node of the tree we put a random variable taking values in {A,C,G,T} t i = branch length (represents the number of mutations per site along that branch) Variables at the leaves are observed and variables at the interior nodes are hidden. At each branch we write a matrix (substitution matrix) with the probabilities of a nucleotide at the parent node being substituted by another at its child. S = A C G T A C G T P(A A) P(C A) P(G A) P(T A) P(A C) P(C C) P(G C) P(T C) P(A G) P(C G) P(G G) P(T G) P(A T ) P(C T ) P(G T ) P(T T ) M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 7 / 36

Algebraic evolutionary models Assume that all sites of the alignment evolve equally and independently. At each node of the tree we put a random variable taking values in {A,C,G,T} t i = branch length (represents the number of mutations per site along that branch) Variables at the leaves are observed and variables at the interior nodes are hidden. At each branch we write a matrix (substitution matrix) with the probabilities of a nucleotide at the parent node being substituted by another at its child. S = A C G T A C G T P(A A) P(C A) P(G A) P(T A) P(A C) P(C C) P(G C) P(T C) P(A G) P(C G) P(G G) P(T G) P(A T ) P(C T ) P(G T ) P(T T ) M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 7 / 36

The entries of S i are unknown parameters. Example (Group-based models. Kimura 3-parameters) Root node has uniform distribution: π A =... = π T = 0.25 S i = A C G T A C G T ( ai b i c i d i b i a i d i c i c i d i a i b i d i c i b i a i Kimura 2-parameters: b i = d i. Jukes-Cantor: b i = c i = d i. ), a i + b i + c i + d i = 1 M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 8 / 36

The entries of S i are unknown parameters. Example (Group-based models. Kimura 3-parameters) Root node has uniform distribution: π A =... = π T = 0.25 S i = A C G T A C G T ( ai b i c i d i b i a i d i c i c i d i a i b i d i c i b i a i Kimura 2-parameters: b i = d i. Jukes-Cantor: b i = c i = d i. ), a i + b i + c i + d i = 1 M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 8 / 36

The entries of S i are unknown parameters. Example (Group-based models. Kimura 3-parameters) Root node has uniform distribution: π A =... = π T = 0.25 S i = A C G T A C G T ( ai b i c i d i b i a i d i c i c i d i a i b i d i c i b i a i Kimura 2-parameters: b i = d i. Jukes-Cantor: b i = c i = d i. ), a i + b i + c i + d i = 1 M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 8 / 36

Hidden Markov process We denote the joint distribution of the observed variables X 1, X 2, X 3 as p x1 x 2 x 3 = Prob(X 1 = x 1, X 2 = x 2, X 3 = x 3 ). p x1 x 2 x 3 = π yr S 1 (x 1, y r )S 4 (y 4, y r )S 2 (x 2, y 4 )S 3 (x 3, y 4 ) y 4,y r {A,C,G,T} p x1 x 2 x 3 is a homogeneous polynomial on the parameters whose degree is the number of edges. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 9 / 36

The evolutionary model defines a polynomial map ϕ : R d R 4n θ = (θ 1,..., θ d ) (p AA...A, p AA...C, p AA...G,..., p TT...T ) ϕ : d 1 4n 1 ϕ : C d C 4n Algebraic variety V = imϕ, closure in Zariski topology. Given an alignment, we can estimate the joint probability p x1,x 2,x 3 as the relative frequency of column x 1 x 2 x 3 in the alignment. In the theoretical model, this would be a point on the variety V. Goal: use the ideal of V, I(V ) R[p AA...A, p AA...C,..., p TT...T ], to infer the topology of the phylogenetic tree. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 10 / 36

The evolutionary model defines a polynomial map ϕ : R d R 4n θ = (θ 1,..., θ d ) (p AA...A, p AA...C, p AA...G,..., p TT...T ) ϕ : d 1 4n 1 ϕ : C d C 4n Algebraic variety V = imϕ, closure in Zariski topology. Given an alignment, we can estimate the joint probability p x1,x 2,x 3 as the relative frequency of column x 1 x 2 x 3 in the alignment. In the theoretical model, this would be a point on the variety V. Goal: use the ideal of V, I(V ) R[p AA...A, p AA...C,..., p TT...T ], to infer the topology of the phylogenetic tree. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 10 / 36

The evolutionary model defines a polynomial map ϕ : R d R 4n θ = (θ 1,..., θ d ) (p AA...A, p AA...C, p AA...G,..., p TT...T ) ϕ : d 1 4n 1 ϕ : C d C 4n Algebraic variety V = imϕ, closure in Zariski topology. Given an alignment, we can estimate the joint probability p x1,x 2,x 3 as the relative frequency of column x 1 x 2 x 3 in the alignment. In the theoretical model, this would be a point on the variety V. Goal: use the ideal of V, I(V ) R[p AA...A, p AA...C,..., p TT...T ], to infer the topology of the phylogenetic tree. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 10 / 36

Computing the ideal (Eriksson Ranestad Sturmfels Sullivant 05) Some generators of I(V ) depend only on the model chosen (not on the topology). E.g. for Jukes-Cantor they are: p x1 x 2 x 3 = 1 p AAA = p CCC = p GGG = p TTT 4 terms p AAC = p AAG = p AAT = = p TTG 12 terms p ACA = p AGA = p ATA = = p TGT 12 terms p CAA = p GAA = p TAA = = p GTT 12 terms p ACG = p ACT = p AGT = = p CGT 24 terms For this model, the unique polynomial that detects the phylogeny of the three species has degree 3. Definition (Cavender Felsenstein 87) The generators of I(V ) are called phylogenetic invariants. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 11 / 36

Computing the ideal (Eriksson Ranestad Sturmfels Sullivant 05) Some generators of I(V ) depend only on the model chosen (not on the topology). E.g. for Jukes-Cantor they are: p x1 x 2 x 3 = 1 p AAA = p CCC = p GGG = p TTT 4 terms p AAC = p AAG = p AAT = = p TTG 12 terms p ACA = p AGA = p ATA = = p TGT 12 terms p CAA = p GAA = p TAA = = p GTT 12 terms p ACG = p ACT = p AGT = = p CGT 24 terms For this model, the unique polynomial that detects the phylogeny of the three species has degree 3. Definition (Cavender Felsenstein 87) The generators of I(V ) are called phylogenetic invariants. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 11 / 36

Computing the ideal (Eriksson Ranestad Sturmfels Sullivant 05) Some generators of I(V ) depend only on the model chosen (not on the topology). E.g. for Jukes-Cantor they are: p x1 x 2 x 3 = 1 p AAA = p CCC = p GGG = p TTT 4 terms p AAC = p AAG = p AAT = = p TTG 12 terms p ACA = p AGA = p ATA = = p TGT 12 terms p CAA = p GAA = p TAA = = p GTT 12 terms p ACG = p ACT = p AGT = = p CGT 24 terms For this model, the unique polynomial that detects the phylogeny of the three species has degree 3. Definition (Cavender Felsenstein 87) The generators of I(V ) are called phylogenetic invariants. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 11 / 36

Problem: computation of invariants Computational algebra software fail to compute the ideal for 4 species! Kimura 3-parameter, 4 species, 8002 generators like: [Small trees webpage] M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 12 / 36

Problem: computation of invariants Computational algebra software fail to compute the ideal for 4 species! Kimura 3-parameter, 4 species, 8002 generators like: [Small trees webpage] M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 12 / 36

Group-based models. Discrete Fourier transform Group-based models (Kimura and Jukes-Cantor): G = Z 2 Z 2, A=(0,0),C=(0,1),G=(1,0),T=(1,1). Substitution matrices are functions on the group: S i (g p, g c ) = f i (g p g c ), g c, g p G. p g1,...,g m = p(g 1,..., g m ) = 1 4 e edges sum over all possible values at interior nodes. f e (g p(e) g c(e) ) Discrete Fourier transform f : G C ˆf (χ) = χ(g)f (g), χ Hom(G, C ) = G g G Convolution: (f 1 f 2 )(g) = h G f 1 (h)f 2 (g h) = f 1 f 2 = f 1 f 2 M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 13 / 36

Group based models. Theorem (Evans Speed) For a group-based model (i.e. Kimura 3, Kimura 2 or Jukes-Cantor) on a tree T, the discrete Fourier transform of the joint distribution p(g 1,..., g n ) has the following form q(χ 1,..., χ m ) = e edges f e( l leaves below e χ l ) Using a linear change of coordinates, one gets a monomial parameterization of the variety associated to the model ϕ : C d C 4n θ = ( θ 1,..., θ d ) (q AA...A, q AA...C, q AA...G,..., q TT...T ) so that it is a toric variety. The ideal is generated by binomials. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 14 / 36

Fourier transform for Kimura 3-parameter model Parameters S e = a e b e c e d e b e a e d e c e c e d e a e b e d e c e b e a e Fourier parameters Pe = simplex: 3 3 a e+b e+c e+d e=1 P e A =1 PA e 0 0 0 0 PC e 0 0 0 0 PG e 0 0 0 0 PT e P e A =ae +b e +c e +d e, P e C =ae b e +c e d e... coordinates: p x1...x n q g1...g n, g 1 + +g n=0 simplex: 4n 1 4n 1 1 px1...x n = 1 q A...A = 1 Linear invariants: q g1...g n = 0 if g 1 + + g n 0 in Z 2 Z 2. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 15 / 36

For the Kimura 3-parameter model, we are interested in V := im(ϕ) where ϕ : e E(T ) C 4 C 4n 1 (P e A,Pe C,Pe G,Pe T ) e (q x1...x n ) {x1,...,x n x 1 + +x n=0} But 4n 1 1 {q A...A = 1}. Definition The Kimura variety of the phylogenetic tree T is W := V {q A...A = 1}. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 16 / 36

Recursive construction of invariants Computational algebra software: even in Fourier parameterization, fail for 5 leaves. Theorem (Sturmfels Sullivant 05) For any group-based model, they give an explicit algorithm for obtaining the generators of the ideal of phylogenetic invariants of an n-leaved tree from those of an unrooted tree of 3 leaves. The generators are binomials of degree 4. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 17 / 36

Outline 1 Algebraic evolutionary models 2 Phylogenetic inference using algebraic geometry 3 The geometry of the Kimura variety 4 New results on simulated data M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 18 / 36

Phylogenetic inference using algebraic geometry 1990-1995: Biologists claim that phylogenetic invariants are not useful for phylogenetic reconstruction. Lake only used linear invariants! In particular, a work of Huelsenbeck 95 reveals the inefficiency of Lake s method of invariants. (Eriksson 05) for General Markov Model. (C Garcia Sullivant 05) for Kimura 3-parameter. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 19 / 36

Phylogenetic inference using algebraic geometry 1990-1995: Biologists claim that phylogenetic invariants are not useful for phylogenetic reconstruction. Lake only used linear invariants! In particular, a work of Huelsenbeck 95 reveals the inefficiency of Lake s method of invariants. (Eriksson 05) for General Markov Model. (C Garcia Sullivant 05) for Kimura 3-parameter. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 19 / 36

Phylogenetic inference using algebraic geometry Model: Unrooted tree of 4 leaves, Kimura 3-parameters. Naive algorithm: Take 1 of the evaluation of all invariants on the point given by the relative frequencies of the columns of the alignment. Obtain a score for each possible topology and choose the one with smaller score. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 20 / 36

Phylogenetic inference using algebraic geometry Model: Unrooted tree of 4 leaves, Kimura 3-parameters. Naive algorithm: Take 1 of the evaluation of all invariants on the point given by the relative frequencies of the columns of the alignment. Obtain a score for each possible topology and choose the one with smaller score. Huelsenbeck 95: assesses the performance of different phylogenetic reconstruction methods on the following tree space. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 20 / 36

Results on simulated data (Huelsenbeck) Huelsenbeck s results: Length 100 Length 500 Length 1000 Lake invariants NJ ML Huelsenbeck, Syst. Biol., 1995 M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 21 / 36

C Fernandez-Sanchez studies on C Garcia Sullivant method Length 100 Length 500 Length 1000 M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 22 / 36

Why using phylogenetic invariants 1 We do not need to estimate parameters of the model. 2 The algebraic model allows species within the tree to evolve at different mutation rates In the non-algebraic version, one has: S i = exp(q t i ) where Q is a fixed matrix that represents the instantaneous mutation rate of all the species in the tree. In the algebraic model, one is allowing different rates at each branch of the tree. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 23 / 36

For instance using algebraic geometry one should be able to reconstruct the biologically correct tree: dog human chimp rat mouse chicken (Al-Aidroos, Snir 05) chapter 21 in ASCB book. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 24 / 36

For instance using algebraic geometry one should be able to reconstruct the biologically correct tree: dog human chimp rat mouse chicken (Al-Aidroos, Snir 05) chapter 21 in ASCB book. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 24 / 36

For instance using algebraic geometry one should be able to reconstruct the biologically correct tree: dog human chimp rat mouse chicken dog human chimp rat mouse chicken (Al-Aidroos, Snir 05) chapter 21 in ASCB book. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 24 / 36

Simulations on non-homogeneous trees (different rate matrices) M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 25 / 36

Outline 1 Algebraic evolutionary models 2 Phylogenetic inference using algebraic geometry 3 The geometry of the Kimura variety 4 New results on simulated data M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 26 / 36

The global geometry of the Kimura variety Recall that V := im(ϕ) where ϕ : C 4 C 4n 1 e E(T ) (P e A,Pe C,Pe G,Pe T ) e (q x1...x n ) {x1,...,x n x 1 + +x n=0} But 4n 1 1 {q A...A = 1}. Definition The Kimura variety of the phylogenetic tree T is W := V {q A...A = 1}. W = ϕ( e E(T ) (C4 {P e A = 1})). M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 27 / 36

Local complete intersection V A N algebraic variety, µ minimum number of generators of I(V ), then µ codim(v ) = N dim(v ). V is called a complete intersection if µ = codim(v ). Example: the Kimura variety W C 4n 1 1 has codimension 4 n 1 6n + 8, n = number of leaves. For n = 4, the minimum number of generators of the Kimura variety is 8002 whereas its codimension is 48! On a neighborhood of a smooth point, any variety is a local complete intersection (i.e. it can be defined by codim(v ) equations). Goal: 1 Find the singular points. 2 Provide a set of local generators at non-singular points. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 28 / 36

Local complete intersection V A N algebraic variety, µ minimum number of generators of I(V ), then µ codim(v ) = N dim(v ). V is called a complete intersection if µ = codim(v ). Example: the Kimura variety W C 4n 1 1 has codimension 4 n 1 6n + 8, n = number of leaves. For n = 4, the minimum number of generators of the Kimura variety is 8002 whereas its codimension is 48! On a neighborhood of a smooth point, any variety is a local complete intersection (i.e. it can be defined by codim(v ) equations). Goal: 1 Find the singular points. 2 Provide a set of local generators at non-singular points. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 28 / 36

Local complete intersection V A N algebraic variety, µ minimum number of generators of I(V ), then µ codim(v ) = N dim(v ). V is called a complete intersection if µ = codim(v ). Example: the Kimura variety W C 4n 1 1 has codimension 4 n 1 6n + 8, n = number of leaves. For n = 4, the minimum number of generators of the Kimura variety is 8002 whereas its codimension is 48! On a neighborhood of a smooth point, any variety is a local complete intersection (i.e. it can be defined by codim(v ) equations). Goal: 1 Find the singular points. 2 Provide a set of local generators at non-singular points. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 28 / 36

Case n = 3 The parameterization ϕ is: ϕ : C 9 C 15 ((1,P 1 C,P1 G,P1 T ),(1,P2 C,P2 G,P2 T ),(1,P3 C,P3 G,P3 T )) (qxyz=p1 x P2 y P3 z ) {x+y+z=0} Let H = (Z 2 Z 2, ) and let (ε, δ) H act on C 9 sending (P 1, P 2, P 3 ) to ((1,εP 1 C,δP1 G,εδP1 T ),(1,εP2 C,δP2 G,εδP2 T ),(1,εP3 C,δP3 G,εδP3 T )). E.g. (ε, δ) = (1, 1), in probability parameters the group action permutes A C and G T. Proposition The Kimura variety W = im(ϕ) on T 3 is the affine GIT quotient C 9 //H. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 29 / 36

Arbitrary n T unrooted n-leaved tree (2n 3 edges, n 2 interior nodes) Extending the action of H to the n 2 interior nodes of T, we have Theorem The Kimura variety W is isomorphic to the affine GIT quotient Corollary (C 3 ) 2n 3 //H n 2 W = im(ϕ), no closure needed. ϕ 1 (q) 4 n 2 and there is just one preimage with biological meaning, q W. We are able to determine the singular points. In particular, biologically meaningful points q W + := ϕ( 3 + ) are not singular. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 30 / 36

Arbitrary n T unrooted n-leaved tree (2n 3 edges, n 2 interior nodes) Extending the action of H to the n 2 interior nodes of T, we have Theorem The Kimura variety W is isomorphic to the affine GIT quotient Corollary (C 3 ) 2n 3 //H n 2 W = im(ϕ), no closure needed. ϕ 1 (q) 4 n 2 and there is just one preimage with biological meaning, q W. We are able to determine the singular points. In particular, biologically meaningful points q W + := ϕ( 3 + ) are not singular. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 30 / 36

Arbitrary n T unrooted n-leaved tree (2n 3 edges, n 2 interior nodes) Extending the action of H to the n 2 interior nodes of T, we have Theorem The Kimura variety W is isomorphic to the affine GIT quotient Corollary (C 3 ) 2n 3 //H n 2 W = im(ϕ), no closure needed. ϕ 1 (q) 4 n 2 and there is just one preimage with biological meaning, q W. We are able to determine the singular points. In particular, biologically meaningful points q W + := ϕ( 3 + ) are not singular. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 30 / 36

Arbitrary n T unrooted n-leaved tree (2n 3 edges, n 2 interior nodes) Extending the action of H to the n 2 interior nodes of T, we have Theorem The Kimura variety W is isomorphic to the affine GIT quotient Corollary (C 3 ) 2n 3 //H n 2 W = im(ϕ), no closure needed. ϕ 1 (q) 4 n 2 and there is just one preimage with biological meaning, q W. We are able to determine the singular points. In particular, biologically meaningful points q W + := ϕ( 3 + ) are not singular. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 30 / 36

In Fourier parameters: 3 + = In probability parameters this is transformed into: S = A C G T A C G T ( a b c d b a d c c d a b d c b a ) a + b + c + d = 1 a + b c d > 0 a b + c d > 0 a b c + d > 0 M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 31 / 36

In Fourier parameters: 3 + = In probability parameters this is transformed into: S = A C G T A C G T ( a b c d b a d c c d a b d c b a ) a + b + c + d = 1 a + b > 1/2 a + c > 1/2 a + d > 1/2 M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 31 / 36

Local complete intersection Case n = 3 (Sturmfels Sullivant 05): I(V ) is minimally generated by 16 cubics and 18 quartics. Lemma The following six quartics q AAA q ATT q TCG q TGC q ACC q AGG q TAT q TTA, q CCA q CTG q TAT q TGC q CAC q CGT q TCG q TTA, q AGG q ATT q CAC q CCA q AAA q ACC q CGT q CTG, q ACC q ATT q GAG q GGA q AAA q AGG q GCT q GTC, q CAC q CTG q GCT q GGA q CCA q CGT q GAG q GTC, q GGA q GTC q TAT q TCG q GAG q GCT q TGC q TTA generate a local complete intersection that defines W at each point q W +. It does not depend on the point q. Skip proof M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 32 / 36

Proof. 1 J = (f 1..., f 6 ) defines a variety X containing W and both have the same dimension. 2 The points q W + are smooth points of X and W. 3 Both varieties coincide locally. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 33 / 36

Local complete intersection Arbitrary n quartics: quadrics: Q, non-redundant 2 2 minors from flattening Theorem J 3 J n 1 Q generate a local complete intersection that defines W at the biologically meaningful points q W +. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 34 / 36

Local complete intersection Arbitrary n quartics: quadrics: Q, non-redundant 2 2 minors from flattening Theorem J 3 J n 1 Q generate a local complete intersection that defines W at the biologically meaningful points q W +. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 34 / 36

Local complete intersection Arbitrary n quartics: J 3 quadrics: Q, non-redundant 2 2 minors from flattening Theorem J 3 J n 1 Q generate a local complete intersection that defines W at the biologically meaningful points q W +. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 34 / 36

Local complete intersection Arbitrary n quartics: J 3 quadrics: Q, non-redundant 2 2 minors from flattening Theorem J 3 J n 1 Q generate a local complete intersection that defines W at the biologically meaningful points q W +. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 34 / 36

Local complete intersection Arbitrary n quartics: J 3 J n 1 quadrics: Q, non-redundant 2 2 minors from flattening Theorem J 3 J n 1 Q generate a local complete intersection that defines W at the biologically meaningful points q W +. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 34 / 36

Local complete intersection Arbitrary n quartics: J 3 J n 1 quadrics: Q, non-redundant 2 2 minors from flattening Theorem J 3 J n 1 Q generate a local complete intersection that defines W at the biologically meaningful points q W +. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 34 / 36

Local complete intersection Arbitrary n quartics: J 3 J n 1 quadrics: Q, non-redundant 2 2 minors from flattening Theorem J 3 J n 1 Q generate a local complete intersection that defines W at the biologically meaningful points q W +. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 34 / 36

Outline 1 Algebraic evolutionary models 2 Phylogenetic inference using algebraic geometry 3 The geometry of the Kimura variety 4 New results on simulated data M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 35 / 36

New results on simulated data 4 leaves Instead of all generators, 48 polynomials that generate locally Length 100 Length 500 Length 1000 M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 36 / 36

New results on simulated data 4 leaves Instead of all generators, 48 polynomials that generate locally Length 100 Length 500 Length 1000 M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 36 / 36

New results on simulated data 4 leaves Instead of all generators, 48 polynomials that generate locally Length 100 Length 500 Length 1000 (Eriksson Yao 07) Machine learning approach, 52 invariants that perform better on this parameter space. M. Casanellas (UPC) Using algebraic geometry... IMA, March 2007 36 / 36