Jed Chou. April 13, 2015

Similar documents
Species Tree Inference using SVDquartets

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

Quartet Inference from SNP Data Under the Coalescent Model

first (i.e., weaker) sense of the term, using a variety of algorithmic approaches. For example, some methods (e.g., *BEAST 20) co-estimate gene trees

Phylogenetic Geometry

ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support

Introduction to Algebraic Statistics

Evolutionary Tree Analysis. Overview

Anatomy of a species tree

CS 581 Paper Presentation

Weighted Quartets Phylogenetics

Algebraic Statistics Tutorial I

STEM-hy: Species Tree Estimation using Maximum likelihood (with hybridization)

Large Scale Data Analysis Using Deep Learning

Upcoming challenges in phylogenomics. Siavash Mirarab University of California, San Diego

Phylogenetic invariants versus classical phylogenetics

Phylogenetic Algebraic Geometry

Linear Algebra (Review) Volker Tresp 2018

Identifiability of the GTR+Γ substitution model (and other models) of DNA evolution

Workshop III: Evolutionary Genomics

Introduction to Numerical Linear Algebra II

ELIZABETH S. ALLMAN and JOHN A. RHODES ABSTRACT 1. INTRODUCTION

Singular Value Decomposition

Properties of Consensus Methods for Inferring Species Trees from Gene Trees

Singular Value Decompsition

Linear Algebra Review. Vectors

Molecular Evolution & Phylogenetics

Phylogenetic Tree Reconstruction

Reconstruction of species trees from gene trees using ASTRAL. Siavash Mirarab University of California, San Diego (ECE)

Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting

Linear Algebra (Review) Volker Tresp 2017

Knowledge Discovery and Data Mining 1 (VO) ( )

19 Tree Construction using Singular Value Decomposition

Homework 1. Yuan Yao. September 18, 2011

Phylogenetic Networks, Trees, and Clusters

X X (2) X Pr(X = x θ) (3)

Ir O D = D = ( ) Section 2.6 Example 1. (Bottom of page 119) dim(v ) = dim(l(v, W )) = dim(v ) dim(f ) = dim(v )

arxiv: v1 [q-bio.pe] 3 May 2016

Fast coalescent-based branch support using local quartet frequencies

Quantum Computing Lecture 2. Review of Linear Algebra

Elementary linear algebra

arxiv: v1 [q-bio.pe] 1 Jun 2014

Chapter 2 Spectra of Finite Graphs

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology

Multi-Linear Mappings, SVD, HOSVD, and the Numerical Solution of Ill-Conditioned Tensor Least Squares Problems

ELEMENTARY SUBALGEBRAS OF RESTRICTED LIE ALGEBRAS

Combinatorial Aspects of Tropical Geometry and its interactions with phylogenetics

(1) for all (2) for all and all

Bruce Hendrickson Discrete Algorithms & Math Dept. Sandia National Labs Albuquerque, New Mexico Also, CS Department, UNM

AMS526: Numerical Analysis I (Numerical Linear Algebra)

Open Problems in Algebraic Statistics

Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Matrices and Vectors

Math 321: Linear Algebra

Inferring Phylogenetic Trees. Distance Approaches. Representing distances. in rooted and unrooted trees. The distance approach to phylogenies

Linear Algebra and Dirac Notation, Pt. 3

The Generalized Neighbor Joining method

Dimension reduction, PCA & eigenanalysis Based in part on slides from textbook, slides of Susan Holmes. October 3, Statistics 202: Data Mining

Linear Algebra Review

arxiv: v1 [cs.cc] 9 Oct 2014

AM 205: lecture 8. Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas

Quick Tour of Linear Algebra and Graph Theory

MATH 215B HOMEWORK 5 SOLUTIONS

Singular Value Decomposition and Principal Component Analysis (PCA) I

EE/ACM Applications of Convex Optimization in Signal Processing and Communications Lecture 2

Maths for Signals and Systems Linear Algebra in Engineering

EE731 Lecture Notes: Matrix Computations for Signal Processing

Statistical Inference on Large Contingency Tables: Convergence, Testability, Stability. COMPSTAT 2010 Paris, August 23, 2010

Math 321: Linear Algebra

Spring 2018 CIS 610. Advanced Geometric Methods in Computer Science Jean Gallier Homework 3

arxiv: v1 [q-bio.pe] 16 Aug 2007

Transportation Problem

Singular Value Decomposition

We use the overhead arrow to denote a column vector, i.e., a number with a direction. For example, in three-space, we write

Parameterizing orbits in flag varieties

Probabilistic Latent Semantic Analysis

Multilinear Singular Value Decomposition for Two Qubits

Review of Some Concepts from Linear Algebra: Part 2

Constructing Evolutionary/Phylogenetic Trees

In comparisons of genomic sequences from multiple species, Challenges in Species Tree Estimation Under the Multispecies Coalescent Model REVIEW

The impact of missing data on species tree estimation

Linear Algebra Massoud Malek

ACO Comprehensive Exam October 14 and 15, 2013

Materials engineering Collage \\ Ceramic & construction materials department Numerical Analysis \\Third stage by \\ Dalya Hekmat

Math Bootcamp An p-dimensional vector is p numbers put together. Written as. x 1 x =. x p

Bare minimum on matrix algebra. Psychology 588: Covariance structure and factor models

Estimating Evolutionary Trees. Phylogenetic Methods

Matrix Representation

CS60021: Scalable Data Mining. Dimensionality Reduction

Notes on the Matrix-Tree theorem and Cayley s tree enumerator

A Least Squares Formulation for Canonical Correlation Analysis

More Linear Algebra. Edps/Soc 584, Psych 594. Carolyn J. Anderson

arxiv: v1 [math.st] 22 Jun 2018

Phylogeny: traditional and Bayesian approaches

Linear Algebra. Shan-Hung Wu. Department of Computer Science, National Tsing Hua University, Taiwan. Large-Scale ML, Fall 2016

5.6. PSEUDOINVERSES 101. A H w.

PROBLEMS ON LINEAR ALGEBRA

.. CSC 566 Advanced Data Mining Alexander Dekhtyar..

Transcription:

of of CS598 AGB April 13, 2015

Overview of 1 2 3 4 5

Competing Approaches of Two competing approaches to species tree inference: Summary methods: estimate a tree on each gene alignment then combine gene trees. - MP-est - NJst - ASTRAL-II Concatenation: concatenate all gene alignments and estimate species tree on resulting supermatrix. - CA-ML (RAxML)

Challenges for Summary and Concatenation of Challenges for summary methods: Long alignments are unreliable due to recombination Gene trees estimated on short alignments have estimation error Summary methods are sensitive to gene tree estimation error Challenges for concatenation: Ignores incomplete lineage sorting Can be statistically inconsistent under the coalescent model (Roch and Steel, 2014)

κ-state GTR model of The κ-state analytic General Time-Reversible model C GTR(κ) has parameters: Notation species tree topology S vector of speciation times τ = (τ 1, τ 2,..., τ n 1 ) effective population size θ substitution matrix P parameter space U S R M probability simplex κ4 1 = {(p 1,..., p κ 4) R κ4 κ 4 i=1 p i = 1 and p i 0} parameterization map ψ S : U S κ4 1

Splits of Definition A split L 1 L 2 of a set L of taxa is a bipartition of L into two non-overlapping subsets L 1 and L 2 ; it is valid for a tree T if it is induced by an edge, i.e. if L 1 L 2 C(T ). Alternatively, L 1 L 2 is valid if the subtrees on L 1 and L 2 do not intersect.

Flattenings of Definition Let L 1 L 2 be a split and P ψ S (U S ), a κ...κ tensor. A flattening of P, Flat L1 L 2 (P), is a κ L 1 κ L 2 matrix whose rows are indexed by possible states for the leaves in L 1 and columns by possible states in L 2. Example For a 4-taxon tree and split ad bc paaaa p AACA p ATTA Flat ad bc (P(S,τ) ) = paaac paacc p ATTC...... ptaat ptact p TTTT

Invariants of Definition An invariant is a function in the site pattern probabilities that vanishes when evaluated on any distribution arising from the model. Examples For any species tree (S, τ) on n taxa under a κ-state GTR model, i j [κ] p i 1...i n (S,τ) 1 = 0. For a species tree S = ((a, (b, (c, d))), p ij (S,τ) p ji (S,τ) = 0.

A Lemma of Definition Let R = {f 1,..., f n } be a set of analytic functions on a connected, open set D R m. The analytic variety V (R) is the set of common zeros of f 1,..., f n : Definition V (R) = {x D f i (x) = 0, 1 i n} Fix a coalescent phylogenetic model ψ S : U S κ4 1. An analytic function f is called a coalescent phylogenetic invariant if f (x) = 0 for all x ψ S (U S ).

of Let (S, τ) be a 4-taxon symmetric or asymmetric species tree with cherry (c, d). Identify the parameter space U S with a full dimensional subset of R M. Theorem 1 If L 1 L 2 is a valid split for S, then for all distributions P (S,τ), ( ) κ + 1 rank(flat L1 L 2 (P(S,τ) )) 2 Theorem 2 If L 1 L 2 is not a valid split for S, then generically ( ) κ + 1 rank(flat L1 L 2 (P(S,τ) )) > 2

Sketch of Proof Theorem 1 of 1 Suppose L 1 L 2 is a valid split for S. 2 Both symmetric and asymmetric 4-taxon trees have cherry (c, d), so columns in Flat L1 L 2 (P(S,τ) ) labeled by the cd-indices kl and lk are identical for k l [κ]. 3 There are ( κ 2) such pairs, so rank(flat L1 L 2 (P(S,τ) )) κ2 ( ) ( κ 2 = κ+1 ) 2.

Sketch of Proof Theorem 2 of 1 Suppose L 1 L 2 is not a valid split for S. 2 Let X ac bd and X ad bc be the sets of all ( ( ) κ+1 2 ) + 1)-minors of Flatac bd and Flat ad bc respectively and let V ac bd = V (X ac bd ), V ad bc = V (X ad bc ). 3 Note: The rank of a matrix is the maximal order of a non-zero minor. 4 Let W = V ({f ψ S }) where f : κ4 1 R are analytic functions that vanish on V ac bd V ad bc. 5 W is an analytic subvariety of U S R M and in fact it is proper. 6 So dim(w ) < dim(u S ) and W has measure 0 in U S.

Sketch of Proof Theorem 2 of

Frobenius Norm of Definition The Frobenius Norm of an n m matrix A is m n A F = Singular Values i=1 j=1 The singular values of a square matrix A are the square roots of the eigenvalues of A A where A is the conjugate transpose of A. a 2 ij

Some Linear Algebra of Theorem p A F = i=1 where σ 1 σ 2...σ p are the singular values of A and p = min{m, n}. Theorem (Eckart-Young, 1936) For any k p, min A B F = rank(b)=k σ 2 i p i=k+1 σ 2 i

SVD score of Definition The SVD score of a split L 1 L 2 for a 4-state GTR model species tree on 4 taxa is SVD(L 1 L 2 ) = 16 Consequences i=11 For a valid split L 1 L 2, rank(flat L1 L 2 (P(S,τ) )) ( 5 2), so σ 11 =... = σ 16 = 0 and SVD(L 1 L 2 ) = 0. For a non-valid split L 1 L 2, rank(flat L1 L 2 (P(S,τ) )) > ( 5 2), so σ 11 0 and SVD(L 1 L 2 ) > 0. σ 2 i

Inferring the Species Tree of To infer the species tree on n taxa given a collection of gene alignments: 1 For a set L = {a, b, c, d} in S, estimate the flattening matrices of site pattern probabilities for the 3 possible splits by picking a single site from each gene alignment and counting the frequencies of all 256 site patterns (AAAA, AAAT, etc.) among the selected sites. 2 Compute the SVD score of each split L 1 L 2 from these flattening matrices. 3 Pick the split with the SVD score closest to 0 and return the associated quartet tree. Do this for every set of 4 taxa in S to get a collection of quartet trees. 4 Combine all quartet trees with a quartet method.

of We are currently studying combined with quartet methods QMC and wqmc. Some directions for further research include: Rigorous experiments under a variety of conditions Comparison to summary methods and concatention Quartet subsampling to reduce running time Better approaches to select sites from each gene alignment Statistical consistency in the presence of HGT or ILS+HGT

References of Julia Chifman and Laura Kubatko (2014) Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes arxiv:1406.4811v1 Julia Chifman and Laura Kubatko (2014) Quartet Inference from SNP Data Under the Coalescent Model Bioinformatics Advance Access