Recent Advances in Phylogeny Reconstruction from Gene-Order Data

Bernard M.E. Moret
Department of Computer Science
University of New Mexico
Albuquerque, NM 87131

Department Colloquium

Collaborators and Support

Collaborators:
- University of Texas, Austin: Tandy Warnow (Computer Science); David Hillis, Robert Jansen, Randy Linder (Biology)
- University of New Mexico: David Bader (Electrical & Comp. Eng.)

Funding: National Science Foundation
- at UNM: 6 grants for $2 million over 5 years
- with UT Austin: 10 grants for $8 million

Overview

- Phylogenies
- Gene-order data: mitochondrion and chloroplast genomes
- Inversion and other genomic distance measures
- Estimating the true evolutionary distance
- Fast convergence for reconstruction methods
- GRAPPA news

Phylogenies

A phylogeny is a reconstruction of the evolutionary history of a collection of organisms; it usually takes the form of a tree.
- Modern organisms are placed at the leaves, and ancestral organisms occupy internal nodes.
- The edges of the tree denote evolutionary relationships.

12 Species of Campanulaceae

[Figure: phylogenetic tree with edge lengths. Taxa: Wahlenbergia, Merciera, Trachelium, Symphyandra, Campanula, Adenophora, Legousia, Asyneuma, Triodanus, Codonopsis, Cyananthus, Platycodon, and Tobacco.]

Herpes Viruses that Affect Humans

[Figure: phylogenetic tree. Taxa: HVS, EHV2, KHSV, EBV, HSV1, HSV2, PRV, EHV1, HHV6, VZV, HHV7, HCMV.]

A Large Phylogeny: 500 Green Plants

[Figure: phylogenetic tree of 500 green plants.]

Reconstructing Phylogenies

Reconstructing phylogenies is a major component of modern research programs in many areas of biology and medicine:
- pharmaceutical research for drug discovery (the most famous example is the herbicide Roundup™)
- understanding rapidly mutating viruses (HIV)
- designing enhanced organisms (rice, wheat)
- explaining and predicting gene expression
- explaining and predicting ligands
- most centrally, understanding genomic evolution

Reconstructing Phylogenies (cont'd)

- Requires a model of tree evolution (e.g., random or birth-death)
- Requires a model of DNA/RNA/codon/gene-order/etc. evolution (e.g., Markov models with weight matrices such as Jukes-Cantor and Kimura)
- Requires an optimization criterion that relates to the previous two models (e.g., likelihood or parsimony)
- Requires data with sufficient signal (to recover defining information)

Computational Phylogenetics

- Is extremely computation-intensive.
- Is viewed very differently by biologists (one dataset only, accuracy first) and by computer scientists (efficiency first).
- Sequence data (RNA, DNA, and amino-acid) has been used for over 20 years and is fairly well understood, but the methods do not scale up.
- Genomic data (gene order and content of whole genomes) provides new information, but is much harder to analyze than sequence data.

Gene-Order Data

Certain genomes evolve mostly through rearrangement of the order of genes, with occasional gene duplication or gene loss.
- A chloroplast is a semi-independent organism that lives within plant cells and allows them to photosynthesize. Chloroplasts have one circular chromosome with 120 genes.
- A mitochondrion is a semi-independent organism that lives within animal and some plant cells and supplies them with energy. Mitochondria have one circular chromosome with 40 genes in animals, more in plants.

Mitochondria

[Figure: mitochondrial genomes of Homo sapiens, Felis catus, Lumbricus terrestris, and Saccharomyces cerevisiae.]

Chloroplasts

[Figure: chloroplast genomes of Cyanidium caldarium and Zea mays.]

Phylogenies from Gene-Order Data

Optimization target: reconstruct the phylogeny with the least total number of genomic changes.
This is an application of Occam's razor; biologists call it the principle of parsimony.

True Evolutionary Distances

- True Evolutionary Distance (T.E.D.): the actual number of events along an edge of the tree.
- Edit Distance: the minimum number of events from one end of a tree edge to the other.
- We obtain better topological accuracy with T.E.D.s than with edit distances.
- T.E.D. can only be estimated.

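True Evolutionary Distance

[Figure: tree on leaves A, B, C, D. A (edge length 2) and B (edge length 1) attach to one internal node; C (edge length 6) and D (edge length 3) attach to the other; the internal edge has length 4. An arrow labelled "Polynomial Time" links the tree and its distance matrix.]

     A    B    C    D
A    0    3   12    9
B         0   11    8
C              0    9
D                   0

The tree and, a fortiori, its edge lengths are not known.
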
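The link between the tree and its matrix can be checked with a short sketch: each matrix entry is the sum of the edge lengths on the path between two leaves. The internal node names `x` and `y` and the function name `tree_distances` are hypothetical, and the placement of the edge lengths is inferred from the matrix on the slide.

```python
from collections import defaultdict

def tree_distances(edges, leaves):
    """Return path-length distances between all ordered pairs of distinct leaves."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    dist = {}
    for src in leaves:
        # Walk the tree from src, accumulating the path length as we go.
        stack = [(src, None, 0)]
        while stack:
            node, parent, d = stack.pop()
            if node in leaves and node != src:
                dist[(src, node)] = d
            for nxt, w in adj[node]:
                if nxt != parent:
                    stack.append((nxt, node, d + w))
    return dist

# Assumed layout: A and B hang off internal node x, C and D off internal node y.
edges = [("A", "x", 2), ("B", "x", 1), ("x", "y", 4), ("C", "y", 6), ("D", "y", 3)]
d = tree_distances(edges, {"A", "B", "C", "D"})
print(d[("A", "C")])  # 2 + 4 + 6 = 12
```

Going from the tree to the matrix is the easy direction; the reconstruction problem is the reverse, where only estimated distances are available.
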
Rearrangement Events

[Figure: a circular genome on genes 1 to 8 and the results of three rearrangement events: inversion, transposition, and inverted transposition.]

Generalized Nadeau-Taylor Model

- Events: inversions, transpositions, and inverted transpositions
- All events of the same type are equiprobable
- Assign probabilities to the different event types:
  - Transposition: $\alpha$
  - Inverted transposition: $\beta$
  - Inversion: $1 - \alpha - \beta$

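A minimal simulation sketch of this model follows. It assumes a linear signed genome stored as a list of nonzero integers (the slides use circular genomes), and it treats a transposition as the exchange of two adjacent segments chosen by three uniformly random cut points. All function names are hypothetical.

```python
import random

def inversion(g, i, j):
    """Reverse the segment g[i:j] and flip the sign of each gene in it."""
    return g[:i] + [-x for x in reversed(g[i:j])] + g[j:]

def transposition(g, i, j, k, inverted=False):
    """Move the segment g[i:j] to just after g[j:k], optionally inverting it."""
    moved = g[i:j]
    if inverted:
        moved = [-x for x in reversed(moved)]
    return g[:i] + g[j:k] + moved + g[k:]

def random_event(g, alpha, beta, rng):
    """Transposition with prob. alpha, inverted transposition with prob. beta, else inversion."""
    cuts = range(len(g) + 1)
    r = rng.random()
    if r < alpha + beta:
        i, j, k = sorted(rng.sample(cuts, 3))
        return transposition(g, i, j, k, inverted=(r >= alpha))
    i, j = sorted(rng.sample(cuts, 2))
    return inversion(g, i, j)

def evolve(g, k, alpha, beta, seed=0):
    """Apply k random events to a copy of g."""
    rng = random.Random(seed)
    g = list(g)
    for _ in range(k):
        g = random_event(g, alpha, beta, rng)
    return g

print(evolve(list(range(1, 9)), 3, alpha=1/3, beta=1/3))
```

Setting `alpha = beta = 0` gives the inversion-only case, and `alpha = beta = 1/3` makes the three event classes equiprobable.
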
Breakpoint Distance

$D_{BP}(G, G')$ = number of breakpoints in $G$ with respect to $G'$

Example: G = (1 2 3 4 5 6 7 8), G' = (1 2 5 4 3 6 7 8)

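One way to count breakpoints is sketched below, assuming linear signed genomes framed by 0 and n+1, where an adjacency (a, b) also counts as present when it is read from the other strand as (-b, -a). The function name is hypothetical, and the example writes the rearranged block with explicit signs (-5 -4 -3), which is an assumption about the slide's example.

```python
def breakpoint_distance(g, h):
    """Number of adjacencies of g that do not occur in h (signed, linear genomes)."""
    n = len(g)
    framed_g = [0] + list(g) + [n + 1]
    framed_h = [0] + list(h) + [n + 1]
    adjacencies_h = set()
    for a, b in zip(framed_h, framed_h[1:]):
        adjacencies_h.add((a, b))
        adjacencies_h.add((-b, -a))  # the same adjacency read on the other strand
    return sum((a, b) not in adjacencies_h for a, b in zip(framed_g, framed_g[1:]))

G = [1, 2, 3, 4, 5, 6, 7, 8]
G_prime = [1, 2, -5, -4, -3, 6, 7, 8]
print(breakpoint_distance(G, G_prime))  # 2: the adjacencies (2, 3) and (5, 6) are broken
```
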
Genomic Distances

- BP: breakpoint distance.
- INV [Moret, Bader, Yan WADS 2001]: minimum number of inversions required to transform one genome into another.
- IEBP [Wang, Warnow STOC 01]: approximate the expected breakpoint distance with provable error.
- Exact IEBP [Wang WABI 01]: invert the expected breakpoint distance.
- EDE [Moret, Wang, Warnow, Wyman ISMB 01]: estimate the expected inversion distance using simulation data.

Exact IEBP: Basic Idea

- Let $G_0$ be the starting genome and $G_k$ be the genome after $k$ events.
- For every $k > 0$, compute $E[D_{BP}(G_k, G_0)]$, the expected number of breakpoints after $k$ events.
- Return the $k$ that minimizes $|E[D_{BP}(G_k, G_0)] - D_{BP}(G, G')|$.

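The inversion step can be illustrated with a rough sketch. Exact IEBP computes the expectation exactly; the code below swaps in a Monte Carlo estimate under inversions only on a linear signed genome, purely to show how the expected-breakpoint curve is inverted. The function names and parameters are hypothetical.

```python
import random

def breakpoints(g):
    """Breakpoints of the signed linear genome g relative to the identity 1..n."""
    framed = [0] + g + [len(g) + 1]
    return sum(b - a != 1 for a, b in zip(framed, framed[1:]))

def expected_breakpoints(n, max_k, trials, seed=0):
    """Simulated estimate of E[D_BP(G_k, G_0)] for k = 0..max_k (inversions only)."""
    rng = random.Random(seed)
    totals = [0] * (max_k + 1)
    for _ in range(trials):
        g = list(range(1, n + 1))
        for k in range(1, max_k + 1):
            i, j = sorted(rng.sample(range(n + 1), 2))
            g = g[:i] + [-x for x in reversed(g[i:j])] + g[j:]
            totals[k] += breakpoints(g)
    return [t / trials for t in totals]

def closest_k(expected, observed):
    """Return the k that minimizes |expected[k] - observed|."""
    return min(range(len(expected)), key=lambda k: abs(expected[k] - observed))

curve = expected_breakpoints(n=120, max_k=150, trials=20)
print(closest_k(curve, observed=60))
```

Because the expected number of breakpoints grows more slowly than the number of events, the returned estimate is typically larger than half the observed breakpoint distance, which is the correction the edit-distance measures lack.
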
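The Counting Lemma

$$
\iota_n(u, v) =
\begin{cases}
\min\{|u| - 1,\ |v| - 1,\ n + 1 - |u|,\ n + 1 - |v|\} & \text{if } uv < 0 \\
0 & \text{if } u \neq v,\ uv > 0 \\
\binom{|u| - 1}{2} + \binom{n + 1 - |u|}{2} & \text{if } u = v
\end{cases}
$$

$$
\tau_n(u, v) =
\begin{cases}
0 & \text{if } uv < 0 \\
(\min\{|u|, |v|\} - 1)(n + 1 - \max\{|u|, |v|\}) & \text{if } u \neq v,\ uv > 0 \\
\binom{n + 1 - |u|}{3} + \binom{|u| - 1}{3} & \text{if } u = v
\end{cases}
$$

$$
\nu_n(u, v) =
\begin{cases}
(n - 2)\,\iota_n(u, v) & \text{if } uv < 0 \\
\tau_n(u, v) & \text{if } u \neq v,\ uv > 0 \\
3\,\tau_n(u, v) & \text{if } u = v
\end{cases}
$$
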
Goodness of Fit of Distance Estimators

Inversion only on 120 genes.

[Figure: plots of the actual number of events (vertical axis, 0 to 300) against the measured distance (horizontal axis, 0 to 300). First slide: inversion distance, breakpoint distance, Exact-IEBP distance, and the ideal estimator. Second slide: IEBP distance, EDE distance, Exact-IEBP distance, and the ideal estimator.]

Absolute Error of Distance Estimators

[Figure: absolute difference (vertical axis, 0 to 300) against the actual number of events (horizontal axis, 0 to 300) for BP, INV, IEBP, EDE, and Exact IEBP. Three plots: inversion only; transpositions only; all three classes equiprobable.]

Accuracy of Neighbor Joining

120 genes; 10/20/40/80/160 genomes.

[Figure: false negative rate in % (vertical axis, 0 to 70) against the normalized maximum pairwise inversion distance (horizontal axis, 0 to 1) for NJ(BP), NJ(INV), NJ(IEBP), NJ(EDE), and NJ(Exact IEBP). Two plots: inversion only; equiprobable events.]

Robustness of Exact-IEBP

120 genes; 10/20/40/80/160 genomes.

[Figure: false negative rate in % (vertical axis, 0 to 70) against the normalized maximum pairwise inversion distance (horizontal axis, 0 to 1) for NJ(Exact IEBP(0,0)), NJ(Exact IEBP(1,0)), and NJ(Exact IEBP(1/3,1/3)). Two plots: inversion only; equiprobable events.]

Convergence Rate

- A method is statistically consistent for a given model if, given long enough data sequences, it recovers the true tree with high probability.
- Problem: "long enough" sequences may not exist in nature.
- Solution: a method is fast-converging for a given model if, given sequences of polynomial length, it recovers the true tree with high probability.
- Problem: the model conditions may not hold.
- Solution: a method is absolute fast-converging if, given sequences of polynomial length, it recovers the true tree with high probability.

Known Fast-Converging Methods

- The short-quartet methods [Warnow et al.]: absolute fast-converging
- The disk-covering methods (DCM) [Warnow et al.]: absolute fast-converging
- The harmonic greedy triplet method [Kao et al.]
- The method of Cryan, Goldberg, and Goldberg
- DCM-boosted neighbor-joining [Warnow et al.]

New Results [Warnow, Moret, St. John SODA 01]

- New absolute fast-converging method: weighted witness-antiwitness method (WIGWAM)
- Decision procedure to turn fast-converging methods into absolute fast-converging methods: short-quartet support (SQS)
- Boosting method (DCM plus SQS) to turn many methods with exponential convergence (e.g., neighbor-joining) into absolute fast-converging ones
- Generalizations to families of boosting methods with same properties, but experimental behavior

What is a Quartet?

A quartet is an unrooted binary tree on four taxa, the smallest tree that induces a nontrivial bipartition.

[Figure: the three quartets on taxa a, b, c, d: {ab|cd}, {ac|bd}, and {ad|bc}.]

A quartet {ab|cd} agrees with a tree T if the subtree induced in T by the four taxa is the quartet itself.

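Agreement can be tested without building the induced subtree: {ab|cd} is the quartet induced by a tree exactly when the path from a to b and the path from c to d share no vertex. The sketch below uses that property, with the tree stored as an adjacency dictionary; the representation, the function names, and the example tree are assumptions made for illustration.

```python
def path(adj, src, dst):
    """Set of vertices on the unique path from src to dst in a tree."""
    parent = {src: None}
    stack = [src]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                stack.append(nxt)
    vertices = set()
    node = dst
    while node is not None:
        vertices.add(node)
        node = parent[node]
    return vertices

def agrees(adj, a, b, c, d):
    """True if the quartet {ab|cd} is the one induced by the tree on a, b, c, d."""
    return path(adj, a, b).isdisjoint(path(adj, c, d))

# Hypothetical five-taxon tree ((a,b),c,(d,e)) with internal nodes x, y, z.
tree = {
    "a": ["x"], "b": ["x"], "c": ["y"], "d": ["z"], "e": ["z"],
    "x": ["a", "b", "y"], "y": ["x", "c", "z"], "z": ["y", "d", "e"],
}
print(agrees(tree, "a", "b", "c", "d"))  # True
print(agrees(tree, "a", "c", "b", "d"))  # False
```
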
Fast Convergence: Decision Problem

TRUE TREE SELECTION PROBLEM
- Input: a set $S$ of sequences over {A, C, T, G} generated on an unknown tree $(T, M)$, and a collection $\mathcal{T} = \{T_1, T_2, \ldots, T_p\}$ of phylogenies on $S$.
- Output: the true tree $T$ if $T$ is in $\mathcal{T}$.

Quartet Support

- Let $T$ be a fixed tree leaf-labelled by the set $S$.
- Let $Q$ be a fixed set of quartets on $S$.
- Let $D$ be the distance matrix on $S$.

The support of $T$ with respect to $Q$ is

$$
\max\{\, l : (q \in Q \text{ and } \mathrm{diam}_D(q) \le l) \Rightarrow q \in Q(T) \,\}
$$

Short Quartet Support

PROCEDURE SQS($\mathcal{T}$, $S$)
- For each set of four taxa from $S$, compute the neighbor-joining quartet $q$; let $Q$ be the set of all such quartets.
- Return the $T_i$ such that the support $s(T_i, Q)$ is maximum; if more than one such tree exists, return the one with the smallest index $i$.

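The procedure can be sketched as follows, under several stated assumptions. Each candidate tree is represented by its set of induced quartets; on four taxa, neighbor-joining selects the same split as the four-point rule used here (the pairing with the smallest distance sum); the diameter of a quartet is taken as the largest pairwise distance among its four taxa; and the support is restricted to quartet diameters that actually occur, with 0 when even the shortest quartet disagrees. All names and the example data are hypothetical.

```python
from itertools import combinations

def matrix(pairs):
    """Build a symmetric dict-of-dicts distance matrix from {(x, y): value}."""
    D = {}
    for (x, y), v in pairs.items():
        D.setdefault(x, {})[y] = v
        D.setdefault(y, {})[x] = v
    return D

def four_point(D, taxa):
    """Quartet on four taxa whose pairing has the smallest distance sum."""
    a, b, c, d = taxa
    splits = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    left, right = min(splits, key=lambda s: D[s[0][0]][s[0][1]] + D[s[1][0]][s[1][1]])
    return frozenset({frozenset(left), frozenset(right)})

def quartets(D, S):
    """One quartet for every set of four taxa from S."""
    return {four_point(D, t) for t in combinations(sorted(S), 4)}

def diameter(q, D):
    """Largest pairwise distance among the four taxa of quartet q."""
    taxa = [x for side in q for x in side]
    return max(D[x][y] for x, y in combinations(taxa, 2))

def support(tree_quartets, Q, D):
    """Largest diameter l such that every quartet of Q with diameter <= l is in the tree."""
    bad = [diameter(q, D) for q in Q if q not in tree_quartets]
    limit = min(bad) if bad else float("inf")
    return max((diameter(q, D) for q in Q if diameter(q, D) < limit), default=0)

def sqs(candidates, D, S):
    """Index of the candidate with maximum support; ties go to the smallest index."""
    Q = quartets(D, S)
    scores = [support(tq, Q, D) for tq in candidates]
    return scores.index(max(scores))

S = ["a", "b", "c", "d", "e"]
# Path lengths (unit edges) on the tree ((a,b),c,(d,e)), used as the observed distances.
D = matrix({("a", "b"): 2, ("a", "c"): 3, ("a", "d"): 4, ("a", "e"): 4, ("b", "c"): 3,
            ("b", "d"): 4, ("b", "e"): 4, ("c", "d"): 3, ("c", "e"): 3, ("d", "e"): 2})
# Path lengths on the competing tree ((a,c),b,(d,e)).
D_alt = matrix({("a", "c"): 2, ("a", "b"): 3, ("a", "d"): 4, ("a", "e"): 4, ("b", "c"): 3,
                ("c", "d"): 4, ("c", "e"): 4, ("b", "d"): 3, ("b", "e"): 3, ("d", "e"): 2})
candidates = [quartets(D_alt, S), quartets(D, S)]
print(sqs(candidates, D, S))  # 1: the second candidate matches the observed distances
```

Deriving each candidate's quartet set from its own path-length matrix keeps the sketch short; any other way of listing the quartets induced by a tree would serve equally well.
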
SQS Theorem

For all $\varepsilon > 0$, there is a polynomial $p$ such that, for all $(T, M)$ in the model, on a set $S$ of $n$ sequences generated at random on $T$ with length at least $p(n)$, we have

$$
\Pr[\mathrm{SQS}(\mathcal{T}, S) = T] > 1 - \varepsilon
$$

whenever $T$ is in $\mathcal{T}$.

GRAPPA News: More Speed!

- Current release (1.03) runs from 2,000 to 10,000 times faster than the original tool, while also giving more capabilities.
- Research version (1.1) runs from 10,000 to 500,000 times faster than the original tool, thanks to much better bounding.
- The 13-genome Campanulaceae dataset now takes a few hours on a laptop instead of a few centuries on a large workstation.
- Speedup on Los Lobos is over 200,000,000!

Other Recent Results

- New sequence encodings for gene orders to enable classical parsimony searches.
- Combinations of fast-converging boosters with new encodings (i.e., use a new encoding and run a DCM+SQS booster on a classical parsimony optimizer): best accuracy to date.
- Combinations of fast-converging boosters with new encodings and fast heuristics (e.g., neighbor-joining): best speed/accuracy tradeoff to date.
- New results on computing inversion distances, inversion medians, etc.