Phylogenetic Reconstruction from Gene-Order Data


Bernard M.E. Moret
compbio.unm.edu
Department of Computer Science, University of New Mexico

Acknowledgments
Close collaborators:
- at UNM: David Bader
- at UT Austin: Tandy Warnow, Robert Jansen, Randy Linder
Postdocs and students:
- Postdocs: Tanya Berger-Wolf, Tiffani Williams
- Students: Diana Aranda, Zak Betz, Joel Earnest-DeYoung, Tao Liu, Mark Marron, Jijun Tang, Krister Swenson, Anna Tholse, Laura Waymire
Support:
- National Science Foundation
- Alfred P. Sloan Foundation
- IBM Corporation

Overview
- Phylogenies and Phylogenetic Data
- Gene-Order Data vs. Sequence Data
- Phylogenetic Reconstruction
- Computing with Gene-Order Data
- Reconstruction from Gene-Order Data
- Experimentation in Phylogeny
- Some Experimental Results
- Open Problems and Conclusion

Phylogenies
A phylogeny is a reconstruction of the evolutionary history of a collection of organisms. It usually takes the form of a tree.
- Modern organisms are placed at the leaves.
- Edges denote evolutionary relationships.
- Species correspond to edge-disjoint paths.

12 Species of Campanulaceae
[Figure: phylogenetic tree with branch lengths on Wahlenbergia, Merciera, Trachelium, Symphyandra, Campanula, Adenophora, Legousia, Asyneuma, Triodanus, Codonopsis, Cyananthus, Platycodon, and Tobacco.]

Herpes Viruses that Affect Humans
[Figure: phylogenetic tree on HVS, EHV2, KHSV, EBV, HSV1, HSV2, PRV, EHV1, HHV6, VZV, HHV7, and HCMV.]

The Tree of Life
[Figure: tree with three domains, Archaea, Bacteria, and Eukarya. Labeled groups: Thermoproteus, Methanosarcina, Methanobacterium, Methanococcus, Halophiles, Thermoplasma, Pyrodictium, Thermococcus, Aquifex, Deinococci, Chlamydiae, Gram-positive bacteria, Purple bacteria, Thermotogales, Spirochetes, Flavobacteria, Cyanobacteria, Microsporidia, Flagellates, Entamoebae, Ciliates, Plants, Diplomonads, Trichomonads, Slime molds, Fungi, Animals.]

Phylogenetic Data
- All kinds of data have been used: behavioral, morphological, metabolic, etc.
- The predominant choice is molecular data.
- Two main kinds of molecular data:
  - sequence data (DNA sequence on genes)
  - gene-order data (gene sequence on chromosomes)
- Data elements that can assume new values (according to the model) independently of others are called characters.

Sequence Data
- Typically the DNA sequence of a few genes.
- Characters are individual positions in the string and can assume four states.
- Evolves through point mutations, insertions (incl. duplications), and deletions.
- Find homologous genes across all organisms.
- Align gene sequences for the entire set (to identify gaps, that is, insertions and deletions, and point mutations).
- Decide whether to use a single gene for each analysis or to combine the data.
- Lengths are limited by the size of genes (typically several hundred base pairs).

Sequence Data: Illustration
[Figure: tree whose nodes are labeled with DNA sequences: AAGACTT, AAGGCCT, TGGACTT, AGGGCAT, TAGCCCT, AGCACTT, AGGCAT, TAGCCCA, TAGACTT, TGAACTT, AGCACAA, AGCGCTT.]

Sequence Data: Attributes
Advantages:
- Large amounts of data available.
- Accepted models of sequence evolution.
- Models and objective functions provide a reasonable computational framework.
Problems:
- Sequencing errors (down to 1%).
- Fast evolution restricts use to a few million years.
- Gene evolution need not be identical to organism evolution.
- Multiple alignments are not well solved.
- Reconstruction methods do not scale well (in terms of accuracy and running time).

Gene-Order Data
- The ordered sequence of genes on one or more chromosomes.
- The entire gene order is a single character, which can assume a huge number of states.
- Evolves through inversions, insertions (incl. duplications), and deletions; also transpositions (in mitochondria) and translocations (between chromosomes).
- Identify homologous genes, including duplications.
- Refine the rearrangement model for the collection of organisms (e.g., handle bacterial operons or eukaryotic exons explicitly).

Gene-Order: Guillardia Chloroplast
[Figure not transcribed.]

Gene-Order Data: Rearrangements
[Figure: a circular genome on genes 1 to 8 and the genomes obtained from it by an inversion, a transposition, and an inverted transposition.]
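The three rearrangement events named on this slide can be sketched on signed gene orders as follows. This is only an illustration, assuming a genome is stored as a Python list of signed integers read linearly; the function names and the half-open index convention are choices made for this sketch, not taken from the talk.

```python
def inversion(genome, i, j):
    """Reverse the block genome[i:j] and flip the sign of every gene in it."""
    flipped = [-g for g in reversed(genome[i:j])]
    return genome[:i] + flipped + genome[j:]


def transposition(genome, i, j, k):
    """Exchange the adjacent blocks genome[i:j] and genome[j:k] (i < j < k)."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]


def inverted_transposition(genome, i, j, k):
    """Move the block genome[i:j] past genome[j:k] and invert it on the way."""
    flipped = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + flipped + genome[k:]


g = [1, 2, 3, 4, 5, 6, 7, 8]
print(inversion(g, 2, 5))                  # [1, 2, -5, -4, -3, 6, 7, 8]
print(transposition(g, 1, 3, 6))           # [1, 4, 5, 6, 2, 3, 7, 8]
print(inverted_transposition(g, 1, 3, 6))  # [1, 4, 5, 6, -3, -2, 7, 8]
```

A circular genome could be handled the same way after rotating the list so that the affected block does not wrap around the end.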

Gene-Order Data: Inversion-Only
[Figure: tree of circular genomes on genes 1 to 12 evolving by inversions only; edges carry labels such as (3,5), (11,2), (6,9), (4,7), (8,9), (1,3), (7,8), and (4,9).]

Gene-Order Data: Attributes
Advantages:
- Low error rate (depends on recognizing homologies).
- No gene tree/species tree problem.
- Evolutionary events are rare and unlikely to cause "silent" changes, so the data can reach back hundreds of millions of years.
Problems:
- Mathematics much more complex than for sequence data.
- Models of evolution not well characterized.
- Very limited data (mostly organelles).
- Possibly insufficient discrimination among recently evolved organisms.

Existing Gene-Order Data
- Mitochondria for plants and animals (about 200):
  - single circular chromosome with 40 genes
  - mostly transpositions and insertions/deletions
- Chloroplasts for plants and algae (about 100):
  - single circular chromosome with 150 genes
  - mostly inversions and insertions/deletions
  - includes a single large inverted repeat
- Bacterial genomes (about 50):
  - from a few hundred to several thousand genes
  - complex evolution, many lateral transfers
- Also 9 nuclear genomes and 100 phages. Nuclear genomes can be very large and far more complex than organellar or bacterial genomes. Phages are viruses with very simple genomes but complex evolutionary events.

Gene-Order Data vs. Sequence Data

                 Sequence       Gene-Order
evolution        fast           slow
errors           significant    negligible
data type        a few genes    whole genome
data quantity    abundant       sparse
models           good           primitive
computation      easy           hard

Phylogenetic Reconstruction
Three categories of methods:
- Distance-based methods, such as neighbor-joining.
- Parsimony-based methods (such as implemented in PAUP*, Phylip, Mega, TNT, etc.).
- Likelihood-based methods (including Bayesian methods, such as implemented in PAUP*, Phylip, FastDNAML, MrBayes, GAML, etc.).
In addition:
- Meta-methods (quartet-based methods, disk-covering method) decompose the data into smaller subsets, construct trees on those subsets, and use the resulting trees to build a tree for the entire dataset.

Evolutionary Distances
- True evolutionary distance: the actual number of permitted evolutionary events that took place to transform one datum into the other.
- Edit distance: the minimum number of permitted evolutionary events that can transform one datum into the other.
- Expected true evolutionary distance: obtained from the edit distance by correcting for the known (from the model or from experiments) statistical relationship between true and edit distances.

Distance-Based Methods
- Use edit or expected true evolutionary distances.
- Usually run in low polynomial time.
- Reconstruct only topologies: no ancestral data.
- The prototype is Neighbor-Joining (NJ); BioNJ and Weighbor are two improvements.
- NJ is optimal on additive distances (where the distance along a path in the true tree equals the pairwise distance in the matrix).
- NJ is statistically consistent (produces the true tree with probability 1 as the sequence length goes to infinity).

Parsimony-Based Methods
- Aim to minimize the total number of character changes (which can be weighted to reflect statistical evidence).
- Assume that characters are independent.
- Reconstruct ancestral data.
- Are known not to be statistically consistent with sequence data (but examples are fairly contrived).
- Finding the most parsimonious tree is computationally very expensive (NP-hard).
- Optimal solutions are limited to sizes around 30; heuristic solutions appear fairly good to sizes of 500.

Likelihood-Based Methods
- Are based on a specific model of evolution and must estimate all model parameters.
- Produce a likelihood estimate (prior or posterior conditional) for each tree.
- Are statistically consistent.
- Reconstruct only topologies.
- Are prone to numerical problems: the likelihood of a typical tree on 20 items is around 10^-21; on 50 items, around 10^-75; and so on.
- Are presumably NP-hard; even scoring one tree is very expensive.
- Optimal solutions are limited to sizes below 10; heuristic solutions appear fairly good to sizes of 100.

Meta-Methods
General principle: decompose the dataset into smaller, overlapping subsets, reconstruct trees for the subsets (by some base method), and combine the results into a tree for the entire dataset.
- Quartet-based methods: use all possible smallest subsets (a quartet is a set of 4 genomes); include Q*, Tree-Puzzle, Quartet-Cleaning. Slow and inherently inaccurate regardless of base method; consistently surpassed by NJ.
- Disk-covering method (DCM): set up a graph from the distance matrix, find cliques in the graph, use the cliques for decomposition. High-powered machinery; succeeds very well, especially when the tree is imbalanced.

Computing with Gene-Order Data
- Distances
- Evolutionary models and distance corrections
- Reconstructing ancestral genomes
- The median problem

Distances for Gene-Order Data
- Breakpoint [Sankoff et al., 1998]: distance counting the number of altered adjacencies for identical gene content; linear time.
- INV [Bader, Moret, Yan, 2001]: edit distance (inversions) for identical gene content; linear time.
- INV-DEL [El-Mabrouk, 2000]: edit distance (inversions and insertions/deletions, but no duplications); doable in linear time [Liu, Moret, 2003].
- ALL [Marron, Swenson, Moret, 2003]: estimated edit distance (inversions, insertions/deletions, duplications).
- CLUSTER [Bergeron, Stoye, 2003]: estimate based on the number and lengths of conserved gene clusters (no duplications).

Breakpoint Distance
The number of adjacencies present in one genome, but not the other.
G1 = (1 2 3 4 5 6 7 8)
G2 = (1 2 5 4 3 6 7 8)
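The definition can be checked on the slide's two genomes with a few lines of Python. This is a minimal sketch that assumes unsigned, linear gene orders with identical gene content, as the example is written; an adjacency is treated as an unordered pair of neighboring genes, and the function names are illustrative only.

```python
def adjacencies(genome):
    """Set of unordered pairs of neighboring genes in a linear gene order."""
    return {frozenset(pair) for pair in zip(genome, genome[1:])}


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 that do not occur in g2."""
    return len(adjacencies(g1) - adjacencies(g2))


G1 = [1, 2, 3, 4, 5, 6, 7, 8]
G2 = [1, 2, 5, 4, 3, 6, 7, 8]
print(breakpoint_distance(G1, G2))  # 2: adjacencies {2,3} and {5,6} are lost
```

For signed genomes the adjacency (a, b) would also have to match (-b, -a), which this unsigned sketch does not model.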

Inversion Distance
Given two signed gene orders of equal content, compute the inversion-only edit distance.
- The problem is NP-hard for unsigned permutations.
- Finding the shortest sequence of inversions can be done in polynomial time (Hannenhalli and Pevzner; a very difficult result).
- Running time improved by Berman and Hannenhalli, then by Kaplan, Shamir, and Tarjan: O(dn), where d is the distance and n the number of genes.
- Theory simplified (with a new algorithm, faster in practice) by Bergeron.
- Distance computation reduced to linear time by Bader, Moret, and Yan.

Inversion Distance
- The algorithm is based on the breakpoint graph.
- Assume one permutation is the identity; each green edge is a single gene.
- Solid red edges denote existing adjacencies and dashed red edges denote desired adjacencies.
[Figure: breakpoint graph of an example signed permutation.]
Inversion distance is n - #cycles + #hurdles + (fortress)
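The cycle term of this formula is the easiest part to compute, and on its own it gives a lower bound on the inversion distance. The sketch below counts the alternating cycles of the breakpoint graph of a signed permutation against the identity. It assumes the usual framing in which each gene +x becomes the pair (2x-1, 2x), each gene -x becomes (2x, 2x-1), and the ends are padded with 0 and 2n+1; under that convention the bound reads (n + 1) - #cycles, and the slide's n may follow a different counting convention. Hurdles and fortresses are not handled, and all names are illustrative.

```python
def breakpoint_graph_cycles(perm):
    """Count alternating cycles in the breakpoint graph of perm vs. the identity."""
    n = len(perm)
    # Unsigned, framed image of the signed permutation.
    ext = [0]
    for g in perm:
        x = abs(g)
        ext += [2 * x - 1, 2 * x] if g > 0 else [2 * x, 2 * x - 1]
    ext.append(2 * n + 1)

    # Existing adjacencies: consecutive pairs (ext[0], ext[1]), (ext[2], ext[3]), ...
    existing = {}
    for i in range(0, len(ext), 2):
        a, b = ext[i], ext[i + 1]
        existing[a], existing[b] = b, a

    def desired(v):
        # Desired adjacencies of the identity pair 2k with 2k + 1.
        return v + 1 if v % 2 == 0 else v - 1

    seen, cycles = set(), 0
    for start in existing:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:        # follow existing edge, then desired edge
            seen.add(v)
            w = existing[v]
            seen.add(w)
            v = desired(w)
    return cycles


def inversion_lower_bound(perm):
    """(n + 1) - #cycles, a lower bound that ignores hurdles and fortresses."""
    return len(perm) + 1 - breakpoint_graph_cycles(perm)


print(inversion_lower_bound([1, -2, 3]))  # 1: a single inversion of gene 2
```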

Gene-Order Distances in General
The signed gene orders may include duplications and need not have identical gene content.
- Extension to translocations (transpositions between chromosomes) by Hannenhalli and Pevzner. Implemented by Tesler as GRIMM.
- Extension to inversions and deletions (no duplications allowed) by El-Mabrouk. Linear-time distance computation by Liu and Moret.
- Heuristic for duplications by Sankoff (exemplars).
- Heuristic for unequal gene content by Bourque.
- Bounded approximation by Marron, Swenson, and Moret.

Evolutionary Models for Gene Order
- Strong biological evidence for inversions in chloroplasts.
- Strong computational evidence for transpositions in mitochondria.
- Computational evidence (Sankoff, El-Mabrouk) for short inversions in bacterial genomes (preserve gene clusters).
- Hard radiation plus repair mechanisms could create more complex rearrangements.
- Respective probabilities unknown.

Distance Corrections for Gene Order
Assume a distribution of events and compute the relationship between the number of events and the edit distance.
- Wang and Warnow gave an exact derivation for correcting the breakpoint distance into an expected number of inversions (IEBP).
- Moret, Tang, Wang, and Warnow gave an empirical derivation for correcting the inversion edit distance into an expected number of inversions (EDE).
- The EDE correction substantially improves the performance of both distance-based and parsimony-based reconstruction methods.

EDE Distance Correction
[Figure: three plots of the actual number of events (vertical axis, 0 to 200) against breakpoint distance, inversion distance, and EDE distance (horizontal axes, 0 to 200).]
Vertical axis is the actual number of inversions. Note the increased standard deviation of EDE at larger distances.

Reconstructing Ancestral Genomes
- Goal: reconstruct a signed gene order at each internal node in the tree to minimize the sum of edge distances.
- The problem is NP-hard even for just three leaves and the simplest of distances (breakpoint, plain inversion)!
- This is the median problem for signed genomes: given three genomes, produce a new genome that minimizes the sum of the distances from it to the three given genomes.

Median Problem for Breakpoints
Sankoff showed how to convert this problem to the Travelling Salesperson Problem when all three genomes have identical gene content.
- Adjacency A B becomes an edge from A to B.
- The cost of an edge A B is the number of genomes that do NOT have the adjacency A B.
[Figure: example TSP instance built from three signed genomes on genes 1 to 4, showing edges of cost 0, 1, and 2 (edges not shown have cost 3) and an optimal solution with its corresponding genome.]

Median Problem for Inversions
- No simple formulation in terms of a standard optimization problem.
- Exact solutions given by Siepel and Moret and by Caprara for identical gene content; they work well for distances to the median of 0 to 15 inversions.
- Various heuristics proposed by Bourque and Pevzner and others.
- Siepel and Moret showed that the inversion median is preferable to the breakpoint median (fewer ties).

Median with Deletions/Duplications
Tang and Moret showed it can be solved exactly for small numbers of deletions and duplications.
- Assume no change is reversed and that changes are independent and of low probability.
- The gene content of the median can then be determined by preferring one event (e.g., one insertion) over two concurrent events (e.g., two deletions).
- Knowing the gene content, all combinations of duplications or insertion locations can be examined; the number of choices grows exponentially, but is manageable for organellar genomes.
- Accuracy is very good (less than 5% error).

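Gene Content of Median
Assumptions:
- Probability of a deletion is ε (very small).
- Probability of no change is 1.
Example: three genomes with gene contents {1,2,3,4}, {1,2,4}, and {1,2,4}.
[Figure: two candidate medians. With median content {1,2,3,4}, two edges each carry an event of probability ε, so p = ε^2. With median content {1,2,4}, a single edge carries an event, so p = ε.]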

Reconstruction from Gene-Order Data
- Distance methods
  - NJ and Weighbor with corrected distances
  - Combining with a DCM booster
- Parsimony-based methods
  - Encoding approaches: MPBE, MPME
  - Direct approaches: BPAnalysis, GRAPPA, MGR, DCM-GRAPPA
- Likelihood-based methods

Distance Methods
Neighbor-joining and Weighbor using BP, INV, IEBP, and EDE distances.
Moret, Wang, and Warnow showed that NJ-EDE and Weighbor-EDE are clearly better than other combinations:
- Low error rates up to a few hundred genomes.
- Robust against various models of genome rearrangements.
- Suffer when the tree diameter is large (some large pairwise distances).
- Improved by using DCM boosting.

The DCM Boosting Approach
A family of divide-and-conquer methods (Warnow and colleagues).
1. Compute the distance matrix.
2. Threshold the distance matrix (eliminate entries above the threshold).
3. Create the corresponding graph and triangulate it.
4. Find maximum cliques (polynomial time in triangulated graphs).
5. Create disks (data subsets) with prescribed overlap.
6. Invoke the phylogenetic base method on each disk.
7. Assemble the final tree from the disk trees through consensus and refinement.

DCM Methods
Two varieties so far (a third in development):
- DCM-1: each clique is a disk, so the disk diameter is minimized.
- DCM-2: compute a graph separator; the separator plus a graph component is a disk, so all disks share the same subset.

DCM-Boosted Distance Methods
- DCM-1-NJ (with an MP last step) beats NJ and greedy MP on sequence data and is robust against size, rate, and other model variations (Moret/Nakhleh/Roshan/Wang/Warnow).
- DCM-1-GRAPPA scales gracefully from the limit of 15 genomes for GRAPPA to at least 1,000 genomes (Tang and Moret).
- DCM-3-MP does better than PAUP* MP or ratchet MP on large real datasets (new results, Warnow lab).

Encoding Methods: MPBE
Idea: encode the signed permutation into a sequence, then use sequence methods.
MPBE: Maximum Parsimony on Breakpoint Encodings (Cosner, Jansen, Moret, Raubeson, Wang, Warnow, and Wyman)
- The new characters are all gene adjacencies present in the data.
- If m such adjacencies are present, then produce strings of m binary characters: index the adjacencies arbitrarily and, for each genome, code the presence or absence of that adjacency with a 1 or 0.
- Use an MP method to reconstruct a tree.
- Drawbacks: slower than NJ-EDE or Weighbor-EDE; cannot describe new adjacencies; not all binary strings are valid codes.
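The breakpoint encoding described above can be sketched as follows. This is an illustration only, assuming circular signed genomes stored as Python lists and treating the adjacency (a, b) as identical to (-b, -a); the canonical form and the sorted ordering of adjacencies are arbitrary choices of this sketch (the slide only requires some fixed indexing).

```python
def signed_adjacencies(genome):
    """Adjacencies of a circular signed genome, each in a canonical form."""
    n = len(genome)
    adj = set()
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]
        # (a, b) read forward is the same adjacency as (-b, -a) read backward.
        adj.add(max((a, b), (-b, -a)))
    return adj


def mpbe_encode(genomes):
    """Return the indexed adjacencies and one binary string per genome."""
    per_genome = [signed_adjacencies(g) for g in genomes]
    all_adj = sorted(set().union(*per_genome))
    codes = ["".join("1" if a in adj else "0" for a in all_adj)
             for adj in per_genome]
    return all_adj, codes


adjs, codes = mpbe_encode([[1, 2, 3, 4], [1, -3, -2, 4]])
print(adjs)   # [(-2, 4), (1, 2), (2, 3), (3, -1), (3, 4), (4, 1)]
print(codes)  # ['011011', '101101']
```

The resulting binary strings could then be passed to any maximum-parsimony program that accepts binary characters.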

Encoding Methods: MPME
Idea: preserve more information than MPBE.
MPME: Maximum Parsimony on Multistate Encodings (Wang, Jansen, Moret, Raubeson, and Warnow)
- The new characters correspond to possible signed genes, so there are 2n characters; each character represents a signed gene and so can assume one of 2n states.
- Site i in the sequence takes the value of the gene immediately following gene i in the genome (order is reversed for negative signs).
- Example: (1,-4,-3,-2) -> (-4,3,4,-1,2,1,-2,-3)
- Clearly better than MPBE, but slower, and the number of character states quickly exceeds the limits of popular MP software.
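The multistate encoding can be reproduced on the slide's example with a short sketch. It assumes a circular signed genome and orders the 2n sites as +1, ..., +n, -1, ..., -n; that site ordering is inferred from the example above rather than stated in the talk, and the function name is illustrative.

```python
def mpme_encode(genome):
    """Multistate encoding of a circular signed genome on genes 1..n."""
    n = len(genome)
    follower = {}
    for i, g in enumerate(genome):
        nxt = genome[(i + 1) % n]
        follower[g] = nxt      # reading forward, nxt follows g
        follower[-nxt] = -g    # reading the reverse strand, -g follows -nxt
    sites = list(range(1, n + 1)) + [-x for x in range(1, n + 1)]
    return [follower[s] for s in sites]


print(mpme_encode([1, -4, -3, -2]))  # [-4, 3, 4, -1, 2, 1, -2, -3]
```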

Direct Approaches: BPAnalysis
(due to Sankoff and Blanchette)

Initially label all internal nodes with gene orders
Repeat
    For each internal node v, with neighbors A, B, and C, do
        Solve the MPB (median problem for breakpoints) on A, B, C to yield label m
        If relabelling v with m improves the tree score, then do it
until no internal node can be relabelled

The Number of Trees on N Genomes
- 3 genomes: 1 tree
- 4 genomes: 3 trees
- 5 genomes: 15 trees
- 13 genomes: about 13.7 billion trees
- n genomes: (2n-5)!! trees, where (2n-5)!! = (2n-5)*(2n-7)*...*5*3
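The double-factorial count is easy to evaluate directly, which makes the growth rate concrete. A minimal sketch (the function name is illustrative):

```python
def num_unrooted_trees(n):
    """(2n-5)!! = 3 * 5 * ... * (2n-5), the number of unrooted binary trees on n leaves."""
    count = 1
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


for n in (3, 4, 5, 13):
    print(n, num_unrooted_trees(n))
# 3 1
# 4 3
# 5 15
# 13 13749310575
```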

GRAPPA
Genome Rearrangements Analysis under Parsimony & other Phylogenetic Algorithms
- Began as a reimplementation of BPAnalysis.
- The current version runs up to one billion times faster than BPAnalysis, thanks to algorithm engineering.
- Limit: every added taxon multiplies the running time by twice the number of taxa. So 13 taxa take 20 mins, 15 taxa two weeks, 16 taxa a year, 20 taxa over 2 million years, and...
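The scaling rule can be turned into a rough extrapolation. The sketch below takes the slide's rule literally (adding the n-th taxon multiplies the running time by 2n) and starts from the quoted 20 minutes for 13 taxa; it is back-of-the-envelope arithmetic under those assumptions, not a measurement, and it lands in the same range as the figures quoted above.

```python
def extrapolated_minutes(n_taxa, base_taxa=13, base_minutes=20.0):
    """Running time for n_taxa if the n-th taxon multiplies the time by 2n."""
    minutes = base_minutes
    for n in range(base_taxa + 1, n_taxa + 1):
        minutes *= 2 * n
    return minutes


MINUTES_PER_DAY = 60 * 24
MINUTES_PER_YEAR = MINUTES_PER_DAY * 365

print(extrapolated_minutes(15) / MINUTES_PER_DAY)   # days for 15 taxa
print(extrapolated_minutes(16) / MINUTES_PER_DAY)   # days for 16 taxa
print(extrapolated_minutes(20) / MINUTES_PER_YEAR)  # years for 20 taxa
```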

Algorithm Engineering: Why?
- Research tools: run many experiments to test hypotheses and for discovery.
- Production tools: obviously, save on resources.
- Make it possible: a million-fold speedup allows one to solve in a week what would otherwise have been infeasible (requiring millennia).

Algorithm Engineering: What?
- good asymptotic performance
- low proportionality constants
- fast running times on important real-world datasets
- robust performance across a variety of data
- robust performance across a variety of platforms
- scalability to faster platforms and larger datasets

Algorithm Engineering: How?
- Optimize low-level data structures (multiple parallel arrays, maintenance vs. recomputation).
- Optimize low-level algorithmic details (e.g., upper and lower bounds).
- Optimize low-level coding (hand-unrolling loops, keeping local variables in registers).
- Reduce the memory footprint (the entire code might fit in cache).
- Maximize locality of reference (the memory hierarchy can cause differentials of 100:1).
- Repeatedly rebalance the execution profile (eliminate bottlenecks).

MGR
Multichromosomal Genome Rearrangement (Bourque and Pevzner)
- Uses the median approach, but does not solve it exactly.
- Can handle multiple chromosomes (translocations) using Hannenhalli and Pevzner's results.
- Scaling unknown: published results to date are limited to small datasets.
- Accuracy second only to GRAPPA.

DCM-GRAPPA
Our extension to GRAPPA to scale it to large datasets (Tang and Moret).
- Scales gracefully to at least 1,000 genomes (less than 2 days of computation).
- Retains the accuracy of GRAPPA: error rates on 1,000-genome datasets are consistently below 3%.
- Uses the DCM-1 approach; may do even better with the forthcoming DCM-3.

DCM-GRAPPA: Details

Compute pairwise distances
Check all possible threshold values
For each threshold value
    Discard values above threshold
    Create graph from reduced distance matrix
    Triangulate the graph
    Find maximum cliques (disks) in the graph
    Run GRAPPA (or recursive DCM-GRAPPA) on the disks
    Merge the resulting trees

Likelihood Approaches
Only one to date: an MCMC method (Larget, Kadane, and Simon).
- Relies on two distinct moves through tree space.
- Essentially untested (just two small datasets used in the report).
- Promising results, but unknown running time.
- Code not available.

Testing Algorithms
How to choose test sets?
- Biological datasets test performance where it matters, but can be used only for ranking, are too few to permit quantitative evaluations, and are often hard to obtain. Good for anecdotal reports and "reality checks."
- Simulated datasets enable absolute evaluations of solution quality and can be generated in arbitrarily large numbers. Only way to obtain valid characterizations.

Experimental Algorithmics
How to test algorithms empirically to obtain a reliable characterization of performance.
Can use experience in the physical sciences, but must adapt to the specific circumstances of computer-based experimentation:
- Algorithm analysis uses asymptotic characterizations, which cannot be inferred from finite data.
- It is easy to alter experiments and re-run a whole series, which leads to drift and biases.
- Controlling variables is just as hard in computational experiments as in the natural sciences; even reproducibility is a challenge.

Experimental Algorithmics: Caveat
At the First Workshop on Algorithms in Bioinformatics, Alberto Caprara remarked that a theoretician solves interesting problems with no practical use whatsoever, whereas an experimental algorithmist solves problems of significant practical use, but only tests algorithms on randomly generated instances.
The types of instances generated in an experimental study make the difference between useful research and wasting a lot of (students' and CPUs') cycles.

Phylogenetic Considerations
- Tree shape plays a very large role.
  Challenge: the shape of the true trees is unknown; moreover, the shape depends on the selection of genomes, e.g., a single genus vs. a sampling of an entire kingdom.
- The evolutionary model is important.
  Challenge: devise an evolutionary model with few parameters that is easily manipulated analytically and computationally and that produces realistic data. (It would also be pleasant if it made biological sense...)
- Test a large range of parameters and use many runs for each setting to estimate variance.
  Challenge: even the simplest of models induces a huge parameter space; sampling it requires smoothness guarantees, but NP-hard search spaces lack smoothness.

Typical Simulation Setup
[Figure: simulation pipeline on four genomes A, B, C, D, with panels labeled True Tree, Rearranged Genomes, Distance Matrix, and Inferred Tree.]

Tree Topologies: Popular Models
- Birth-Death: start with a root branch; in any time interval, there is probability p of speciation along any of the current branches.
  Biologically motivated; trees are well balanced, too well balanced compared to published phylogenies.
  All distance-based methods do very well on BD trees; DCM decompositions are poor, but not needed.
- Uniform Random: all tree topologies equally likely.
  No biological process; trees are fairly imbalanced, too imbalanced compared to published phylogenies.
  NJ does very poorly on uniform trees; DCM-NJ does significantly better.

Tree Topologies: New Models
- Aldous β-splitting: the parameter can be set to produce anything from a ladder (caterpillar), through uniform and birth-death, to perfect balance.
  No biological process; a single parameter cannot localize structure.
  The recommended setting (β = -1) matches the balance of published phylogenies, but algorithmic behavior does not match biological datasets.
- Heard's Multi-Parameter: variable evolutionary rates, inheritable speciation traits, punctuated and gradual evolution.
  Strong biological motivation; too many parameters?
  Subtle differences cause large changes in the performance of distance- and parsimony-based methods.

Generating Genome Rearrangements
Questions:
- What rearrangement events? Are they weighted?
- Are there forbidden regions (e.g., no inversion across a centromere)? Preferred regions or endpoints (hotspots)?
- Relative cost of insertions, duplications, and deletions?
- How to test for robustness?

The Parameter Space
Clearly too large for good sampling:
- tree topologies
- evolutionary models
- evolutionary rates
- tree sizes
- genome sizes
- multiple chromosomes
Need to focus testing on an appropriate subspace.

Some Experimental Results
- Distance-based methods (with and without correction)
- GRAPPA
- DCM-GRAPPA

Results: Distance Methods
- Inversion/transposition/inverted transposition in a 1:1:1 ratio.
- 120 genes per genome; 10, 20, 40, 80, and 160 genomes.
[Figure: false negative rate (%) against normalized maximum pairwise inversion distance (0 to 1) for NJ(BP), NJ(INV), NJ(IEBP), and NJ(EDE).]

Results: GRAPPA
- Biological dataset: 11 plant cpDNA genomes, 37 genes.
- Note: the inverted repeat is not used here.
[Figure: two trees on Cyanophora, Cyanidium, Porphyra, Guillardia, Odontella, Marchantia, Tobacco, Mesostigma, Nephroselmis, and Chlorella. Left: expected tree (placement of Euglena unknown). Right: GRAPPA tree.]

Results: DCM-GRAPPA
- Inversion-only evolution, expected edge length 4.
- 100 genes per genome; 20, 40, 80, 160, 320, and 640 genomes.
[Figure: total number of edges in error (FP) against number of genomes (20 to 640), log/log scale, for NJ, DCM-NJ, and DCM-GRAPPA.]

Some Open Problems
- Tree models
- Evolutionary models
- Extensions of Hannenhalli-Pevzner theory to handle:
  - transpositions and inversions
  - length-dependent rearrangements
  - position-dependent rearrangements
  - duplications
- Good combinatorial formulation of the median problem for inversions and for more general cases.
- Tighter bounds on tree scores.
- Extensions to phylogenetic networks (!)

Conclusions
- Gene-order data carries very good phylogenetic information, much better than sequence data!
- Current algorithmic approaches scale to significant sizes (1,000 for DCM-GRAPPA), comparable to the best achievable with sequence data and with better results.
- Current approaches remain unable to handle unequal gene content with duplications, but major progress has been made over the last 5 years.
- Data availability is increasing rapidly for organellar genomes, slowly for nuclear genomes, and remains very limited compared to sequence data.

compbio.unm.edu
Laboratory for High-Performance Algorithm Engineering and Computational Molecular Biology
Includes all publications by our lab, GRAPPA source files, email addresses, and links to our main collaborators.