Graphs, permutations and sets in genome rearrangement

Similar documents
I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB

Genome Rearrangements In Man and Mouse. Abhinav Tiwari Department of Bioengineering

On the complexity of unsigned translocation distance

AN EXACT SOLVER FOR THE DCJ MEDIAN PROBLEM

Martin Bader June 25, On Reversal and Transposition Medians

Perfect Sorting by Reversals and Deletions/Insertions

EVOLUTIONARY DISTANCES

Greedy Algorithms. CS 498 SS Saurabh Sinha

Sorting by Transpositions is Difficult

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on:

The combinatorics and algorithmics of genomic rearrangements have been the subject of much

Multiple Whole Genome Alignment

Genome Rearrangement Problems with Single and Multiple Gene Copies: A Review

Genome Rearrangement Problems with Single and Multiple Gene Copies: A Review

An experimental investigation on the pancake problem

Comparing whole genomes

Discrete Applied Mathematics

Reconstructing genome mixtures from partial adjacencies

Computer Science 385 Analysis of Algorithms Siena College Spring Topic Notes: Limitations of Algorithms

Algorithms in Computational Biology (236522) spring 2008 Lecture #1

Algorithms for Bioinformatics

Chromosomal rearrangements in mammalian genomes : characterising the breakpoints. Claire Lemaitre

arxiv: v1 [math.co] 21 Sep 2014

Comparative genomics: Overview & Tools + MUMmer algorithm

CONTENTS. P A R T I Genomes 1. P A R T II Gene Transcription and Regulation 109

ARRANGEABILITY AND CLIQUE SUBDIVISIONS. Department of Mathematics and Computer Science Emory University Atlanta, GA and

FIVE-LIST-COLORING GRAPHS ON SURFACES I. TWO LISTS OF SIZE TWO IN PLANAR GRAPHS

Some Algorithmic Challenges in Genome-Wide Ortholog Assignment

A Framework for Orthology Assignment from Gene Rearrangement Data

Let S be a set of n species. A phylogeny is a rooted tree with n leaves, each of which is uniquely

Exploring Phylogenetic Relationships in Drosophila with Ciliate Operations

Analysis of Gene Order Evolution beyond Single-Copy Genes

Isolating - A New Resampling Method for Gene Order Data

Effects of Gap Open and Gap Extension Penalties

arxiv: v1 [math.co] 19 Aug 2016

Computational Biology: Basics & Interesting Problems

Genomes Comparision via de Bruijn graphs

Counting and Constructing Minimal Spanning Trees. Perrin Wright. Department of Mathematics. Florida State University. Tallahassee, FL

Fractional coloring and the odd Hadwiger s conjecture

A NOTE ON THE PRECEDENCE-CONSTRAINED CLASS SEQUENCING PROBLEM

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Steps Toward Accurate Reconstructions of Phylogenies from Gene-Order Data 1

Three-dimensional Stable Matching Problems. Cheng Ng and Daniel S. Hirschberg. Department of Information and Computer Science

Sequence analysis and comparison

Multiple genome rearrangement

Intro Gene regulation Synteny The End. Today. Gene regulation Synteny Good bye!

Chromosomal Breakpoint Reuse in Genome Sequence Rearrangement ABSTRACT

Synteny Portal Documentation

Analysis and Design of Algorithms Dynamic Programming

Edit Distance with Move Operations

Presentation by Julie Hudson MAT5313

arxiv: v1 [math.co] 25 Apr 2013

A Phylogenetic Network Construction due to Constrained Recombination

Tandem Mass Spectrometry: Generating function, alignment and assembly

Mathematical Induction Again

Random sorting networks

NP-Hardness reductions

Multiple Sequence Alignment: HMMs and Other Approaches

Genome Rearrangement

Independence in Function Graphs

Comparing Genomes with Duplications: a Computational Complexity Point of View

Mathematical Induction Again

Inferring positional homologs with common intervals of sequences

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

SAT in Bioinformatics: Making the Case with Haplotype Inference

via Tandem Mass Spectrometry and Propositional Satisfiability De Novo Peptide Sequencing Renato Bruni University of Perugia

Computer Proofs in Algebra, Combinatorics and Geometry

Breakpoint Graphs and Ancestral Genome Reconstructions

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

An Improved Algorithm for Ancestral Gene Order Reconstruction

Physical Mapping. Restriction Mapping. Lecture 12. A physical map of a DNA tells the location of precisely defined sequences along the molecule.

Permutation groups/1. 1 Automorphism groups, permutation groups, abstract

Dr. Amira A. AL-Hosary

CSC 8301 Design & Analysis of Algorithms: Lower Bounds

Molecular evolution - Part 1. Pawan Dhar BII

Efficient Haplotype Inference with Boolean Satisfiability

Some operations preserving the existence of kernels

4.1 Notation and probability review

Patterns of Simple Gene Assembly in Ciliates

Genome Rearrangement. 1 Inversions. Rick Durrett 1

Denitive Proof of the Twin-Prime Conjecture

Techniques for Multi-Genome Synteny Analysis to Overcome Assembly Limitations

Probe Placement Problem

BIOINFORMATICS: An Introduction

Phylogenetic Networks, Trees, and Clusters

Rearrangements and polar factorisation of countably degenerate functions G.R. Burton, School of Mathematical Sciences, University of Bath, Claverton D

1 I A Q E B A I E Q 1 A ; E Q A I A (2) A : (3) A : (4)

Scaffold Filling Under the Breakpoint Distance

A New Fast Heuristic for Computing the Breakpoint Phylogeny and Experimental Phylogenetic Analyses of Real and Synthetic Data

Motivating the need for optimal sequence alignments...

Advances in Phylogeny Reconstruction from Gene Order and Content Data

Haplotyping as Perfect Phylogeny: A direct approach

Restricted b-matchings in degree-bounded graphs

ground state degeneracy ground state energy

Gene Maps Linearization using Genomic Rearrangement Distances

CS 301: Complexity of Algorithms (Term I 2008) Alex Tiskin Harald Räcke. Hamiltonian Cycle. 8.5 Sequencing Problems. Directed Hamiltonian Cycle

1 Introduction. Abstract

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS

Transcription:

ntroduction Graphs, permutations and sets in genome rearrangement 1 alabarre@ulb.ac.be Universite Libre de Bruxelles February 6, 2006 Computers in Scientic Discovery 1 Funded by the \Fonds pour la Formation a la Recherche dans l'ndustrie et dans l'agriculture" (F.R..A.).

ntroduction ntroduction A few biological denitions Sequence alignment Genome rearrangement General statement of the problem Comparative genomics The On-Line Encyclopedia of nteger Sequences Computer-assisted proofs Parallel computing Further possible uses

ntroduction A few biological denitions Sequence alignment Genome rearrangement General statement of the problem A few biological denitions Life = DNA DNA = double helix of nucleotides (A, C, G and T) Genes = sequences of nucleotides Chromosome = (ordered) set of genes

ntroduction A few biological denitions Sequence alignment Genome rearrangement General statement of the problem Sequence alignment Comparison at the nucleotide level Example: T C C T G C C A T C T A j j j j j j j j j T C G T G A C A T G G C A Matches, dierences, insertions and deletions

ntroduction A few biological denitions Sequence alignment Genome rearrangement General statement of the problem Genome rearrangement Comparison at the gene level Species dier not only by \content", but also by order: genes spread over dierent sets of chromosomes genes ordered dierently on the same chromosome Example: many genes in cabbage and turnip are 99% identical

ntroduction A few biological denitions Sequence alignment Genome rearrangement General statement of the problem General statement of the problem The problem to solve can be summarized as: Given two (or more) genomes, nd a sequence of mutations that transforms one into the other and is of minimal length. Dierent assumptions yield dierent models: gene order gene orientation duplications/deletions in the genome mutations taken into account weights given to mutations miscellaneous restrictions

ntroduction Known gene order: permutations Assumptions: gene order is known each gene appears exactly once in each genome Therefore: fgenesg = f1 2 : : : ng genome = permutation of f1 2 : : : ng One or several operations \Computing the X distance" \Sorting by X 's"

o y o y o y o y o o Outline ntroduction [Bafna and Pevzner, 1995] A transposition exchanges adjacent intervals: ( 5 4 3 2 1 ) Cycle graph: # ( 5 2 1 4 3 ) 0 5 4 3 2 1 6 Complexity and diameter are unknown

ntroduction : example ( 5 4 3 2 1 ) # ( 3 2 5 4 1 ) # ( 3 4 1 2 5 ) # (1 2 3 4 5)

ntroduction : some personal results n [Labarre, 2005]: Classes of permutations for which the distance can be computed New tight upper bound n [Doignon and Labarre, 2006]: Bijection between cycle graph structures and factorisations of permutations Formula for the number of permutations with a given cycle graph structure n [Labarre, 2006]: tighter bounds and other tractable classes of permutations

ntroduction Genes spread over dierent chromosomes: sets A genome G is dened by: a set of n genes f1 2 : : : ng, and a collection of k chromosomes C1, C 2, : : :, C k Problem: given genomes G1 = fc 11 C 12 : : : C 1k g G 2 = fc 21 C 22 : : : C 2l g and a set of operations, nd the minimum sequence of operations bringing G 1 into G 2

ntroduction [Ferretti et al., 1996] Two genes on the same chromosome are \in synteny" The set of operations consists of: 1. ssions C! fc 1 C 2 g 2. fusions fd 1 D 2 g! D 3. translocations fc 1 [ C 2 D 1 [ D 2 g! fc 1 [ D 1 C 2 [ D 2 g Canonical form for this problem: Transform genome G = fc 1 C 2 : : : C k g into ff1g f2g : : : fngg Problem is NP{hard [DasGupta et al., 1998]

ntroduction Synteny graph Vertices are subsets C i Edges are fc i C j g (i 6= j) such that C i \ C j 6= f1 3 4g f4 5 6g = = = = = = = = = = = = = = = = = f3 5 8g f2g f1g u uuuu u uuu f2 7g f2 9g Goal: eliminate all edges and obtain only singletons translocations: 0 ssions: 0

ntroduction Synteny graph Vertices are subsets C i Edges are fc i C j g (i 6= j) such that C i \ C j 6= f1 3 4g f3 4 5g f1g f6 8g f2g u uuuu u uuu f2 7g f2 9g Goal: eliminate all edges and obtain only singletons translocations: 1 ssions: 0

ntroduction Synteny graph Vertices are subsets C i Edges are fc i C j g (i 6= j) such that C i \ C j 6= f1 3 4g f5g f6 8g f2g f1g u uuuu u uuu f2 7g f2 9g Goal: eliminate all edges and obtain only singletons translocations: 2 ssions: 0

ntroduction Synteny graph Vertices are subsets C i Edges are fc i C j g (i 6= j) such that C i \ C j 6= f3 4g f6 8g f2g u uuuu u uuu f2 7g f2 9g f5g f1g Goal: eliminate all edges and obtain only singletons translocations: 3 ssions: 0

ntroduction Synteny graph Vertices are subsets C i Edges are fc i C j g (i 6= j) such that C i \ C j 6= f3 4g f6 8g f2g f7g f2 9g f5g f1g Goal: eliminate all edges and obtain only singletons translocations: 4 ssions: 0

ntroduction Synteny graph Vertices are subsets C i Edges are fc i C j g (i 6= j) such that C i \ C j 6= f3 4g f6 8g f2g f7g f9g f5g f1g Goal: eliminate all edges and obtain only singletons translocations: 5 ssions: 0

ntroduction Synteny graph Vertices are subsets C i Edges are fc i C j g (i 6= j) such that C i \ C j 6= f3gf4g f6 8g f2g f7g f9g f5g f1g Goal: eliminate all edges and obtain only singletons translocations: 5 ssions: 1

ntroduction Synteny graph Vertices are subsets C i Edges are fc i C j g (i 6= j) such that C i \ C j 6= f3gf4g f6gf8g f2g f7g f9g f5g f1g Goal: eliminate all edges and obtain only singletons translocations: 5 ssions: 2

ntroduction Comparative genomics The On-Line Encyclopedia of nteger Sequences Computer-assisted proofs Parallel computing Further possible uses Comparative genomics Extensive use of computers in order to compare a set of species, you need to: 1. get their genomes: from a database if it has been done by sequencing them otherwise 2. infer their phylogeny: 2.1 compute distances pairwise (can be NP-hard) 2.2 reconstruct putative scenarios (exponential number) 2.3 discriminate (exponential number of good scenarios) 3. make a choice between topologies Use of software for each task

ntroduction Comparative genomics The On-Line Encyclopedia of nteger Sequences Computer-assisted proofs Parallel computing Further possible uses The On-Line Encyclopedia of nteger Sequences 2 \What is the set of all objects that satisfy property P?" 1. for k = 0 1 2 : : :, generate all elements and count those that verify P 2. cardinalities form a sequence 3. input sequence into the Encyclopedia 4. assuming there are matches, try to relate the sets of objects counted Examples: maximal instances for a particular distance instances with distance k (distribution) graphs with a given structure [Doignon and Labarre, 2006] 2 http://www.research.att.com/~njas/sequences/

ntroduction Comparative genomics The On-Line Encyclopedia of nteger Sequences Computer-assisted proofs Parallel computing Further possible uses Computer-assisted proofs Best approximation ratio for sorting by transpositions is 11=8 [Elias and Hartman, 2005] The proof is computer-assisted: 1. generate cases to test (more than 80,000) 2. solve cases 3. verify solutions Previous notorious examples: the Four Colour Theorem [Appel and Haken, 1977, Appel et al., 1977] the proof of Kepler's conjecture, which is to become even more computer-driven (see papers by Thomas Hales 3 ) Heated topic 3 http://www.math.pitt.edu/~thales/

ntroduction Comparative genomics The On-Line Encyclopedia of nteger Sequences Computer-assisted proofs Parallel computing Further possible uses Parallel computing Genomes can be huge So are the running times of exact algorithms for NP-hard (GR) problems When possible: partition instances into \independently sortable components" assign each component to a dierent CPU/machine

ntroduction Comparative genomics The On-Line Encyclopedia of nteger Sequences Computer-assisted proofs Parallel computing Further possible uses Further possible uses Underlying graph problems in genome rearrangement possible uses of GraPHedron? Characterization of special classes of permutations: development of a conjecture-making tool on permutations? Eciently solving GR problems (work by Fertin et al.): SAT is a well-studied NP-hard problem transform instances of GR into instances of SAT solve GR problem through SAT solvers

ntroduction Comparative genomics The On-Line Encyclopedia of nteger Sequences Computer-assisted proofs Parallel computing Further possible uses Appel, K. and Haken, W. (1977). Every planar map is four colorable.. Discharging. llinois J. Math., 21(3):429{490. Appel, K., Haken, W., and Koch, J. (1977). Every planar map is four colorable.. Reducibility. llinois J. Math., 21(3):491{567. Bafna, V. and Pevzner, P. A. (1995). Sorting permutations by transpositions. n Proceedings of SODA, pages 614{623. ACM/SAM. DasGupta, B., Jiang, T., Kannan, S., Li, M., and Sweedyk, E. (1998). On the complexity and approximation of syntenic distance. Discrete Applied Mathematics, 88(1-3):59{82. Doignon, J.-P. and Labarre, A. (2006). On Hultman numbers. Submitted. Elias,. and Hartman, T. (2005). A 1:375-approximation algorithm for sorting by transpositions. n Proceedings of WAB, LNB 3692, pages 204{214.

ntroduction Comparative genomics The On-Line Encyclopedia of nteger Sequences Computer-assisted proofs Parallel computing Further possible uses Ferretti, V., Nadeau, J. H., and Sanko, D. (1996). Original synteny. n Proceedings of CPM, LNCS 1075, pages 159{167. Labarre, A. (2005). A new tight upper bound on the transposition distance. n Proceedings of WAB, LNB 3692, pages 216{227. Labarre, A. (2006). New bounds and tractable instances for the transposition distance. Submitted.