Algorithms for Bioinformatics

Similar documents
Genome Rearrangements In Man and Mouse. Abhinav Tiwari Department of Bioengineering

Greedy Algorithms. CS 498 SS Saurabh Sinha

Multiple Whole Genome Alignment

AN EXACT SOLVER FOR THE DCJ MEDIAN PROBLEM

CSE 421 Greedy Algorithms / Interval Scheduling

Perfect Sorting by Reversals and Deletions/Insertions

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB

Computational Genetics Winter 2013 Lecture 10. Eleazar Eskin University of California, Los Angeles

Isolating - A New Resampling Method for Gene Order Data

The combinatorics and algorithmics of genomic rearrangements have been the subject of much

On the complexity of unsigned translocation distance

Genome Rearrangement Problems with Single and Multiple Gene Copies: A Review

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on:

Genome Rearrangement Problems with Single and Multiple Gene Copies: A Review

Breakpoint Graphs and Ancestral Genome Reconstructions

A Framework for Orthology Assignment from Gene Rearrangement Data

Genomes Comparision via de Bruijn graphs

CSE 417. Chapter 4: Greedy Algorithms. Many Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.

Chromosomal rearrangements in mammalian genomes : characterising the breakpoints. Claire Lemaitre

BMI/CS 776 Lecture #20 Alignment of whole genomes. Colin Dewey (with slides adapted from those by Mark Craven)

Scaffold Filling Under the Breakpoint Distance

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB

Chromosomal Breakpoint Reuse in Genome Sequence Rearrangement ABSTRACT

Analysis of Gene Order Evolution beyond Single-Copy Genes

Towards More Effective Formulations of the Genome Assembly Problem

Patterns of Simple Gene Assembly in Ciliates

Approximation Algorithms (Load Balancing)

Chapter 4. Greedy Algorithms. Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.

arxiv: v1 [math.co] 19 Aug 2016

A Methodological Framework for the Reconstruction of Contiguous Regions of Ancestral Genomes and Its Application to Mammalian Genomes

Information Theory and Statistics Lecture 2: Source coding

Algorithms Exam TIN093 /DIT602

CS173 Lecture B, November 3, 2015

Data Structures in Java

1 Matchings in Non-Bipartite Graphs

Genome Rearrangement. 1 Inversions. Rick Durrett 1

Chapter 3. September 11, ax + b = 0.

Divide and Conquer Algorithms. CSE 101: Design and Analysis of Algorithms Lecture 14

Comparing whole genomes

Algorithms and Data Structures 2016 Week 5 solutions (Tues 9th - Fri 12th February)

Conic Sections: THE ELLIPSE

Lecture 12 : Graph Laplacians and Cheeger s Inequality

R.2 Number Line and Interval Notation

16 circles. what goes around...

Multi-Break Rearrangements and Breakpoint Re- Uses: From Circular to Linear Genomes

Chapter 7. Network Flow. Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.

Proof methods and greedy algorithms

Algorithm Design and Analysis

Inferring positional homologs with common intervals of sequences

Breadth First Search, Dijkstra s Algorithm for Shortest Paths

Fast Phylogenetic Methods for the Analysis of Genome Rearrangement Data: An Empirical Study

On Reversal and Transposition Medians

We saw in Section 5.1 that a limit of the form. arises when we compute an area.

Evolutionary Tree Analysis. Overview

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

A New Genomic Evolutionary Model for Rearrangements, Duplications, and Losses that Applies across Eukaryotes and Prokaryotes

Data Compression Techniques

Divide-and-conquer. Curs 2015

Martin Bader June 25, On Reversal and Transposition Medians

Chapter 4. Greedy Algorithms. Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.

Genes order and phylogenetic reconstruction: application to γ-proteobacteria

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

Algorithm Design and Analysis

More on NP and Reductions

COMPSCI 611 Advanced Algorithms Second Midterm Exam Fall 2017

An experimental investigation on the pancake problem

Part V. Matchings. Matching. 19 Augmenting Paths for Matchings. 18 Bipartite Matching via Flows

Lecture 14 - P v.s. NP 1

1 Matrix notation and preliminaries from spectral graph theory

Divide-and-conquer: Order Statistics. Curs: Fall 2017

1 Terminology and setup

Some Algorithmic Challenges in Genome-Wide Ortholog Assignment

Lecture 4. Quicksort

Alignment Algorithms. Alignment Algorithms

CS360 Homework 12 Solution

Design and Analysis of Algorithms

7.5 Bipartite Matching

4 The Cartesian Coordinate System- Pictures of Equations

How to fail Hyperbolic Geometry

POINTS, LINES, DISTANCES

Linear-Time Algorithms for Finding Tucker Submatrices and Lekkerkerker-Boland Subgraphs

Fall 2017 November 10, Written Homework 5

Trees and Meta-Fibonacci Sequences

DE BRUIJN GRAPHS AND THEIR APPLICATIONS TO FAULT TOLERANT NETWORKS

The lattice model of polymer solutions

Aditya Bhaskara CS 5968/6968, Lecture 1: Introduction and Review 12 January 2016

Human biology Laboratory. Cell division. Lecturer Maysam A Mezher

Clairvoyant scheduling of random walks

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS

Algorithm Design and Analysis

Permutations and Combinations

k-protected VERTICES IN BINARY SEARCH TREES

CS781 Lecture 3 January 27, 2011

Math Precalculus I University of Hawai i at Mānoa Spring

Canonical Forms Some questions to be explored by high school investigators William J. Martin, WPI

Péter Ligeti Combinatorics on words and its applications

Chapter 13 Meiosis and Sexual Reproduction

Lecture 15: The Hidden Subgroup Problem

Pythagoras Theorem. What it is: When to use: What to watch out for:

Lecture 2 September 4, 2014

Transcription:

Adapted from slides by Alexandru Tomescu, Leena Salmela, Veli Mäkinen, Esa Pitkänen 582670 Algorithms for Bioinformatics Lecture 5: Combinatorial Algorithms and Genomic Rearrangements 1.10.2015

Background We now have genomes of several species available It is possible to compare genomes of two or more different species = Comparative genomics See Lecture 4 Basic observation: Closely related species (such as human and mouse) can be almost identical in terms of genomic content...... but the order of genomic segments can be very different between species 2 / 40

Example: Human vs. Mouse Human chromosome 6 contains elements from six different mouse chromosomes In chromosome X the rearrangements are mostly within the same chromosome Image from: Gregory et al.: A physical map of the mouse genome. Nature 418, 743-750 (15 August 2002). 3 / 40

4 / 40

Genomic Rearrangements The differences result from genomic rearrangement events Rare evolutionary events The number of such events can be used for estimating the evolutionary distance between species Problem: What is the minimum number of events needed to rearrange one genome into the other? A kind of edit distance 5 / 40

Synteny Blocks Synteny block = similar region in two different genomes Contain homologous sequences and usually genes with similar function Different locations, possibly even different chromosomes Possibly different orientations (i.e., different strands) 4 +3 +6 5 +1 +2 synteny blocks +1 +2 +3 +4 +5 +6 6 / 40

Signed Permutations Assign numbers 1, 2,..., n to the synteny blocks Assign signs (+ or ) to denote orientation A genome is then represented by a sequence of n signed numbers called a signed permutation Permutation because each (unsigned) number appears exactly once 4 +3 +6 5 +1 +2 synteny blocks +1 +2 +3 +4 +5 +6 7 / 40

Sorting permutations The problem is then to convert one signed permutation to the other Using what operations/events? (next slide) By convention, the numbering and signs are chosen so that one of the genomes is represented by the identity permutation (+1 + 2 + n) Then we want to convert the other signed permutation into the identity permutation This is called sorting the permutation 8 / 40

Reversals The most common genomic rearrangement event is reversal A contiguous section of a chromosome is reversed The orientation (sign) of all synteny blocks within the section changes 4 +3 +6 5 +1 +2 4 +5 6 3 +1 +2 9 / 40

Sorting by Reversals problem Goal: Find the shortest series of reversals that tranforms a given signed permutation to the identity permutation Input: Signed permutation P of the numbers 1,..., n Output: A series of reversals that transforms P into (+1 +2 +n). Objective: Minimize the number of reversals The smallest possible number of reversals is called the reversal distance and is denoted by d rev (P). 10 / 40

Sorting by Reversals: Example 4 +3 +6 5 +1 +2 4 +3 2 1 +5 6 +1 +2 3 +4 +5 6 +1 +2 +3 +4 +5 6 +1 +2 +3 +4 +5 +6 This shows that d rev ( 4 +3 +6 5 +1 +2) 4 Is four the smallest number of reversals? 11 / 40

Greedy reversal sort Greedy reversal sort algorithm Move 1 to correct location and orientation Move 2 to correct location and orientation without moving 1 again Move 3 to correct location and orientation without moving 1 or 2 etc. Resembles selection sort 12 / 40

Greedy reversal sort: Example 4 +3 +6 5 +1 +2 1 +5 6 3 +4 6 +1 +5 6 3 +4 +2 +1 2 4 +3 +6 5 +1 +2 4 +3 +6 5 +1 +2 3 +4 +6 5 +1 +2 +3 +4 +6 5 +1 +2 +3 +4 +5 6 +1 +2 +3 +4 +5 +6 13 / 40

How good is greedy reversal sort? Not so good In our example, greedy reversal sort needed 8 reversals but we know that d rev 4 Even worse case is P = ( 6 +1 +2 +3 +4 +5) Greedy reversal sort needs 10 reversals but drev (P) = 2 6 +1 +2 +3 +4 +5 5 4 3 2 1 +6 +1 +2 +3 +4 +5 +6 For any n, there is a permutation P of length n for which greedy reversal sort requires at least (n 1) d rev (P) reversals 14 / 40

Approximation algorithms and approximation ratios Greedy reversal sort is an approximation algorithm. It only produces an approximate solution. A(P): approximate solution returned by algorithm A OPT (P): optimal solution The approximation ratio of (minimization) algorithm A is the maximum approximation ratio over all inputs of size n: max P =n A(P) OPT (P) The approximation ratio for greedy reversal sort is thus at least (n 1) 15 / 40

Breakpoints and Adjacencies Consider a permutation P = (p 1 p 2 p n ) Add p 0 = 0 to the beginning and p n+1 = +(n + 1) to the end Each consecutive pair (p i p i+1 ) is Adjacency if pi+1 p i = 1 Breakpoint otherwise Example P = (+3 +4 +5 1 2 7 6 +8) 0 +3 +4 +5 1 2 7 6 +8 +9 Adjacencies: (+3 +4), (+4 +5), ( 7 6), (+8 +9) Breakpoints: (0 +3), (+5 1), ( 1 2), ( 2 7), ( 6 +8) In the identity permutation (+1 +n), all pairs are adjacencies 16 / 40

Reversals and breakpoints How a reversal affects a pair (p i p i+1 ) No change if (p i p i+1 ) is outside the reversal region Reversal and inversion of signs if (p i p i+1 ) is inside the region (p i p i+1 ) ( p i+1 p i ) But no change to breakpoint status since pi+1 p i = ( p i ) ( p i+1 ) The pair is separated if it is on the reversal region boundary Thus a reversal changes the breakpoint status of at most two pairs 17 / 40

Breakpoint Theorem Breakpoints(P) = the number of breakpoints in P Theorem: For any signed permutation P, d rev (P) Breakpoints(P)/2 Proof Breakpoints(I ) = 0 for identity permutation I A reversal eliminates at most two breakpoints At least Breakpoints(P)/2 reversals are needed to reduce Breakpoints(P) to zero 18 / 40

Breakpoint Theorem: Example Earlier we saw that d rev (P) 4 for P = ( 4 +3 +6 5 +1 +2) Since Breakpoints(P) = 6, the breakpoint theorem shows that d rev (P) 3 If d rev (P) = 3, there would have to be a series of three reversals each of which eliminates exactly two breakpoints Is there one? Otherwise d rev (P) = 4 19 / 40

How good is Breakpoint Theorem? Consider a permutation P = (+n +(n 1) +1) Requires n + 1 reversals to sort: d rev (P) = n + 1 Breakpoints(P) = n + 1 Breakpoint Theorem: d rev (P) (n + 1)/2 Factor of two too small lower bound Is this good? 20 / 40

Breakpoint Reversal Sort 1: while Breakpoints(P) > 0 do 2: if There is a reversal r 2 that reduces Breakpoints(P) by two then 3: Perform r 2 4: else if There is a reversal r 1 that reduces Breakpoints(P) by one then 5: Perform r 1 6: else 7: Perform a reversal that does not increase Breakpoints(P) 21 / 40

Breakpoint Reversal Sort: Analysis If Breakpoints(P) > 0, there is always a reversal that does not increase Breakpoints(P) Reversal boundaries at breakpoints If P contains negative signs, there is always a reversal that decreases Breakpoints(P) Proof as exercise Thus two consecutive reversals by the algorithm always decreases Breakpoints(P) by at least one The number of reversals is at most 2 Breakpoints(P) Since d rev (P) Breakpoints(P)/2, the approximation ratio is at most 4 Is this good? 22 / 40

Computing reversal distance The breakpoint analysis gives lower and upper bounds for reversal distance that are no more than a factor of four apart Often much closer bounds There exists a linear time algorithm for computing the reversal distance but it is much more complicated (Bader, Moret & Yan, 2001) However, allowing operations other than reversals makes the problem easier! 23 / 40

Other genomic rearrangements Fission: Split a chromosome into two Fusion: Concatenate two chromosomes into one Translocation: Split two chromosomes and join them together differently (+1 +2 +3 +4 +5) (+6 +7 +8 +9) (+1 +2 +8 +9) (+6 +7 +3 +4 +5) More common than fusion or fission 24 / 40

Synteny Block Graph By convention, the synteny blocks are now labelled with letters rather than numbers + and signs are still used For each synteny block, the graph has two nodes connected by a directed edge indicating the orientation of the block For each chromosome, the directed edges are connected by undirected edges in the order they appear in the chromosome Example: P = (+a b c +d) a b c d Nodes can be moved around without losing information a b c d 25 / 40

Cyclic Synteny Block Graph Arrange block edges on a circle Add a connecting edge to complete the cycle As if the chromosome was circular Example: P = (+a b c +d) b a c d 26 / 40

Reversal in Synteny Block Graph Reversal operation replaces a pair of connecting edges with a different pair of edges between the same nodes Example: Reversal transform of (+a b c +d) into (+a b d +c) b b a c Reversal a c d d 27 / 40

Fission and Fusion in Synteny Block Graph Fission and fusion too are similar edge replacements Example: Fission of (+a b c +d) into (+a b) ( c +d) and the inverse fusion b b Fission a c a c Fusion d d 28 / 40

2-Breaks The basic synteny block graph operation is 2-break: Remove two connecting (undirected) edges Add two edges connecting the same four nodes differently As we saw, a single 2-break can cause a reversal, a fission or a fusion 29 / 40

Translocation in Synteny Block Graph Translocation too can be implemented by a 2-break but on a graph without the edges that connect chromosome endpoints Example: Translocation of (+a b) ( c +d) into (+a +d) ( c b) b b a c Translocation a c d d 30 / 40

2-Break Distance For simplicity, we consider only fully cyclic graphs Corresponds to genomes with circular chromosomes We are interested in the minimal number of 2-breaks needed to transform one graph into the other Corresponds to the minimal number of reversals, fissions and fusions to transform one genome to the other This number is the 2-break distance 31 / 40

2-Break Distance Problem Goal: Compute the 2-break distance between genomes Input: Two genomes P and Q with circular chromosomes on the same set of synteny blocks Output: The 2-break distance d(p, Q) between P and Q 32 / 40

Breakpoint Graph The breakpoint graph Breakpoint(P, Q) is a merging of the synteny block graphs for P and Q Shared block edges (which can be omitted after construction) Separate connecting edges Example: Breakpoint(P, Q) for P = (+a b c +d) and Q = (+a +c +b d) b a c d 33 / 40

Cycles in Breakpoint Graph A breakpoint graph contains cycles with alternating red and blue edges Let Cycles(P, Q) denote the number of such cycles in Breakpoint(P, Q) Example: Cycles(P, Q) = 2 for P = (+a b c +d) and Q = (+a +c +b d) 34 / 40

Cycles and 2-Breaks A 2-break can increase the number of cycles by one, decrease the number of cycles by one or not change the number of cycles no change increase decrease Note that the alternating red blue cycles are different from the cycles representing chromosomes. Thus a 2-break splitting a red-blue cycle does not necessarily represent a fission. 35 / 40

Cycle Theorem Theorem: For any genomes P and Q, any 2-break applied to P or to Q can increase Cycles(P, Q) by at most one. Proof A 2-break removes two edges and adds two edges. The first removal of an edge breaks a cycle. (The second removal might be in the same cycle.) Each addition of an edge can create at most one cycle. The total increase in the number of cycles is at most 1 + 2 = 1 36 / 40

Cycles in Identical Genomes Let Blocks(P, Q) be the number of synteny blocks (shared by P and Q) Cycles(P, P) = Blocks(P, Q) 37 / 40

Splitting Cycles A cycle longer than two can always be split into two cycles by a 2-break. Remove any two edges on the cycle and replace them appropriately. If Cycles(P, Q) < Blocks(P, Q), there must be at least one cycle longer than two, and thus there exists a 2-break that increases the number of cycles. 38 / 40

Cycles and 2-Break Distance Theorem: For any genomes P and Q, d(p, Q) = Blocks(P, Q) Cycles(P, Q). Proof Consider a series of 2-breaks that change Q into P The series must have at least Cycles(P, P) Cycles(P, Q) 2-breaks, because each 2-break can add at most one cycle. On the other hand, there exists a series of no more than Cycles(P, P) Cycles(P, Q) 2-breaks, because we can always find a 2-break that increases the number of cycles. Thus d(p, Q) = Cycles(P, P) Cycles(P, Q) = Blocks(P, Q) Cycles(P, Q). 39 / 40

2-Break Distance vs. Reversal Distance 2-break distance is simple to compute Reversal distance is complicated to compute Study groups Adding the extra operations, fission and fusion, helped make the problem simpler 40 / 40