Genome Rearrangements:


Genome Rearrangements: from Biological Problem to Combinatorial Algorithms (and back)
Pavel Pevzner, Department of Computer Science, University of California at San Diego

Genome Rearrangements
(Figure: Mouse X chromosome; unknown ancestor, ~80 M years ago; Human X chromosome.)
- What is the evolutionary scenario for transforming one genome into the other?
- What is the organization of the ancestral genome?
- Are there any rearrangement hotspots in mammalian genomes?

Genome Rearrangements: Evolutionary Scenarios
(Figure: unknown ancestor, ~80 M years ago.)
- What is the evolutionary scenario for transforming one genome into the other?
- What is the organization of the ancestral genome?
- Are there any rearrangement hotspots in mammalian genomes?
Reversal flips a segment of a chromosome.

Genome Rearrangements: Ancestral Reconstruction
- What is the evolutionary scenario for transforming one genome into the other?
- What is the organization of the ancestral genome?
- Are there any rearrangement hotspots in mammalian genomes?
Bourque, Tesler, PP, Genome Res.

Genome Rearrangements: Evolutionary Earthquakes
- What is the evolutionary scenario for transforming one genome into the other?
- What is the organization of the ancestral genome?
- Are there any rearrangement hotspots in mammalian genomes?

Rearrangement Hotspots in Tumor Genomes
- Rearrangements may disrupt genes and alter gene regulation.
- Example: a rearrangement in leukemia yields the Philadelphia chromosome.
  (Figure labels: Chr 9, Chr 22; promoter, ABL gene; promoter, BCR gene; promoter, BCR gene; promoter, c-abl oncogene.)
- Thousands of rearrangement hotspots are known for different tumors.

Rearrangement Hotspots in Tumor Genomes
(Figure: MCF7 breast cancer cell line.)
- What is the evolutionary scenario for transforming one genome into the other?
- What is the organization of the ancestral genome?
- Where are the rearrangement hotspots in mammalian genomes?

Controversy: Evolution vs. Intelligent Design

Three Evolutionary Controversies
- Primate - Rodent - Carnivore Split
- Rearrangement Hotspots
- Whole Genome Duplications

Primate - Rodent - Carnivore Split
(Figure: tree topologies labeled rodent-carnivore split, primate-carnivore split, and primate-rodent split.)

Primate - Rodent - Carnivore Split (who is closer to us: mouse or dog?)
(Figure: tree topologies labeled rodent-carnivore split, primate-carnivore split, and primate-rodent split.)

Primate-Rodent vs. Primate-Carnivore Split
- Hutley et al., MBE, May 07: "We have demonstrated with very high confidence that the rodents diverged before carnivores and primates."
- Lunter, PLOS CB, April 07: "It appears unjustified to continue to consider the phylogeny of primates, rodents, and canines as contentious."
(Figure labels: primate-rodent split; primate-carnivore split. Cited: Arnason et al., PNAS 02; Canarozzi, PLOS CB 06; Murphy et al., Science 01; Kumar & Hedges, Nature 98.)

Reconstruction of Ancestral Genomes: Human / Mouse / Rat

Can Rearrangement Analysis Resolve the Primate - Rodent - Carnivore Controversy?

Susumu Ohno: Two Hypotheses (Ohno, 1970, 1973)
- Random Breakage Hypothesis: Genomic architectures are shaped by rearrangements that occur randomly (no fragile regions).
- Whole Genome Duplication (WGD) Hypothesis: Big leaps in evolution would have been impossible without whole genome duplications.


Random Breakage Model (RBM)
- The random breakage hypothesis was embraced by biologists and has become a de facto theory of chromosome evolution.
- Nadeau & Taylor, Proc. Nat'l Acad. Sci. 1984:
  - First convincing arguments in favor of the Random Breakage Model (RBM)
  - RBM implies that there are no rearrangement hotspots
  - RBM was re-iterated in hundreds of papers
- PP & Tesler, Proc. Nat'l Acad. Sci. 2003:
  - Rejected RBM and proposed the Fragile Breakage Model (FBM)
  - Postulated existence of rearrangement hotspots and vast breakpoint re-use
  - FBM implies that the human genome is a mosaic of solid and fragile regions.

Are the Rearrangement Hotspots Real?
- The Fragile Breakage Model did not live long: in 2003 David Sankoff presented convincing arguments against FBM: "we have shown that breakpoint re-use of the same magnitude as found in Pevzner and Tesler, 2003 may very well be artifacts in a context where NO re-use actually occurred."

Are the Rearrangement Hotspots Real?
- The Fragile Breakage Model did not live long: in 2003 David Sankoff presented arguments against FBM (Sankoff & Trinh, J. Comp. Biol, 2005): "we have shown that breakpoint re-use of the same magnitude as found in Pevzner and Tesler, 2003 may very well be artifacts in a context where NO re-use actually occurred."
"Before you criticize people, you should walk a mile in their shoes. That way, when you criticize them, you are a mile away. And you have their shoes." J.K. Lambert

Rebuttal of the Rebuttal of the Rebuttal
- Peng et al., 2006 (PLOS Computational Biology) found an error in the Sankoff-Trinh rebuttal: "If Sankoff & Trinh fixed their ST-Synteny algorithm, they would confirm rather than reject Pevzner-Tesler's Fragile Breakage Model."
- Sankoff, 2006 (PLOS Computational Biology): "Not only did we foist a hastily conceived and incorrectly executed simulation on an overworked RECOMB conference program committee, but worse (nostra maxima culpa) we obliged a team of high-powered researchers to clean up after us!"

Controversy continues
- Random Breakage Model controversy: While Sankoff acknowledged the flaw in his arguments against RBM, he appears reluctant to acknowledge the rebuttal of RBM, this time arguing that more complex rearrangement events (e.g., transpositions) may create an appearance of breakpoint re-use.
- Most recent studies support the Fragile Breakage Model: van der Wind, 2004, Bailey, 2004, Zhao et al., 2004, Murphy et al., 2005, Hinsch & Hannenhalli, 2006, Ruiz-Herrera et al., 2006, Yue & Haaf, 2006, Mehan et al., 2007, etc.
- Kikuta et al., Genome Res. 2007: "... the Nadeau and Taylor hypothesis is not possible for the explanation of synteny in rat."

Rebuttal of the Rebuttal of the Rebuttal: Controversy Resolved... Or Was It?
- Nearly all recent studies support the Fragile Breakage Model: van der Wind, 2004, Bailey, 2004, Zhao et al., 2004, Murphy et al., 2005, Hinsch & Hannenhalli, 2006, Ruiz-Herrera et al., 2006, Yue & Haaf, 2006, Mehan et al., 2007, etc.
- Kikuta et al., Genome Res. 2007: "... the Nadeau and Taylor hypothesis is not possible for the explanation of synteny"

Random vs. Fragile Breakage Debate Continues: Complex Rearrangements
- PP & Tesler, PNAS 2003, argued that every evolutionary scenario for transforming Mouse into Human genome with reversals, translocations, fusions, and fissions must result in a large number of breakpoint re-uses, a contradiction to the RBM.
- Sankoff, PLoS Comp. Biol. 2006: "We cannot infer whether mutually randomized synteny block orderings derived from two divergent genomes were created... through processes other than reversals and translocations."

Whole Genome Duplication (WGD)

Genome Rearrangements After WGD


Whole Genome Duplication Hypothesis Was Confirmed After Years of Controversy
- The Whole Genome Duplication hypothesis first met with skepticism but was finally confirmed by Kellis et al., Nature 2004: "Our analysis resolves the longstanding controversy on the ancestry of the yeast genome."
  - There was a whole-genome duplication. Wolfe, Nature, 1997
  - There was no whole-genome duplication. Dujon, FEBS, 2000
  - Duplications occurred independently. Langkjaer, JMB, 2000
  - Continuous duplications. Dujon, Yeast 2003
  - Multiple duplications. Friedman, Gen. Res, 2003
  - Spontaneous duplications. Koszul, EMBO, 2004

Whole Genome Duplication Hypothesis Was Eventually Confirmed... But Some Scientists Are Not Convinced
- Kellis, Birren & Lander, Nature 2004: "Our analysis resolves the longstanding controversy on the ancestry of the yeast genome."
- Martin et al., Biology Direct 2007: "We believe that the proposal of a Whole Genome Duplication in the yeast lineage is unwarranted."
- To address the Martin et al., 2007 arguments against WGD, it would be useful to reconstruct the pre-duplicated genomes.
  - There was a whole-genome duplication. Wolfe, Nature, 1997
  - There was no whole-genome duplication. Dujon, FEBS, 2000
  - Duplications occurred independently. Langkjaer, JMB, 2000
  - Continuous duplications. Dujon, Yeast 2003
  - Multiple duplications. Friedman, Gen. Res, 2003
  - Spontaneous duplications. Koszul, EMBO, 2004
  - Our analysis resolved the controversy. Kellis, Nature, 2004
  - WGD in the yeast lineage is unwarranted. Martin, Biology Direct, 2007
  - ...

Genome Halving Problem
Reconstruction of the pre-duplicated ancestor: Genome Halving Problem

From Biological Controversies to Algorithmic Problems
- Primate - Rodent - Carnivore Split: Ancestral Genome Reconstruction
- Rearrangement Hotspots: Breakpoint Re-use Analysis
- Whole Genome Duplications: Genome Halving Problem

Algorithmic Background: Genome Rearrangements and Breakpoint Graphs

Unichromosomal Genomes: Reversal Distance
- Sorting by reversals: find the shortest series of reversals transforming one uni-chromosomal genome into another.
- The number of reversals in such a shortest series is the reversal distance between genomes.
- Hannenhalli and PP (FOCS 1995) gave a polynomial algorithm for computing the reversal distance.
Step 0:  2 -4 -3  5 -8 -7 -6  1
Step 1:  2  3  4  5 -8 -7 -6  1
Step 2: -5 -4 -3 -2 -8 -7 -6  1
Step 3: -5 -4 -3 -2 -1  6  7  8
Step 4:  1  2  3  4  5  6  7  8
- Sorting by prefix reversals (pancake flipping problem): Gates and Papadimitriou, Discrete Appl. Math. 1976

Sorting by reversals: Most parsimonious scenarios
Step 0:  2 -4 -3  5 -8 -7 -6  1
Step 1:  2  3  4  5 -8 -7 -6  1
Step 2: -5 -4 -3 -2 -8 -7 -6  1
Step 3: -5 -4 -3 -2 -1  6  7  8
Step 4:  1  2  3  4  5  6  7  8
The reversal distance is the minimum number of reversals required to transform one gene order into another. Here, the distance is 4.
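The four-step scenario on this slide can be replayed with a small sketch. The function name `reversal` and the 0-based index pairs in `scenario` are illustrative assumptions chosen by hand to match the listed steps; this replays one given scenario and is not an algorithm for finding a shortest one.

```python
def reversal(perm, i, j):
    """Reverse the segment perm[i..j] (inclusive) and flip every sign in it."""
    flipped = [-x for x in reversed(perm[i:j + 1])]
    return perm[:i] + flipped + perm[j + 1:]


start = [2, -4, -3, 5, -8, -7, -6, 1]
# Hypothetical (i, j) pairs, 0-based, picked to reproduce Steps 1-4 of the slide.
scenario = [(1, 2), (0, 3), (4, 7), (0, 4)]

steps = [start]
for i, j in scenario:
    steps.append(reversal(steps[-1], i, j))

for k, perm in enumerate(steps):
    print(f"Step {k}:", " ".join(f"{x:+d}" for x in perm))
```

Each reversal both reverses the order of the affected blocks and flips their signs, which is why signed permutations are used to model gene orientation.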

Multichromosomal Genomes: Genomic Distance
- Genomic distance between two genomes is the minimum number of reversals, translocations, fusions, and fissions required to transform one genome into another.
- Hannenhalli and PP (STOC 1995) gave a polynomial algorithm for computing the genomic distance.
- These algorithms were followed by many improvements: Kaplan et al. 1999, Bader et al. 2001, Tesler 2002, Ozery-Flato & Shamir 2003, Tannier & Sagot 2004, Bergeron 2001-07, etc.

HP Theory Is Rather Complicated: Is There a Simpler Alternative?
- HP theory is a key tool in most genome rearrangement studies. However, it is rather complicated, making it difficult to apply in complex setups such as the RBM vs. FBM or WGD controversies.
- To study the outlined evolutionary controversies, we use an alternative (simpler) approach called k-break analysis.

Simplifying HP Theory: from Linear to Circular Chromosomes
(Figure: a linear and a circular chromosome on blocks a, b, c, d.)
A chromosome is represented as a cycle with directed red and undirected black edges:
- red edges encode blocks (genes)
- black edges connect adjacent blocks

Reversals on Circular Chromosomes
(Figure: a reversal applied to a circular chromosome on blocks a, b, c, d.)
A reversal replaces two black edges with two other black edges.

Fissions
(Figure: a fission applied to a circular chromosome on blocks a, b, c, d.)
- A fission splits a single cycle (chromosome) into two.
- A fission replaces two black edges with two other black edges.

Translocations / Fusions
(Figure: a fusion applied to two circular chromosomes on blocks a, b, c, d.)
- A translocation/fusion merges two cycles (chromosomes) into a single one.
- They also replace two black edges with two other black edges.

2-Breaks
(Figure: a 2-break on blocks a, b, c, d.)
- A 2-break replaces any 2 black edges with another 2 black edges forming a matching on the same 4 vertices.
- Reversals, translocations, fusions, and fissions represent all possible types of 2-breaks (introduced as DCJ operations by Yancopoulos et al., 2005).
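A 2-break can be sketched as an edit of a perfect matching. In the sketch below, the function name `two_break`, the `cross` flag, and the vertex labels such as "1h" (head of block 1) and "2t" (tail of block 2) are illustrative assumptions, not notation taken from the slides.

```python
def two_break(matching, edge1, edge2, cross=False):
    """Replace two black edges by two other edges on the same 4 vertices.

    matching: iterable of 2-tuples of vertex labels (the black edges).
    Returns a new set of frozenset edges; the input is left untouched.
    """
    a, b = edge1
    c, d = edge2
    edges = {frozenset(e) for e in matching}
    for old in (frozenset((a, b)), frozenset((c, d))):
        if old not in edges:
            raise ValueError(f"edge {tuple(old)} is not in the matching")
        edges.remove(old)
    # The two removed edges can be re-paired in exactly two other ways.
    new_pairs = [(a, d), (b, c)] if cross else [(a, c), (b, d)]
    edges.update(frozenset(p) for p in new_pairs)
    return edges


# Black edges of the circular chromosome (+1 +2 +3 +4).
black = [("1h", "2t"), ("2h", "3t"), ("3h", "4t"), ("4h", "1t")]

# Re-pairing as (1h,2h),(2t,3t) gives (+1 -2 +3 +4): a reversal of block 2.
reversed_block2 = two_break(black, ("1h", "2t"), ("2h", "3t"))

# Re-pairing as (1h,3t),(2t,2h) cuts block 2 out as its own cycle: a fission.
fission = two_break(black, ("1h", "2t"), ("2h", "3t"), cross=True)
```

The same two edges yield either a reversal or a fission depending only on how the four endpoints are re-paired, which is the sense in which these operations are all 2-breaks.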

2-Break Distance
- The 2-break distance d_2(P,Q) is the minimum number of 2-breaks required to transform P into Q.
- In contrast to the genomic distance, the 2-break distance is easy to compute.

Two Genomes as Black-Red and Green-Red Cycles
(Figure: genomes P and Q on blocks a, b, c, d.)

Q-style representation of P
(Figure: genome P redrawn with its blocks in the order they have in Q.)


Breakpoint Graph: Superposition of Genome Graphs
Breakpoint graph G(P,Q) (Bafna & PP, FOCS 1994)
(Figure: the genome graphs of P and Q superimposed into G(P,Q).)

Breakpoint Graph: GLUING Red Edges with the Same Labels
Breakpoint graph G(P,Q) (Bafna & PP, FOCS 1994)
(Figure: the genome graphs of P and Q glued along red edges with the same labels.)

Black-Green Alternating Cycles
- Black and green edges form perfect matchings in the breakpoint graph. Therefore, these edges form a collection of black-green alternating cycles.
- Let cycles(P,Q) be the number of black-green cycles.
(Figure: example with cycles(P,Q) = 2.)

Rearrangements Change Breakpoint Graphs and cycles(P,Q)
Transforming genome P into genome Q by 2-breaks corresponds to transforming the breakpoint graph G(P,Q) into the identity breakpoint graph G(Q,Q).
(Figure: G(P,Q) with cycles(P,Q) = 2; G(P',Q) with cycles(P',Q) = 3; G(Q,Q), made of trivial cycles, with cycles(Q,Q) = 4 = #blocks.)

Sorting by 2-Breaks
2-breaks: P = P_0 → P_1 → ... → P_d = Q
G(P,Q) → G(P_1,Q) → ... → G(Q,Q)
cycles(P,Q) cycles → ... → #blocks cycles
The number of black-green cycles increases by #blocks - cycles(P,Q).
How much can each 2-break contribute to the increase in the number of cycles?

Each 2-Break Increases #Cycles by at Most 1
A 2-break:
- adds 2 new black edges and thus creates at most 2 new cycles (containing the two new black edges)
- removes 2 black edges and thus destroys at least 1 old cycle (containing the two old edges)
Change in the number of cycles: Δcycles ≤ 2 - 1 = 1.

2-Break Distance
- Any 2-break increases the number of cycles by at most one (Δcycles ≤ 1)
- Any non-trivial cycle can be split into two cycles with a 2-break (Δcycles = 1)
- Every sorting by 2-breaks must increase the number of cycles by #blocks - cycles(P,Q)
- The 2-break distance between genomes P and Q: d_2(P,Q) = #blocks - cycles(P,Q) (cp. Yancopoulos et al., 2005, Bergeron et al., 2006)
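The formula d_2(P,Q) = #blocks - cycles(P,Q) can be evaluated directly by walking the black-green alternating cycles. The following is a minimal sketch under stated assumptions: genomes are lists of circular chromosomes, each a list of signed integers, and block ends are encoded as `(block, 't')` and `(block, 'h')`. All function names are hypothetical.

```python
def adjacencies(genome):
    """Map every block end to its neighbor across an adjacency edge."""
    adj = {}
    for chrom in genome:
        # Pair each block with its successor, wrapping around the circle.
        for x, y in zip(chrom, chrom[1:] + chrom[:1]):
            left = (abs(x), 'h' if x > 0 else 't')   # end where we leave x
            right = (abs(y), 't' if y > 0 else 'h')  # end where we enter y
            adj[left] = right
            adj[right] = left
    return adj


def cycle_sizes(p, q):
    """Black-edge count of each black-green alternating cycle of G(P,Q)."""
    black, green = adjacencies(p), adjacencies(q)
    if black.keys() != green.keys():
        raise ValueError("genomes must be built from the same blocks")
    seen, sizes = set(), []
    for start in black:
        if start in seen:
            continue
        size, v = 0, start
        while v not in seen:
            seen.add(v)
            w = black[v]   # follow one black edge (genome P)
            seen.add(w)
            v = green[w]   # then one green edge (genome Q)
            size += 1
        sizes.append(size)
    return sizes


def two_break_distance(p, q):
    """d_2(P,Q) = #blocks - cycles(P,Q)."""
    blocks = sum(len(chrom) for chrom in q)
    return blocks - len(cycle_sizes(p, q))


q = [[1, 2, 3, 4]]
print(two_break_distance([[1, -2, 3, 4]], q))    # one reversal away
print(two_break_distance([[1, 2], [3, 4]], q))   # one fusion away
```

Returning the cycle sizes rather than only their count is a design choice: the parity of each cycle is what the 3-break distance later depends on.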

Complex Rearrangements: Transpositions
- Sorting by Transpositions Problem: find the shortest sequence of transpositions transforming one genome into another.
- The first 1.5-approximation algorithm was given by Bafna and P.P., SODA 1995. The best known result: the 1.375-approximation algorithm of Elias and Hartman, 2005.
- The complexity status remains unknown.

Transpositions
(Figure: a transposition on blocks a, b, c, d.)
Transpositions cut off a segment of one chromosome and insert it at some position in the same or another chromosome.

Transpositions Are 3-Breaks
(Figure: a 3-break on blocks a, b, c, d.)
- A 3-break replaces any triple of black edges with another triple forming a matching on the same 6 vertices.
- Transpositions are 3-breaks.

Sorting by 3-Breaks
3-breaks: P = P_0 → P_1 → ... → P_d = Q
G(P,Q) → G(P_1,Q) → ... → G(Q,Q)
cycles(P,Q) cycles → ... → #blocks cycles
The number of black-green cycles increases by #blocks - cycles(P,Q).
How much can each 3-break contribute to this increase?

Each 3-Break Increases #Cycles by at Most 2
A 3-break:
- adds 3 new black edges and thus creates at most 3 new cycles (containing the three new black edges)
- removes 3 black edges and thus destroys at least 1 old cycle (containing the three old edges)
Change in the number of cycles: Δcycles ≤ 3 - 1 = 2.

3-Break Distance
- Any 3-break increases the number of cycles by at most TWO (Δcycles ≤ 2)
- Any non-trivial cycle can be split into three cycles with a 3-break (Δcycles = 2)
- Every sorting by 3-breaks must increase the number of cycles by #blocks - cycles(P,Q)
- The 3-break distance between genomes P and Q: d_3(P,Q) = (#blocks - cycles(P,Q)) / 2

3-Break Distance
- Any 3-break increases the number of cycles by at most TWO (Δcycles ≤ 2)
- Any non-trivial cycle can be split into three cycles with a 3-break (Δcycles = 2) [WRONG STATEMENT]
- Every sorting by 3-breaks must increase the number of cycles by #blocks - cycles(P,Q)
- The 3-break distance between genomes P and Q: d_3(P,Q) = (#blocks - cycles(P,Q)) / 2

3-Break Distance: Focus on Odd Cycles
- A 3-break can increase the number of odd cycles (i.e., cycles with an odd number of black edges) by at most 2 (Δcycles_odd ≤ 2)
- A non-trivial odd cycle can be split into three odd cycles with a 3-break (Δcycles_odd = 2)
- An even cycle can be split into two odd cycles with a 3-break (Δcycles_odd = 2)
- The 3-break distance between genomes P and Q is: d_3(P,Q) = (#blocks - cycles_odd(P,Q)) / 2
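Both distance formulas depend only on the sizes of the black-green cycles, so they can be compared side by side. The sketch below assumes the input is a list giving the number of black edges in each cycle of the breakpoint graph; the function name `break_distances` is hypothetical.

```python
def break_distances(cycle_sizes):
    """Return (d_2, d_3) from the black-edge counts of the alternating cycles.

    d_2 = #blocks - cycles          (all cycles count)
    d_3 = (#blocks - cycles_odd)/2  (only odd cycles count)
    """
    blocks = sum(cycle_sizes)  # every block contributes exactly one black edge
    odd = sum(1 for size in cycle_sizes if size % 2 == 1)
    d2 = blocks - len(cycle_sizes)
    # blocks and odd always share parity, so the division is exact.
    d3, remainder = divmod(blocks - odd, 2)
    assert remainder == 0
    return d2, d3


for sizes in ([2, 1, 1], [3], [4], [2]):
    print(sizes, break_distances(sizes))
```

A single cycle with two black edges shows why counting all cycles fails for 3-breaks: (#blocks - cycles)/2 would give 1/2, whereas the odd-cycle formula gives the integer 1.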

Multi-Break Rearrangements
- A k-break rearrangement operation makes k breaks in a genome and glues the resulting pieces in a new order.
- Rearrangements are rare evolutionary events, and biologists believe that k-break rearrangements are unlikely for k>3 and relatively rare for k=3 (at least in mammalian evolution).
- Also, in radiation biology, chromosome aberrations for k>2 (indicative of chromosome damage rather than evolutionary viable variations) are more common, e.g., complex rearrangements in irradiated human lymphocytes (Sachs et al., 2004; Levy et al., 2004).

Polynomial Algorithm for Multi-Break Rearrangements
- The formulas for the k-break distance become complex as k grows (without an obvious pattern): the formula for d_20(P,Q) contains over 1,500 terms!
- Alekseyev & PP (SODA 07, Theoretical Computer Science 08): exact formulas for the k-break distance as well as a linear-time algorithm for computing it.

Where Do We Go From Here?
- Ancestral Genome Reconstruction
- Breakpoint Re-use Analysis
- Genome Halving Problem (skip)

Breakpoint Re-use Analysis
Algorithmic Problem: Searching for Rearrangement Hotspots (Fragile Regions) in Human Genome

Random vs. Fragile Breakage Debate: Complex Rearrangements
- PP & Tesler, PNAS 2003, argued that every evolutionary scenario for transforming Mouse into Human genome with reversals and translocations must result in a large number of breakpoint re-uses, a contradiction to the RBM.
- Sankoff, PLoS CB 2006: "We cannot infer whether mutually randomized synteny block orderings derived from two divergent genomes were created... through processes other than reversals and translocations." (read: transpositions or 3-breaks)

Random Breakage Theory Re-Re-Re-Re-Re-Visited
- Ohno, 1970, Nadeau & Taylor, 1984 introduced RBM
- Pevzner & Tesler, 2003 (PNAS) argued against RBM
- Sankoff & Trinh, 2004 (RECOMB, JCB) argued against Pevzner & Tesler, 2003 arguments against RBM
- Peng et al., 2006 (PLOS CB) argued against Sankoff & Trinh, 2004 arguments against Pevzner & Tesler, 2003 arguments against RBM
- Sankoff, 2006 (PLOS CB) acknowledged an error in Sankoff & Trinh, 2004 but came up with a new argument against Pevzner and Tesler, 2003 arguments against RBM
- Alekseyev and Pevzner, 2007 (PLOS CB) argued against Sankoff, 2006 new argument against Pevzner & Tesler, 2003 arguments against RBM


Are There Rearrangement Hotspots in Human Genome?
Theorem. Yes.
Proof:
- The formula for the 3-break distance implies that there were at least d_3(Human,Mouse) = 139 rearrangements between human and mouse (including transpositions).
- Every rearrangement creates up to 3 breakpoints.
- If there were no breakpoint re-use, then after 139 rearrangements we may see 139*3 = 417 breakpoints.
- But there are only 281 breakpoints between Human and Mouse.
- Is 417 larger than 281? Yes, 417 >> 281!


Transforming Human Genome into Mouse Genome by 3-Breaks
3-breaks: Human = P_0 → P_1 → ... → P_d = Mouse, where d = d_3(Human,Mouse) = 139
- Our proof assumed that each 3-break makes 3 breakages in a genome, so the total number of breakages made in this transformation is 3 * 139 = 417.
- Oops! The transformation may include 2-breaks (as a particular case of 3-breaks). If every 3-break were a 2-break, then the total number of breakages would be only 2 * 139 = 278 < 281, in which case there could be no breakpoint re-uses at all.

Minimizing the Number of Breakages
Problem. Given genomes P and Q, find a series of k-breaks transforming P into Q and making the smallest number of breakages.
Theorem. Any series of k-breaks transforming P into Q makes at least d_k(P,Q) + d_2(P,Q) breakages.
Theorem. There exists a series of d_3(P,Q) 3-breaks transforming P into Q and making d_3(P,Q) + d_2(P,Q) breakages.
Example: d_2(Human,Mouse) = 246 and d_3(Human,Mouse) = 139, so the minimum number of breakages = 246 + 139 = 385.

Are There Rearrangement Hotspots in Human Genome?
Theorem. Yes.
Proof:
- The formula for the 3-break distance implies that there were at least d_3(Human,Mouse) = 139 rearrangements between human and mouse (including transpositions).
- Every rearrangement creates up to 3 breakpoints.
- If there were no breakpoint re-use, then after 139 rearrangements we SHOULD SEE AT LEAST 385 breakpoints (rather than 417 as before).
- But there are only 281 breakpoints between Human and Mouse.
- Yes, 385 >> 281!

Breakpoint Re-uses between Human and Mouse Genomes
- Any transformation of Mouse into Human genome with 3-breaks requires at least 385 breakages, while there are 281 breakpoints.
- So, there are at least 385 - 281 = 104 breakpoint re-uses (re-use rate 1.37), which is significantly higher than statistically expected in the RBM:
  - Mean = 1.23
  - Standard deviation = 0.02
  - Maximum breakpoint re-use rate = 1.33 (observed once in 100,000 simulations)
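The arithmetic behind the re-use argument can be checked in a few lines. The figures below are the ones quoted on the slides; the variable names are illustrative, and the final comparison assumes that the quoted mean and standard deviation describe the re-use rate in the RBM simulations.

```python
d2 = 246           # d_2(Human, Mouse), from the slides
d3 = 139           # d_3(Human, Mouse), from the slides
breakpoints = 281  # observed Human-Mouse breakpoints, from the slides

naive_breakages = 3 * d3    # every 3-break makes 3 breakages
all_two_breaks = 2 * d3     # every step is in fact only a 2-break
min_breakages = d3 + d2     # lower bound d_3 + d_2 from the theorem

reuses = min_breakages - breakpoints
reuse_rate = min_breakages / breakpoints

# Assumption: 1.23 and 0.02 are the mean and standard deviation of the
# re-use rate observed in the RBM simulations mentioned on the slide.
rbm_mean, rbm_sd = 1.23, 0.02
sds_above_mean = (reuse_rate - rbm_mean) / rbm_sd

print(naive_breakages, all_two_breaks, min_breakages, reuses)
print(f"re-use rate = {reuse_rate:.2f}, "
      f"{sds_above_mean:.1f} standard deviations above the RBM mean")
```

The naive bound of 417 is not safe because 278 < 281, whereas the bound of 385 from the theorem still exceeds the 281 observed breakpoints.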

Are there rearrangement hotspots in human genome?
- We showed that even if 3-break rearrangements were frequent, the argument against the RBM still stands (Alekseyev & PP, PLoS Comput. Biol. 2007).
- We don't believe that 3-breaks are frequent, but even if they were, we proved that there exist fragile regions in the human genome.
- It is time to answer a more difficult question: Where are the rearrangement hotspots in the human genome? (The detailed analysis appeared in Alekseyev and PP, Genome Biol., Dec. 1, 2010.)


Ancestral Genome Reconstruction
Algorithmic Problem: Ancestral Genome Reconstruction and Multiple Breakpoint Graphs

Ancestral Genomes Reconstruction
- Given a set of genomes, reconstruct the genomes of their common ancestors.

Algorithms for Ancestral Genomes Reconstruction
- GRAPPA: Tang, Moret, Warnow et al. (2001)
- MGR: Bourque and PP (Genome Res. 2002)
- InferCARs: Ma et al. (Genome Res. 2006)
- EMRAE: Zhao and Bourque (Genome Res. 2007)
- MGRA: Alekseyev and PP (Genome Res. 2009)


Ancestral Genomes Reconstruction Problem (with a known tree)
- Input: a set of k genomes and a phylogenetic tree T
- Output: genomes at the internal nodes of the tree T
- Objective: minimize the total sum of the 2-break distances along the branches of T
- NP-complete in the simplest case of k=3.
- What makes it hard? BREAKPOINT RE-USE

Breakpoints Are Footprints of Rearrangements on the Ground of Genomes
- Input: a set of k genomes and a phylogenetic tree T
- Output: genomes at the internal nodes of the phylogenetic tree T
- Objective function: the sum of genomic distances along the branches of T (assuming the most parsimonious rearrangement scenario)
- NP-complete in the simplest case of k=3.
- What makes it hard? BREAKPOINT RE-USES (resulting in "messy footprints")!
Ancestral Genome Reconstruction of MANY genomes (i.e., for large k) may be easier to solve.

How to Construct the Breakpoint Graph for Multiple Genomes?
(Figure: genomes P, Q, and R on blocks a, b, c, d.)

Constructing Multiple Breakpoint Graph: rearranging P in the Q order
(Figure: genomes P, Q, and R on blocks a, b, c, d, with P redrawn in the Q order.)

Constructing Multiple Breakpoint Graph: rearranging R in the Q order
(Figure: genomes P, Q, and R on blocks a, b, c, d, with R redrawn in the Q order.)

Multiple Breakpoint Graph: Still Gluing Red Edges with the Same Labels
(Figure: genomes P, Q, and R and the resulting multiple breakpoint graph G(P,Q,R).)

Multiple Breakpoint Graph of 6 Mammalian Genomes
Multiple breakpoint graph G(M,R,D,Q,H,C) of the Mouse, Rat, Dog, macaque, Human, and Chimpanzee genomes.

Two Genomes: Two Ways of Sorting by 2-Breaks
Transforming P into Q with black 2-breaks:
P = P_0 → P_1 → ... → P_{d-1} → P_d = Q
G(P,Q) → G(P_1,Q) → ... → G(P_d,Q) = G(Q,Q)
Transforming Q into P with green 2-breaks:
Q = Q_0 → Q_1 → ... → Q_d = P
G(P,Q) → G(P,Q_1) → ... → G(P,Q_d) = G(P,P)
Let's make a black-green chimeric transformation.

Transforming G(P,Q) into (an unknown!) G(X,X) rather than into (a known) G(Q,Q) as before
• Let X be any genome on a path from P to Q: $P = P_0 \to P_1 \to \dots \to P_m = X = Q_{d-m} \leftarrow \dots \leftarrow Q_1 \leftarrow Q_0 = Q$
• Sorting by 2-breaks is equivalent to finding a shortest transformation of G(P,Q) into an identity breakpoint graph G(X,X) of an a priori unknown genome X: $G(P_0,Q_0) \to G(P_1,Q_0) \to G(P_1,Q_1) \to G(P_1,Q_2) \to \dots \to G(X,X)$
• The black and green 2-breaks may arbitrarily alternate.
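The chimeric transformation can be illustrated with a minimal sketch. It assumes genomes made only of circular chromosomes, written as lists of signed integers; the function names (`adjacencies`, `two_break_toward`, `chimeric_sort`) and the strict black/green alternation are illustrative choices, not part of the talk.

```python
def adjacencies(genome):
    """Map every block end to its neighbour (a perfect matching).

    genome: list of circular chromosomes, each a list of signed ints.
    Block ends are (block, "t") for the tail and (block, "h") for the head.
    """
    adj = {}
    for chrom in genome:
        for a, b in zip(chrom, chrom[1:] + chrom[:1]):
            u = (abs(a), "h" if a > 0 else "t")  # end where we leave block a
            v = (abs(b), "t" if b > 0 else "h")  # end where we enter block b
            adj[u], adj[v] = v, u
    return adj


def two_break_toward(src, dst):
    """Apply one 2-break to matching src that creates an adjacency of dst."""
    for u, v in dst.items():
        if src[u] != v:
            x, y = src[u], src[v]
            src[u], src[v] = v, u  # new adjacency (u, v), shared with dst
            src[x], src[y] = y, x  # the two freed ends are joined
            return True
    return False


def chimeric_sort(P, Q):
    """Alternate black (on P) and green (on Q) 2-breaks until both meet at X."""
    black, green = adjacencies(P), adjacencies(Q)
    steps, use_black = 0, True
    while black != green:
        if use_black:
            two_break_toward(black, green)
        else:
            two_break_toward(green, black)
        steps += 1
        use_black = not use_black
    return black, steps  # adjacencies of the meeting genome X, number of 2-breaks


X, steps = chimeric_sort([[1, 3, 2]], [[1, 2, 3]])
print(steps)
```

Each 2-break in this sketch splits off one trivial cycle, so the number of steps does not depend on how the black and green moves are interleaved; only the meeting genome X does.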

From 2 Genomes To Multiple Genomes
• We find a transformation of the multiple breakpoint graph $G(P_1,P_2,\dots,P_k)$ into an (a priori unknown!) identity multiple breakpoint graph $G(X,X,\dots,X)$: $G(P_1,P_2,\dots,P_k) \to \dots \to G(X,X,\dots,X)$
• The evolutionary tree T defines groups of genomes (to which the same 2-breaks may be applied simultaneously).
• For example, the groups {M, R} (Mouse and Rat) and {Q, H, C} (macaque, Human, Chimpanzee) correspond to branches of T, while the groups {M, C} and {R, D} do not.
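To make "groups that correspond to branches of T" concrete, here is a small sketch. The nested-tuple tree below is an assumed topology for the six genomes, chosen only so that it agrees with the example groups named on the slide; `leaf_sets` and `is_branch_group` are hypothetical helper names.

```python
def leaf_sets(tree):
    """Return (all leaves below tree, list of the leaf set of every subtree)."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
        return leaves, [leaves]
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = leaf_sets(child)
        leaves |= child_leaves
        found += child_found
    found.append(leaves)
    return leaves, found


def is_branch_group(tree, group):
    """A group corresponds to a branch if it, or its complement, is a subtree."""
    all_leaves, found = leaf_sets(tree)
    group = frozenset(group)
    return group in found or (all_leaves - group) in found


# Assumed topology: ((Mouse, Rat), (Dog, (macaque, (Human, Chimpanzee))))
T = (("M", "R"), ("D", ("Q", ("H", "C"))))

for group in [{"M", "R"}, {"Q", "H", "C"}, {"M", "C"}, {"R", "D"}]:
    print(sorted(group), is_branch_group(T, group))
```

Removing a branch splits the leaves into two sides, which is why the check accepts either a group or its complement.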

When All Reliable 2-Breaks Are Identified and Undone
• The multiple breakpoint graph is reduced dramatically!
• The remaining (non-trivial) components can be processed manually in a case-by-case fashion.

Reconstruction Of The Ancestral Genomes
• The resulting identity breakpoint graph $G(X,X,\dots,X)$ defines its underlying genome X.
• The reverse transformation is applied to the genome X to transform it into each of the original genomes $P_1, P_2, \dots, P_k$.
• Since X passes through all internal nodes of T during this transformation, it defines the ancestral genomes at these nodes.

Reconstructed X Chromosomes
• The Mouse, Rat, Dog, macaque, Human, and Chimpanzee genomes and their reconstructed ancestors.
[Figure: reconstructed X chromosomes]

If The Evolutionary Tree Is Not Known
• For the set of 7 mammalian genomes (Mouse, Rat, Dog, macaque, Human, Chimpanzee, and Opossum), the evolutionary tree T is being debated.
• Depending on the primate-rodent-carnivore split, three topologies are possible (only two of them are viable).
• However, these three topologies share many common branches in T (confident branches). We can restrict the transformation to such branches only, in order to simplify the breakpoint graph without destroying evidence for any of the topologies.

Rearrangement Evidence For The Primate-Carnivore Split
• Each of these three topologies has a unique branch that supports either the blue (primate-rodent), or green (rodent-carnivore), or red (primate-carnivore) hypothesis. We analyze the rearrangements supporting each of these branches.
[Figure: the three candidate topologies on M, H, D, and O; the figure carries the counts 11, 16, and 24]
• Rearrangement analysis supports the primate-carnivore split.

Where Do We Go From Here?
• Ancestral Genome Reconstruction
• Breakpoint Re-use Analysis (Skip)
• Genome Halving Problem

Genome Halving Problem
Algorithmic Problem: Genome Halving Problem

WGD and Genome Halving Problem
• Whole Genome Duplication (WGD) of a genome R results in a perfect duplicated genome R+R where each chromosome is doubled.
• The genome R+R is subjected to rearrangements that result in a duplicated genome P.
• Genome Halving Problem: Given a duplicated genome P, find a perfect duplicated genome R+R minimizing the rearrangement distance between R+R and P.

Genome Halving Problem: Previous Results
• An algorithm for reconstructing a pre-duplicated genome was proposed in a series of papers by El-Mabrouk and Sankoff (culminating in SIAM J. Comp., 2004).
• The proof of the Genome Halving Theorem is rather technical (it spans over 30 pages).

Genome Halving Theorem
• Found an error in the original Genome Halving Theorem for unichromosomal genomes and proved a theorem that adequately deals with all genomes. (Alekseyev & PP, SIAM J. Comp. 2007)
• Suggested a new (short) proof of the Genome Halving Theorem; our proof is 5 pages long. (Alekseyev & PP, IEEE Bioinformatics 2007)
• Proved the Genome Halving Theorem for the harder case of transposition-like operations. (Alekseyev & PP, SODA 2007)
• Introduced the notion of the contracted breakpoint graph that makes difficult problems (like the Genome Halving Problem) more transparent.

2-Break Distance Between Duplicated Genomes
Difficulty: The notion of the breakpoint graph is not defined for duplicated genomes (the first a in P can correspond to either the first or the second a in Q).

Idea: Establish a correspondence between the genes in P and Q and re-label the corresponding blocks (e.g., as a1, a2, b1, b2).

[Figure: genomes P and Q with re-labeled block copies]
Treat the labeled genomes as non-duplicated, construct the breakpoint graph, and compute the 2-break distance between them.
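Once the copies are labeled, the computation is the ordinary one: count the black-green cycles and use distance = #blocks - cycles (the same formula the talk later writes for d_2(P, R+R)). The sketch below assumes circular chromosomes given as lists of signed integers, one integer per labeled block; the function names are illustrative.

```python
def adjacencies(genome):
    """Map every block end to its neighbour in a genome of circular chromosomes."""
    adj = {}
    for chrom in genome:
        for a, b in zip(chrom, chrom[1:] + chrom[:1]):
            u = (abs(a), "h" if a > 0 else "t")  # end where we leave block a
            v = (abs(b), "t" if b > 0 else "h")  # end where we enter block b
            adj[u], adj[v] = v, u
    return adj


def count_cycles(black, green):
    """Count alternating black-green cycles of the breakpoint graph."""
    seen, cycles = set(), 0
    for start in black:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:
            seen.add(v)
            w = black[v]   # follow a black edge (genome P)
            seen.add(w)
            v = green[w]   # follow a green edge (genome Q)
    return cycles


def two_break_distance(P, Q):
    """2-break distance between non-duplicated genomes: #blocks - #cycles."""
    black, green = adjacencies(P), adjacencies(Q)
    blocks = len(black) // 2  # every block contributes two ends
    return blocks - count_cycles(black, green)


print(two_break_distance([[1, -2, 3]], [[1, 2, 3]]))
```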

Different Labelings Result in Different Breakpoint Graphs
• There are 2 different labelings of the copies of each block.
• One of these labelings (corresponding to a breakpoint graph with the maximum number of cycles) is an optimal labeling.
• Trying all possible labelings takes exponential time: $2^{\#\text{blocks}}$ invocations of the 2-break distance algorithm.
• Labeling Problem. Construct an optimal labeling that results in the maximum number of cycles in the breakpoint graph (hence, the smallest 2-break distance).
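The exponential search over labelings can be sketched directly. This is the brute-force approach the slide warns about, not an efficient algorithm. It assumes circular chromosomes in which every block occurs exactly twice, keeps the labeling of Q fixed, and tries both copy assignments for each block of P; all names are illustrative.

```python
from itertools import product


def label(genome, swap):
    """Attach a copy index (0 or 1) to each occurrence of a block.

    swap[block] == 1 exchanges the indices of the two copies of that block.
    """
    seen, out = {}, []
    for chrom in genome:
        row = []
        for g in chrom:
            block = abs(g)
            occurrence = seen.get(block, 0)  # 0 for the first copy met, 1 for the second
            seen[block] = occurrence + 1
            row.append((g > 0, (block, occurrence ^ swap.get(block, 0))))
        out.append(row)
    return out


def adjacencies(labeled):
    adj = {}
    for chrom in labeled:
        for (fwd_a, a), (fwd_b, b) in zip(chrom, chrom[1:] + chrom[:1]):
            u = (a, "h" if fwd_a else "t")
            v = (b, "t" if fwd_b else "h")
            adj[u], adj[v] = v, u
    return adj


def count_cycles(black, green):
    seen, cycles = set(), 0
    for start in black:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:
            seen.add(v)
            w = black[v]
            seen.add(w)
            v = green[w]
    return cycles


def labeled_distance(P, Q, swap):
    black, green = adjacencies(label(P, swap)), adjacencies(label(Q, {}))
    return len(black) // 2 - count_cycles(black, green)


def best_labeling_distance(P, Q):
    """Try all 2^#blocks labelings of P and keep the smallest distance."""
    blocks = sorted({abs(g) for chrom in P for g in chrom})
    return min(
        labeled_distance(P, Q, dict(zip(blocks, bits)))
        for bits in product((0, 1), repeat=len(blocks))
    )


P, Q = [[1, -2], [1, 2]], [[1, 2], [1, -2]]
print(labeled_distance(P, Q, {}), best_labeling_distance(P, Q))
```

In this example the labeling by order of occurrence is not optimal, while swapping the copies of both blocks makes the two labeled genomes identical.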

Constructing Breakpoint Graph of Duplicated Genomes
[Figure: genomes P and Q]

Contracted Genome Graphs: GLUING Red Edges
[Figure: genomes P and Q]

Contracted Breakpoint Graph: Gluing Red Edges Yet Again
[Figure: genomes P and Q glued into the contracted breakpoint graph G'(P,Q)]

Contracted Breakpoint Graph of Duplicated Genomes
• Edges of 3 colors: black, green, and red.
• Red edges form a matching.
• Black edges form black cycles.
• Green edges form green cycles.

What is the Relationship Between the Breakpoint Graph and the Contracted Breakpoint Graph?

Every Breakpoint Graph Induces a Cycle Decomposition of the Contracted Breakpoint Graph
Why do we care? Because an optimal labeling corresponds to a maximum cycle decomposition.

Maximum Cycle Decomposition Problem
Open Problem 1: Given a contracted breakpoint graph, find its maximum black-green cycle decomposition.

Labeling Problem
Open Problem 2: Find a breakpoint graph that induces a given cycle decomposition of the contracted breakpoint graph.

Computing 2-Break Distance Between Duplicated Genomes: Two HARD Problems
[Figure labels: contracted breakpoint graph; maximum cycle decomposition; breakpoint graph inducing max cycle decomposition]
These problems are difficult and we do not know how to solve them. However, we solved them in the case of the Genome Halving Problem.

2-Break Genome Halving Problem
2-Break Genome Halving Problem: Given a duplicated genome P, find a perfect duplicated genome R+R minimizing the 2-break distance:
$d_2(P, R+R) = \#\text{blocks} - \text{cycles}(P, R+R)$
Minimizing $d_2(P, R+R)$ is equivalent to finding a perfect duplicated genome R+R and a labeling of P and R+R that maximizes $\text{cycles}(P, R+R)$.

Contracted Breakpoint Graph of a Perfect Duplicated Genome
[Figure: genomes P and R+R and the contracted breakpoint graph G'(P,R+R)]

Contracted Breakpoint Graph of a Perfect Duplicated Genome has a Special Structure
• Red edges form a matching.
• Black edges form black cycles.
• Green edges form cycles in the general case, but now (for R+R) these cycles are formed by parallel (double) green edges forming a matching.
[Figure: G'(P,R+R)]
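The special structure can be checked on a toy example. The sketch below assumes that R+R consists of two separate copies of every circular chromosome of R (one possible reading of "each chromosome is doubled") and that contraction simply lets both copies of a block share the same two ends; `edges`, `degree`, and `contracted_graph` are hypothetical names.

```python
from collections import Counter


def edges(genome):
    """Multiset of adjacencies; both copies of a block share the same two ends."""
    bag = Counter()
    for chrom in genome:
        for a, b in zip(chrom, chrom[1:] + chrom[:1]):
            u = (abs(a), "h" if a > 0 else "t")
            v = (abs(b), "t" if b > 0 else "h")
            bag[tuple(sorted((u, v)))] += 1
    return bag


def degree(bag):
    """Number of edge ends at every vertex (a loop counts twice)."""
    deg = Counter()
    for (u, v), multiplicity in bag.items():
        deg[u] += multiplicity
        deg[v] += multiplicity
    return deg


def contracted_graph(P, R):
    """Black edges from P, green edges from R+R, red edges joining tail and head."""
    doubled = [list(c) for c in R] + [list(c) for c in R]  # assumed model of R+R
    blocks = sorted({abs(g) for chrom in R for g in chrom})
    red = [((b, "t"), (b, "h")) for b in blocks]
    return edges(P), edges(doubled), red


black, green, red = contracted_graph([[1, 2, 1, -2]], [[1, 2]])
print(dict(green))          # every green edge has multiplicity 2
print(dict(degree(black)))  # every vertex meets exactly two black edge ends
```

Because every green adjacency of R occurs twice, the green cycles collapse into double edges, and those double edges pair up the vertices as a matching.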

Finding the Best Contracted Breakpoint Graph for a Given Genome P

Blue Problem: Optimal Contracted Breakpoint Graph

Magenta Problem: Maximum Cycle Decomposition

Orange Problem: Labeling

Solutions to the Blue, Magenta, and Orange Problems are too complex to be presented in this talk (SIAM J. Comp. 2007).

Genome Halving Problem: What is Left Behind
(1) 2-Break Genome Halving Problem for Multichromosomal Genomes (partially covered in this talk)
(2) 2-Break Genome Halving Problem for Unichromosomal Genomes (more difficult)
(3) A Flaw in the El-Mabrouk-Sankoff Theorem for Unichromosomal Genomes
(4) Classification of Unichromosomal Genomes (fixing the flaw in the El-Mabrouk-Sankoff Theorem)
(5) 3-Break Genome Halving Problem

Acknowledgments
Max Alekseyev
David Sankoff
"If you are not criticized, you may not be doing much." Donald Rumsfeld
Glenn Tesler, Math. Dept., UCSD
Qian Peng