Common intervals of genomes. Mathieu Raffinot CNRS LIAFA


Context: comparative genomics
A set of genomes, partially or totally annotated.
Informative groups of genes or domains?
Ex: COG database

Many difficulties!
Biology:
What are two similar genes? What about alternative splicing?
When are two genes close (notion of distance)?
What is an interesting cluster? Basis: selection pressure keeps genes that work together close.
Computer science:
How to model clusters? Graphs / strings?
How to compute those clusters?
How to manage the sets of clusters and extract useful information?

One of the simplest models: genomes as strings of units, common intervals
Simplest case in this model: 2 genomes!
[Figure: two example chromosomes]
Common interval:
one interval on each chromosome
same set of genes in each interval
external bounds not in the set of genes


How many common intervals?
X first chromosome, X = x_1 x_2 .. x_n
Y second chromosome, Y = y_1 y_2 .. y_m
Common alphabet Σ, |Σ| <= max(|X|, |Y|)
fo(Y,i): the characters of y_i .. y_m, each with its rank in order of first occurrence.
In the example, C has rank 4 in fo(Y,1), rank 3 in fo(Y,2), rank 2 in fo(Y,3) and rank 1 in fo(Y,4); fo(Y,5) is a single character of rank 1.
Rank(Y,1)[·] = 3
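The fo and Rank tables can be sketched as follows. This is a minimal illustration, assuming that fo(Y,i) lists the characters of y_i .. y_m in order of first occurrence and that Rank(Y,i)[c] is the 1-based position of c in that list. The function names and the string "ABDCB" are hypothetical, chosen only so that C receives the ranks 4, 3, 2, 1 seen on the slide.

```python
def fo(y, i):
    """Characters of y[i..] (1-based i) in order of first occurrence."""
    seen = []
    for ch in y[i - 1:]:
        if ch not in seen:
            seen.append(ch)
    return seen


def rank(y, i):
    """Map each character of fo(y, i) to its 1-based rank in that list."""
    return {ch: r for r, ch in enumerate(fo(y, i), start=1)}


# Hypothetical example string, not taken from the slides.
Y = "ABDCB"
for i in range(1, len(Y) + 1):
    print(i, fo(Y, i), rank(Y, i))
```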

Int[k]
[Figure: Int[k] for k = 3, 2, 1 on the example chromosomes]
Y = y_1 y_2 .. y_m
fo(Y,1): two characters of rank 1 and 2, then C of rank 3
Rank(Y,2)[·] = 2

Int[k] are nested! They form a tree!
[Figure: nested Int[k] for k = 3, 2, 1]
=> n valid Int[k] at max!
=> nm common intervals at maximum
The bound is reached!

How to identify all of them? Two approaches
Direct computation (Didier)
+ O(nm), but needs Lowest Common Ancestor (otherwise O(nm log n))
+ No structure in the output!
+ Complexity does not depend on the input
+ No index
Fingerprint computation on a single string + index + merge after
+ O(n + |L_1| log n + m + |L_2| log m) (can be worse than Didier)
+ Structure in the output and possibility of searching for a fingerprint
+ Complexity does depend on the input
+ Keep the index for further computations

S = s_1 .. s_n string of length n
Alphabet Σ of size |Σ|, not fixed (possibly O(n))
Fingerprint f: set of characters of a substring s_i .. s_j
General problem: compute and represent the set of all fingerprints of S
Examples: two strings over four distinct characters, one with 10 fingerprints (4 singletons, 3 pairs, 2 triples, the full set) and one with 13 fingerprints (4 singletons, 5 pairs, 3 triples, the full set)
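The definition of a fingerprint can be made concrete with a brute-force enumeration. This is only a quadratic sketch for small inputs, not one of the algorithms discussed in the slides; the function name and the input "abcd" are hypothetical.

```python
def all_fingerprints(s):
    """Return the set of fingerprints (character sets) of all substrings of s."""
    fingerprints = set()
    for i in range(len(s)):
        seen = set()
        for j in range(i, len(s)):
            seen.add(s[j])                      # fingerprint of s[i..j]
            fingerprints.add(frozenset(seen))
    return fingerprints


# A string of four distinct characters has 10 fingerprints:
# 4 singletons, 3 pairs, 2 triples and the full set.
for f in sorted(all_fingerprints("abcd"), key=lambda f: (len(f), sorted(f))):
    print(sorted(f))
```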

Maximal location <i,j> of f:
f is the fingerprint of s_i .. s_j, s_{i-1} not in f, s_{j+1} not in f
+ Number of maximal locations: |L| <= n|Σ|
The bound is easily reached, but |L| is usually much less.
Σ_k = {a_1, a_2, .., a_k}
w_1 = a_1, w_k = w_{k-1} a_k w_{k-1}
w_2 = a_1 a_2 a_1, w_3 = (a_1 a_2 a_1) a_3 (a_1 a_2 a_1), ...
|w_k| . |Σ_k| = k . (2^k - 1)
|L_k| = 2^(k+1) - (k+2)
|L_k| = o(|w_k| . |Σ_k|)
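The count |L_k| = 2^(k+1) - (k+2) can be checked numerically on small k. The sketch below is a brute-force enumeration of maximal locations under the definition above, with 0-based inclusive positions and the symbols a_1 .. a_k represented by the integers 1 .. k; both function names are hypothetical.

```python
def w(k):
    """Build w_k over the symbols 1..k: w_1 = 1, w_k = w_{k-1} k w_{k-1}."""
    word = [1]
    for a in range(2, k + 1):
        word = word + [a] + word
    return word


def maximal_locations(s):
    """List all (i, j) such that s[i-1] and s[j+1], when defined, are not in set(s[i..j])."""
    n = len(s)
    locations = []
    for i in range(n):
        seen = set()
        for j in range(i, n):
            seen.add(s[j])
            if i > 0 and s[i - 1] in seen:
                break  # the left neighbour stays in the set for every larger j
            if j == n - 1 or s[j + 1] not in seen:
                locations.append((i, j))
    return locations


for k in range(1, 7):
    word = w(k)
    count = len(maximal_locations(word))
    # Compare with the closed form and with the general bound n * |alphabet|.
    print(k, len(word), count, 2 ** (k + 1) - (k + 2), k * len(word))
```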

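Naming technique
Σ = {a,b,c,d,e,f,g,h}, tree of height log|Σ| + 1 over the leaves a b c d e f g h
Example fingerprints: {·,·,e,f}, {·,·,e,f,g}, {·,·,e,g} (· = label lost, among a..d)
Names = {[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]}
Fingerprints = {[7],[9],[10]}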
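One way to picture the naming technique is a complete binary tree whose leaves say whether each character is present (1) or absent (0), and whose internal nodes receive a name determined by the pair of names of their children. The sketch below is a naive illustration under that assumption: it recomputes the whole tree for each set, whereas the algorithms in the slides only update the path affected by a changed character. The function names and the numbering of names are hypothetical.

```python
def make_namer(sigma_size):
    """Return a function mapping a set of character indices to a unique name."""
    size = 1
    while size < sigma_size:
        size *= 2                      # pad the alphabet to a power of two
    table = {}                         # (left name, right name) -> name

    def name_of(present):
        # Leaves are named 0 (absent) or 1 (present).
        level = [1 if i in present else 0 for i in range(size)]
        while len(level) > 1:
            next_level = []
            for pair in zip(level[0::2], level[1::2]):
                if pair not in table:
                    table[pair] = len(table) + 2   # names 0 and 1 are reserved
                next_level.append(table[pair])
            level = next_level
        return level[0]

    return name_of


# Characters a..h are mapped to indices 0..7.
name_of = make_namer(8)
print(name_of({0, 1, 4, 5}), name_of({0, 1, 4, 5, 6}), name_of({0, 1, 4, 6}))
```

Two sets receive the same root name exactly when they are equal, because each name can be decoded back into its pair of child names down to the leaves.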

Amir, Apostolico, Landau, Satta 2003
k distinct characters
Changing a character: O(log|Σ| log n) (n new names maximum by level)
One iteration: n log|Σ| log n
|Σ| iterations: n|Σ| log|Σ| log n
Important: a different set of names for each iteration
k = 2

Tsur 2005
List of fingerprints: {·}, {·,·}, {·,·,·}, {·,·}, {·,·,·} (· = label lost)
{([0],[1]), ·} {([1],[1]), ·} {([1],[0]), ·} {([1],[0]), ·} {([1],[1]), ·}
List of changes:
{([0],[0]), ·} {([0],[0]), ·} {([0],[1]), ·} {([1],[1]), ·} {([1],[0]), ·} {([1],[0]), ·} {([1],[1]), ·}
Radix sort on the pairs + unique -> new names

Tsur 2005
List of changes:
{([0],[0]), ·} {([0],[0]), ·} {([0],[1]), ·} {([1],[1]), ·} {([1],[0]), ·} {([1],[0]), ·} {([1],[1]), ·}
[2] -> ([0],[0])   [3] -> ([0],[1])   [4] -> ([1],[0])   [5] -> ([1],[1])
New list: {[2], ·} {[2], ·} {[3], ·} {[5], ·} {[4], ·} {[4], ·} {[5], ·}
{([2],[2]),C} {([2],[3]),C}
New list: {([2],[2]),C} {([2],[3]),C} {([2],[5]),C} {([4],[5]),C} {([4],[4]),C} {([5],[4]),C}
Radix sort, ...
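The renaming step on this slide (radix sort on the pairs, then one new name per distinct pair) can be sketched as below. This is a minimal illustration, assuming every name in the input pairs is an integer smaller than `bound` and that new names are handed out from `bound` upward in sorted order; the function names are hypothetical.

```python
def radix_sort_pairs(pairs, bound):
    """Return the indices of pairs in lexicographic order (two stable bucket passes)."""
    order = list(range(len(pairs)))
    for component in (1, 0):           # least significant component first
        buckets = [[] for _ in range(bound)]
        for idx in order:
            buckets[pairs[idx][component]].append(idx)
        order = [idx for bucket in buckets for idx in bucket]
    return order


def rename_pairs(pairs, bound):
    """Give equal pairs the same new name, starting at `bound`."""
    names = [None] * len(pairs)
    next_name = bound
    previous = None
    for idx in radix_sort_pairs(pairs, bound):
        if pairs[idx] != previous:     # first occurrence of a distinct pair
            current = next_name
            next_name += 1
            previous = pairs[idx]
        names[idx] = current
    return names


# The list of changes from the slide, with names [0] and [1].
changes = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 0), (1, 0), (1, 1)]
print(rename_pairs(changes, 2))
```

On the slide's list of changes this yields the new list [2], [2], [3], [5], [4], [4], [5].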

Tsur 2005
Radix sort: O(n) (bounded integers)
One iteration: n log|Σ|
No more name search!
|Σ| iterations: n|Σ| log|Σ|
Problems:
does not depend on |L|
distinct names at each iteration

Our approach (2006)
Simple sequence: no repeated character
[Figure: example sequence with lfo(i), shown for lfo(4) and lfo(2)]
Concatenate # to the sequence
Bijection: L <-> proper prefixes of lfo(i)
Compute all lfo(i) of S#

Our approach (2006)
How to calculate all lfo(i)?
[Figure: computation of lfo(i) on the example sequence terminated by #]

Our approach (2006)
Naming all proper prefixes of lfo(i)
n lists: Tsur's algorithm
Common names
Simple sequence: O(|L| log|Σ|)
General sequence: O(n + |L| log|Σ|)
|L| <= n|Σ|
Faster than or as fast as that of Tsur

Our approach (2006)
Properties and operations on our names:
a unique set of names
compute the LCP of two fingerprints in log|Σ|
names sorted by lexicographic order of fingerprints

Fingerprint trie
Chan et al., ESA 2007
O(|F|) space
O(|F| log|Σ|) time
Search in O(|f| log(|Σ|/|f|))

Back to common intervals:
1) Build the tree for the first sequence: O(n + |L_1| log|Σ|)
2) Build the tree for the second sequence: O(m + |L_2| log|Σ|)
3) Merge the two trees!
Complexity: O((n+m) + (|L_1| + |L_2|) log|Σ|) time.
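The three steps above can be imitated on small inputs by indexing the maximal locations of each sequence by fingerprint and intersecting the two indexes. This sketch swaps the fingerprint trees and their merge for plain hash tables and a brute-force quadratic scan, so it illustrates the idea only and not the stated complexity. Positions are 0-based and inclusive, and all names are hypothetical.

```python
from collections import defaultdict


def maximal_locations(s):
    """Map each fingerprint of s to the list of its maximal locations (i, j)."""
    index = defaultdict(list)
    n = len(s)
    for i in range(n):
        seen = set()
        for j in range(i, n):
            seen.add(s[j])
            if i > 0 and s[i - 1] in seen:
                break  # left neighbour is in the set: not maximal, now or later
            if j == n - 1 or s[j + 1] not in seen:
                index[frozenset(seen)].append((i, j))
    return index


def common_intervals(x, y):
    """Pairs of maximal locations of x and y that share the same fingerprint."""
    index_x = maximal_locations(x)     # step 1: index the first sequence
    index_y = maximal_locations(y)     # step 2: index the second sequence
    return [(f, loc_x, loc_y)          # step 3: "merge" by intersecting keys
            for f in index_x.keys() & index_y.keys()
            for loc_x in index_x[f]
            for loc_y in index_y[f]]


# Hypothetical toy chromosomes, one character per gene.
for f, loc_x, loc_y in common_intervals("abcde", "cbaed"):
    print(sorted(f), loc_x, loc_y)
```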

Open problems
Memory space reduction
Order?
Approximate fingerprint
Distance by fingerprints
2D fingerprints