Global alignment. Genome Rearrangements Finding preserved genes. Lecture 18


Computational Biology
Lecture 18

Genome Rearrangements: Finding preserved genes

We have seen before how to rearrange a genome to obtain another one based on:
- Reversals
- Knowledge of preserved blocks (or genes)
Now we are concerned with determining the preserved genes or, more generally, given two strings x and y (of lengths m and n), determining all their possible maximal matches.

Global alignment

One possibility is to perform a global alignment of x and y with a special scoring scheme (for instance +1 for a match and 0 for a mismatch or a gap) and identify the maximal positively scoring chunks.
- Takes O(mn) time
- Does not give all candidate matches
- Might not give maximal matches
Example (alignment figure; the letters of x and y did not survive transcription): in the alignment shown, one match is missed and one reported match is non-maximal, because a longer match exists.

A simple way: k-mers

Another possibility is to find common k-mers. Here's one way.
Algorithm:
- Denote a k-mer of x by (w, 0, i) where w = x_i ... x_{i+k-1}
- Denote a k-mer of y by (z, 1, j) where z = y_j ... y_{j+k-1}
- Sort them lexicographically
- Deduce all matches of length k between x and y

Example: k-mers
x = cabbc, y = bbcab
x's 3-mers: (cab, 0, 1), (abb, 0, 2), (bbc, 0, 3)
y's 3-mers: (bbc, 1, 1), (bca, 1, 2), (cab, 1, 3)
Sort them: (abb, 0, 2), (bbc, 0, 3), (bbc, 1, 1), (bca, 1, 2), (cab, 0, 1), (cab, 1, 3)
Identify matches: (bbc, 0, 3), (bbc, 1, 1) and (cab, 0, 1), (cab, 1, 3)

Disadvantages
- Worst-case running time is still O(mn), e.g. when O(m) k-mers in x and O(n) k-mers in y are identical.
- Thought: aren't we supposed to find these anyway, so that the O(mn) bound is not necessarily bad? No: they might be small, insignificant matches, and we are interested in maximal matches.
- Making k larger reduces the running time because it results in a shorter list of matches, but we might miss significant matches.
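The sort-and-scan procedure can be sketched in Python as follows. This is a minimal illustration, not the lecture's code: the function name `common_kmers` is made up here, positions are 1-based as on the slides, and the strings cabbc and bbcab serve only as sample input.

```python
def common_kmers(x, y, k):
    """Return all (i, j), 1-based, such that the k-mer of x at i equals the k-mer of y at j."""
    # Tag every k-mer with its source (0 for x, 1 for y) and its 1-based position.
    kmers = [(x[i:i + k], 0, i + 1) for i in range(len(x) - k + 1)]
    kmers += [(y[j:j + k], 1, j + 1) for j in range(len(y) - k + 1)]
    kmers.sort()  # lexicographic order: identical k-mers end up adjacent

    matches = []
    start = 0
    while start < len(kmers):
        # Find the run of entries that share the same k-mer.
        end = start
        while end < len(kmers) and kmers[end][0] == kmers[start][0]:
            end += 1
        run = kmers[start:end]
        xs = [pos for _, source, pos in run if source == 0]
        ys = [pos for _, source, pos in run if source == 1]
        # Every x position in the run matches every y position in the run.
        matches += [(i, j) for i in xs for j in ys]
        start = end
    return matches


print(common_kmers("cabbc", "bbcab", 3))
```

The nested pairing inside a run is where the O(mn) worst case comes from: if all k-mers are identical, a single run produces every (i, j) pair.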

A better solution: Suffix tree

We will use an efficient data structure called a suffix tree, which stores all suffixes of a string and supports fast lookup.

Definition: A suffix tree T of a string s of length m is a rooted tree such that:
- It has exactly m leaves numbered 1 to m
- Each internal node other than the root has at least two children
- Each edge is labeled by a substring of s
- No two edge labels out of a node start with the same character
- For any leaf i, the concatenation of the edge labels on the path from the root to i spells out the suffix s_i ... s_m

Example of suffix tree (figure: the suffix tree of a string s of length 6, with leaves numbered 1 to 6; the letters of s did not survive transcription)

Existence

Does a suffix tree of a string s always exist?
Consider the string s in the figure (letters not recoverable from the transcription): if a suffix is a prefix of another suffix, then the path for the first suffix would not end at a leaf!
Solution: always terminate the string with a special character $ that does not occur anywhere else.

Properties of suffix tree

A suffix tree satisfies the following:
- |E| = |V| - 1 (tree)
- Number of leaves = m + 1 (now |s| = m + 1)
- Since each internal node has at least two children, the number of edges |E| = O(|leaves|) = O(m)
- Any sub-tree with k leaves satisfies |E| = O(k)

Building a suffix tree

Here's a simple algorithm:
given s = s_1 ... s_m
insert the special character $ at the end of s
initialize the tree T to one root
for j = 1 to m + 1
    find the longest match of s_j ... s_{m+1} in T, starting from the root and following a unique path
    split the edge where the match stops, add a new node w
    add an edge (w, j) (j is the new leaf) and label it with the remaining unmatched characters of s_j ... s_{m+1}

Example (figure: the suffix tree built for a string s terminated with $, with 7 leaves numbered 1 to 7; the letters of s did not survive transcription)
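The simple quadratic construction can be sketched in Python as below. The `Node` class, the function names and the sample string xabxac are assumptions made for this illustration. Edge labels are stored as index pairs into the terminated string rather than as copied substrings, and when a match stops exactly at an existing node no split is needed, so the new leaf is attached there directly.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}  # first character -> (start, end, child); label is s[start:end]
        self.leaf = leaf    # 1-based suffix number for a leaf, None for an internal node


def build_suffix_tree(text, marker="$"):
    """Naive O(m^2) construction: insert the suffixes one by one."""
    if marker in text:
        raise ValueError("the end marker must not occur in the text")
    s = text + marker
    root = Node()
    for j in range(len(s)):  # insert suffix s[j:], which gets leaf number j + 1
        node, pos = root, j
        while True:
            edge = node.children.get(s[pos])
            if edge is None:
                # The match stops at a node: hang the new leaf directly below it.
                node.children[s[pos]] = (pos, len(s), Node(j + 1))
                break
            start, end, child = edge
            k = 0  # number of characters matched along this edge
            while k < end - start and s[start + k] == s[pos + k]:
                k += 1
            if k < end - start:
                # The match stops inside the edge: split it at a new node w.
                w = Node()
                w.children[s[start + k]] = (start + k, end, child)
                w.children[s[pos + k]] = (pos + k, len(s), Node(j + 1))
                node.children[s[pos]] = (start, start + k, w)
                break
            node, pos = child, pos + k  # whole edge matched, keep walking down
    return root, s


def suffixes_by_leaf(root, s):
    """Spell out the root-to-leaf path of every leaf (used to check the tree)."""
    spelled, stack = {}, [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if node.leaf is not None:
            spelled[node.leaf] = prefix
        for start, end, child in node.children.values():
            stack.append((child, prefix + s[start:end]))
    return spelled


root, s = build_suffix_tree("xabxac")
for leaf, suffix in sorted(suffixes_by_leaf(root, s).items()):
    print(leaf, suffix)
```

Because the end marker occurs only once, no suffix is a prefix of another, so every insertion ends with a mismatch and creates exactly one new leaf.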

Analysis

- Running time: O(m^2). Each suffix requires O(m) time to update the tree. But there exists an O(m) time suffix tree algorithm.
- Space: O(m). How? Each label has O(m) characters and we have |E| = O(m) labels! Solution: do not explicitly store labels, but store the indices [i, j] of a label.

Now what?

How can we use the suffix tree data structure to identify all maximal matches between two strings x and y?
Consider first the following problem: given a string x, determine all locations where another string y occurs. This can be solved efficiently as described next.

Find all occurrences of y in x (string matching)

Algorithm:
build a suffix tree T for x    [O(m)]
match the characters of y along the unique path in T until (case 1) either y is exhausted or (case 2) no more matches are possible    [O(n)]
if (case 2), y does not occur in x    [O(1)]
else the k leaves in the sub-tree below the point of the last match give the k locations of y in x (traverse the tree in linear time)    [O(k)]
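A minimal Python sketch of this string matching procedure follows. The names `Node`, `build_suffix_tree` and `find_occurrences` are illustrative, the tree is built with the simple quadratic algorithm rather than a linear-time one, y is assumed to be non-empty and free of the end marker, and the final sort is added only to make the output easy to read.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}  # first character -> (start, end, child); label is s[start:end]
        self.leaf = leaf    # 1-based suffix number for a leaf, None otherwise


def build_suffix_tree(text, marker="$"):
    """Naive O(m^2) suffix tree; labels are index pairs into s = text + marker."""
    s = text + marker
    root = Node()
    for j in range(len(s)):
        node, pos = root, j
        while True:
            edge = node.children.get(s[pos])
            if edge is None:
                node.children[s[pos]] = (pos, len(s), Node(j + 1))
                break
            start, end, child = edge
            k = 0
            while k < end - start and s[start + k] == s[pos + k]:
                k += 1
            if k < end - start:  # mismatch inside the edge: split it
                mid = Node()
                mid.children[s[start + k]] = (start + k, end, child)
                mid.children[s[pos + k]] = (pos + k, len(s), Node(j + 1))
                node.children[s[pos]] = (start, start + k, mid)
                break
            node, pos = child, pos + k
    return root, s


def find_occurrences(root, s, y):
    """Return the 1-based positions where y occurs in the indexed text."""
    node, matched = root, 0
    while matched < len(y):
        edge = node.children.get(y[matched])
        if edge is None:
            return []                # case 2: no edge continues the match
        start, end, child = edge
        for k in range(start, end):
            if matched == len(y):
                break                # case 1: y exhausted in the middle of the edge
            if s[k] != y[matched]:
                return []            # case 2: mismatch inside the edge
            matched += 1
        node = child
    # Case 1: every leaf below the point of the last match is an occurrence.
    positions, stack = [], [node]
    while stack:
        current = stack.pop()
        if current.leaf is not None:
            positions.append(current.leaf)
        stack.extend(child for _, _, child in current.children.values())
    return sorted(positions)


root, s = build_suffix_tree("xabxac")
print(find_occurrences(root, s, "xa"))
```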

Correctness

Why is the string matching algorithm correct?
If y occurs in x at position i, then the i-th suffix of x must start with y.
Therefore, leaf i must be reached by the path determined by y.

Finding maximal matches

Given x and y, we would like to find all maximal matches between x and y:
- x_i ... x_{i+l} = y_j ... y_{j+l}
- We cannot extend x_i ... x_{i+l} and y_j ... y_{j+l} and still obtain a match

We will first find all matches starting at y_j that cannot be extended to the right:
- Build a suffix tree T for x (do this only once)
- Find the path in T determined by the longest possible prefix of the suffix y_j ... y_n (it could stop in the middle of an edge e; in that case e is part of the path)
- Let v_k, k = 1 ... p, be the internal nodes on this path and T_k be the sub-tree rooted at v_k that excludes v_{k+1}
- Identify the leaves in each sub-tree

Illustration (figure: the path from the root through v_1, v_2, ..., v_{p-1}, v_p to the last point of match; the edges on the path have label lengths L_1, L_2, ..., L_p; the sub-trees T_1, ..., T_p hang off the path, and T_k has m_k leaves)

A leaf i in T_k gives the location in x of a match between x_i ... x_{i+L_1+...+L_k-1} and y_j ... y_{j+L_1+...+L_k-1} that cannot be extended to the right.

Running time: O(m) {building T} + O(Σ L_k + Σ m_k) = O(n + m)

What about left?

Given a leaf i, let left(i) be the character x_{i-1}.
If left(i) ≠ y_{j-1}, then i represents a maximal match.
Therefore, we obtain all maximal matches between x and y in O(mn) time by repeating the previous algorithm for every suffix of y.

Algorithm:
build a suffix tree T for x    [O(m)]
for j = 1 to n
    find the path in T determined by the longest possible prefix of the suffix y_j ... y_n (it could stop in the middle of an edge e; in that case e is part of the path)    [O(n)]
    let v_k, k = 1 ... p, be the internal nodes on this path and T_k be the sub-tree rooted at v_k that excludes v_{k+1}
    let l(v_k) = length of the match up to node v_k
    identify all leaves i in each sub-tree such that left(i) ≠ y_{j-1}; such a leaf i in sub-tree T_k represents a maximal match of length l(v_k) starting at position i in x and position j in y    [O(m)]

Generalized suffix tree for a set of strings

We can build a suffix tree for a set of strings s_1, s_2, ..., s_n:
- Append a different end-of-string marker to each string in the set
- Concatenate all the strings together
- Build a suffix tree for the concatenated string
The resulting suffix tree will have a leaf for each suffix of the concatenated string and is built in time proportional to the sum of all lengths.
The leaf numbers can be easily converted to two numbers, one identifying a string s_i and the other a starting position in s_i.
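Returning to the O(mn) maximal-match algorithm (one suffix tree for x, then one walk per suffix of y with the left(i) test), a minimal Python sketch could look like this. All names are illustrative. Two details are choices made here rather than statements from the slides: a walk that stops in the middle of an edge reports the leaves below that edge with the exact matched length, and a missing left character at the start of x or y counts as different from everything.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}  # first character -> (start, end, child); label is s[start:end]
        self.leaf = leaf    # 1-based suffix number for a leaf, None otherwise


def build_suffix_tree(x, marker="$"):
    """Naive O(m^2) suffix tree of x + marker."""
    s = x + marker
    root = Node()
    for j in range(len(s)):
        node, pos = root, j
        while True:
            edge = node.children.get(s[pos])
            if edge is None:
                node.children[s[pos]] = (pos, len(s), Node(j + 1))
                break
            start, end, child = edge
            k = 0
            while k < end - start and s[start + k] == s[pos + k]:
                k += 1
            if k < end - start:  # mismatch inside the edge: split it
                mid = Node()
                mid.children[s[start + k]] = (start + k, end, child)
                mid.children[s[pos + k]] = (pos + k, len(s), Node(j + 1))
                node.children[s[pos]] = (start, start + k, mid)
                break
            node, pos = child, pos + k
    return root, s


def leaves_below(node, skip=None):
    """Leaf numbers in the sub-tree of node, ignoring the sub-tree rooted at skip."""
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        if current.leaf is not None:
            found.append(current.leaf)
        stack.extend(c for _, _, c in current.children.values() if c is not skip)
    return found


def maximal_matches(x, y, marker="$"):
    """Return sorted (i, j, length) triples, 1-based, of all maximal matches."""
    if marker in x or marker in y:
        raise ValueError("the end marker must not occur in x or y")
    root, s = build_suffix_tree(x, marker)
    result = []

    def report(leaves, length, j):
        for i in leaves:
            # Keep leaf i only if left(i) differs from y_{j-1} (boundaries count as different).
            if i == 1 or j == 0 or x[i - 2] != y[j - 1]:
                result.append((i, j + 1, length))

    for j in range(len(y)):  # walk down the path spelled by the suffix y[j:]
        node, depth = root, 0
        while True:
            nxt = j + depth
            edge = node.children.get(y[nxt]) if nxt < len(y) else None
            if edge is None:  # the walk stops exactly at a node
                if depth > 0:
                    report(leaves_below(node), depth, j)
                break
            start, end, child = edge
            k = 0
            while k < end - start and nxt + k < len(y) and s[start + k] == y[nxt + k]:
                k += 1
            if depth > 0:  # leaves hanging off the path at this node match exactly depth characters
                report(leaves_below(node, skip=child), depth, j)
            if k < end - start:  # the walk stops inside the edge
                report(leaves_below(child), depth + k, j)
                break
            node, depth = child, depth + k
    return sorted(result)


print(maximal_matches("cabbc", "bbcab"))
```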

Example (figure: the generalized suffix tree for two strings s_1 and s_2, built on the concatenation of s_1, the marker $ and s_2; each leaf is labeled by a pair "string number, starting position", such as 1,6 or 2,7; the letters of the strings did not survive transcription)

Fix labels of leaf edges

One defect is that the tree now represents suffixes that span more than one original string.
Because each string marker occurs only once, the unwanted suffixes are removed by fixing the labels on the leaf edges.
(figure: the same tree with the leaf edge labels fixed)

Suffix tree for x and y

Therefore, given two strings x and y, we can build a suffix tree for both in O(m + n) (i.e. linear) time.
Each leaf in the tree represents:
- Either a suffix from x
- Or a suffix from y
Mark each internal node v with x (y) if there is a leaf in the sub-tree of v representing a suffix from x (y).
This can be done in linear time by a bottom-up traversal of the tree from the leaves to the root.
Note that if v is marked x (y), all ancestors of v are marked x (y).
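For the generalized suffix tree, converting a leaf number of the concatenated string into a pair (string number, starting position), and cutting a suffix at its own string's end marker, can be sketched as follows. The function names, the markers $ and #, and the sample strings are assumptions for this illustration only.

```python
from bisect import bisect_right


def concatenate(strings, markers):
    """Join the strings, each followed by its own end marker.

    Returns the concatenated text and the 0-based start offset of each string.
    """
    text, starts = "", []
    for string, marker in zip(strings, markers):
        starts.append(len(text))
        text += string + marker
    return text, starts


def to_pair(position, starts):
    """Map a 1-based position in the concatenated text to (string number, 1-based position)."""
    k = bisect_right(starts, position - 1) - 1  # index of the last string starting at or before it
    return k + 1, position - starts[k]


def own_suffix(text, position, starts, strings):
    """The suffix starting at position, cut right after its own string's end marker."""
    k, _ = to_pair(position, starts)
    end = starts[k - 1] + len(strings[k - 1]) + 1  # one past the marker of string k
    return text[position - 1:end]


strings = ["cabbc", "bbcab"]
text, starts = concatenate(strings, ["$", "#"])
print(text)
print(to_pair(7, starts))
print(own_suffix(text, 4, starts, strings))
```

Cutting at the marker is the leaf-label fix in miniature: the suffix starting at position 4 of the concatenation becomes bc$ instead of running on into the second string.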

Common substrings

If αp is a substring of x and αq is a substring of y for p ≠ q, then α corresponds to an internal node v marked with both x and y, and vice versa.
Proof: α occurs in both x and y such that the character to the right of α in x differs from the character to the right of α in y.
(figure: the node for α, marked x,y, with a leaf for x and a leaf for y below it)
Conversely, every internal node marked with both x and y has to satisfy the situation depicted above; then αp is a substring of x and αq is a substring of y for p ≠ q.

Left diverse node

An internal node v is left diverse iff it has two children v_1 and v_2 with a leaf i for x in v_1's sub-tree and a leaf j for y in v_2's sub-tree, such that left(i) ≠ left(j) (assume x_0 and y_0 are different and distinct from any other character).
If uαp is a substring of x and wαq is a substring of y for u ≠ w and p ≠ q, then α corresponds to a left diverse node v, and vice versa.
Proof: similar to the previous proof.
Call such an α a maximal common substring.

Compact representation

- Therefore, we have only O(m + n) maximal common substrings for x and y (but each maximal common substring might appear in multiple locations)
- If we identify left diverse nodes in linear time, we need only O(m + n) time and space to come up with this compact representation of all maximal common substrings
- A maximal match can be represented as (p_1, p_2, l), where p_1 and p_2 are the positions of a maximal common substring of length l in x and y respectively
- We can obtain all maximal matches in O(m + n + k), where k is their number (we will not present the algorithm)
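To make the (p_1, p_2, l) representation concrete, here is a brute-force Python sketch that applies the definition directly and then groups the maximal matches by their substring. It is deliberately not the linear-time suffix tree method, which the lecture does not present. The function names and the sample strings are assumptions, and string boundaries are treated as characters that differ from everything.

```python
from collections import defaultdict


def maximal_matches_bruteforce(x, y):
    """All (p1, p2, l), 1-based, that cannot be extended to the left or to the right."""
    matches = []
    for i in range(len(x)):
        for j in range(len(y)):
            if x[i] != y[j]:
                continue
            if i > 0 and j > 0 and x[i - 1] == y[j - 1]:
                continue  # extendable to the left, so not maximal
            l = 0
            while i + l < len(x) and j + l < len(y) and x[i + l] == y[j + l]:
                l += 1  # extend to the right as far as possible
            matches.append((i + 1, j + 1, l))
    return matches


def maximal_common_substrings(x, y):
    """Group the maximal matches by the substring they spell."""
    groups = defaultdict(list)
    for p1, p2, l in maximal_matches_bruteforce(x, y):
        groups[x[p1 - 1:p1 - 1 + l]].append((p1, p2, l))
    return dict(groups)


for substring, where in maximal_common_substrings("cabbc", "bbcab").items():
    print(substring, where)
```

Each key of the returned dictionary is one maximal common substring, and its list holds the locations at which it gives a maximal match, which mirrors the remark that a single maximal common substring may appear in multiple locations.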

Identifying left diverse nodes

For each node v the algorithm records:
- the character a(v): the left character shared by every leaf for x in v's sub-tree, or the special character ε if no leaf for x exists in v's sub-tree, or the special character & otherwise (the leaves for x do not all have the same left character)
- the character b(v): the left character shared by every leaf for y in v's sub-tree, or the special character ε if no leaf for y exists in v's sub-tree, or the special character @ otherwise (the leaves for y do not all have the same left character)
Computing a(v) and b(v) can be done bottom-up in linear time.
Note that v is left diverse iff it has two children v_1 and v_2 with:
    a(v_1) ≠ b(v_2), a(v_1) ≠ ε, b(v_2) ≠ ε
or
    b(v_1) ≠ a(v_2), b(v_1) ≠ ε, a(v_2) ≠ ε
It takes O(|Σ|^2) time (constant) to find two such children or none, where Σ is the alphabet (each node has at most |Σ| children).
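The bottom-up computation of a(v) and b(v) and the pairwise test on children can be sketched in Python on a small hand-built tree. Everything here is illustrative: the tree is a toy given as nested dictionaries rather than a real suffix tree, each leaf carries its source string and left character directly, and the alphabet is assumed not to contain ε, & or @.

```python
EPS, MIXED_X, MIXED_Y = "ε", "&", "@"


def merge(values, mixed):
    """Combine the children's summary characters into the parent's."""
    seen = {v for v in values if v != EPS}
    if not seen:
        return EPS  # no leaf of this kind below
    if len(seen) == 1 and mixed not in seen:
        return next(iter(seen))  # all leaves below share one left character
    return mixed  # at least two different left characters below


def summarize(node, found):
    """Return (a(v), b(v)) for node and append the names of left diverse nodes to found."""
    if "leaf" in node:
        source, left = node["leaf"]
        return (left, EPS) if source == "x" else (EPS, left)
    kids = [summarize(child, found) for child in node["children"]]
    # Try every ordered pair of distinct children (v1, v2): at most |Σ|^2 pairs.
    diverse = any(
        a1 != EPS and b2 != EPS and a1 != b2
        for p, (a1, _) in enumerate(kids)
        for q, (_, b2) in enumerate(kids)
        if p != q
    )
    if diverse:
        found.append(node["name"])
    a = merge([a for a, _ in kids], MIXED_X)
    b = merge([b for _, b in kids], MIXED_Y)
    return a, b


# Toy tree: each leaf is (source string, left character).
tree = {"name": "r", "children": [
    {"name": "v", "children": [{"leaf": ("x", "c")}, {"leaf": ("y", "d")}]},
    {"name": "w", "children": [{"leaf": ("x", "c")}, {"leaf": ("y", "c")}]},
]}
found = []
summary = summarize(tree, found)
print(summary, found)
```

Looping over ordered pairs of distinct children covers both alternatives of the condition, since the second alternative is the first one with the roles of v_1 and v_2 swapped.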