Computational Biology
Lecture 18
Genome Rearrangements

Finding preserved genes
- We have seen before how to rearrange a genome to obtain another one based on:
  - Reversals
  - Knowledge of preserved blocks (or genes)
- Now we are concerned with determining the preserved genes or, more generally, given two strings x and y, determining all their possible maximal matches

Global alignment
- One possibility is to perform a global alignment of x and y with a special scoring scheme (for instance +1 for a match, 0 for a mismatch, 0 for a gap) and identify the maximal positively scoring chunks
- Takes O(mn) time
- Does not give all candidate matches
- Might not give maximal matches
[Figure: example alignment of x and y in which one match is missed and one reported chunk is non-maximal because a longer match exists]
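The O(mn) cost of the global alignment comes from filling a dynamic programming table with one cell per pair of prefixes. A minimal sketch, assuming the (+1, 0, 0) scheme means +1 per match and 0 per mismatch or gap; the function name and parameters are illustrative and not taken from the lecture:

```python
def alignment_score(x, y, match=1, mismatch=0, gap=0):
    """Score of the best global alignment of x and y (O(mn) time and space)."""
    m, n = len(x), len(y)
    # S[i][j] = best score for aligning x[:i] with y[:j]
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair,  # align x_i with y_j
                          S[i - 1][j] + gap,       # x_i against a gap
                          S[i][j - 1] + gap)       # y_j against a gap
    return S[m][n]
```

With these default scores the result is the length of a longest common subsequence of x and y. Recovering the positively scoring chunks would additionally require a traceback through the table, which is omitted here.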
A simple way: k-mers
Another possibility is to find common k-mers. Here's one way:
Algorithm:
- Denote a k-mer of x by (w, 0, i) where w = x_i ... x_{i+k-1}
- Denote a k-mer of y by (z, 1, j) where z = y_j ... y_{j+k-1}
- Sort them lexicographically
- Deduce all k-long matches between x and y

Example: k-mers
x = cabbc, y = bbcab
- x's 3-mers: (cab, 0, 1), (abb, 0, 2), (bbc, 0, 3)
- y's 3-mers: (bbc, 1, 1), (bca, 1, 2), (cab, 1, 3)
- sort them: (abb, 0, 2), (bbc, 0, 3), (bbc, 1, 1), (bca, 1, 2), (cab, 0, 1), (cab, 1, 3)
- Identify matches: (bbc, 0, 3), (bbc, 1, 1) and (cab, 0, 1), (cab, 1, 3)

Disadvantages
- Worst case running time is still O(mn), e.g. when O(m) k-mers in x and O(n) k-mers in y are identical
- Thought: aren't we supposed to find these anyway, so that the O(mn) bound is not necessarily bad? No, they might be small insignificant matches; we are interested in maximal matches
- Making k larger reduces the running time because it results in a shorter list of matches, but we might miss significant matches
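The k-mer procedure (tag, sort, scan runs of equal words) can be sketched as follows. This is a minimal illustration, assuming 1-based positions as on the slide; the function name kmer_matches and the output format (word, i, j) are choices made here, not part of the lecture:

```python
from itertools import groupby

def kmer_matches(x, y, k):
    """Return (word, i, j) for every k-mer shared by x (position i) and y (position j)."""
    # Tag each k-mer with its source (0 for x, 1 for y) and 1-based start position.
    tagged = [(x[i:i + k], 0, i + 1) for i in range(len(x) - k + 1)]
    tagged += [(y[j:j + k], 1, j + 1) for j in range(len(y) - k + 1)]
    tagged.sort()  # lexicographic order brings identical k-mers together
    matches = []
    for word, run in groupby(tagged, key=lambda t: t[0]):
        run = list(run)
        xs = [pos for _, src, pos in run if src == 0]
        ys = [pos for _, src, pos in run if src == 1]
        # Every x occurrence pairs with every y occurrence: this product is
        # what makes the worst case O(mn).
        matches.extend((word, i, j) for i in xs for j in ys)
    return matches
```

For instance, with the illustrative inputs x = "cabbc", y = "bbcab" and k = 3, the shared 3-mers are "bbc" (position 3 in x, 1 in y) and "cab" (position 1 in x, 3 in y).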
A better solution: Suffix tree
We will use an efficient data structure called a suffix tree that stores all suffixes of a string and supports fast lookup
Definition: A suffix tree T of a string s of length m is a rooted tree such that:
- It has exactly m leaves numbered 1 to m
- Each internal node other than the root has at least two children
- Each edge is labeled by a substring of s
- No two edge labels out of a node start with the same character
- For any leaf i, the concatenation of the edge labels on the path from the root to i spells out the suffix s_i ... s_m

Example of suffix tree
s = xabxac
[Figure: suffix tree of s with leaves numbered 1 to 6]

Existence
- Does a suffix tree of a string s always exist?
- Consider s = xabxa
- If a suffix is a prefix of another suffix, then the path for the first suffix would not end at a leaf!
- Solution: always terminate the string with a special character $ that does not occur anywhere else in it
[Figure: suffix tree of s terminated with $, with leaves numbered 1 to 6]
Properties of suffix tree
A suffix tree satisfies the following:
- |E| = |V| - 1 (tree)
- Number of leaves = m + 1 (now |s| = m + 1)
- Since each internal node has at least two children, the number of edges |E| = O(#leaves) = O(m)
- Any sub-tree with k leaves satisfies |E| = O(k)

Building a suffix tree
Here's a simple algorithm:
given s = s_1 ... s_m
insert the special character $ at the end of s
initialize the tree T to one root
for j = 1 to m + 1
    find the longest match of the suffix starting at position j (including $) in T, starting from the root and following the unique path
    if the match stops in the middle of an edge, split the edge there and add a new node w; otherwise let w be the node where the match stops
    add an edge (w, j) (j is the new leaf) and label it with the remaining unmatched characters of the suffix

Example
s = xabxac
[Figure: step-by-step construction of the suffix tree of s$, ending with leaves numbered 1 to 7]
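The simple insertion algorithm can be sketched as follows. This is one possible rendering under stated assumptions, not the lecture's own code: the class Node, the tuple layout (start, end, child) for edges, and the helper leaves are all names introduced here. Edge labels are stored as index pairs into the text rather than as copied strings.

```python
class Node:
    def __init__(self):
        self.children = {}  # first character -> (start, end, child); label is text[start:end]
        self.leaf = None    # 1-based suffix number if this node is a leaf

def build_suffix_tree(s, terminator="$"):
    """Naive O(m^2) construction: insert the suffixes one at a time."""
    assert terminator not in s
    text = s + terminator
    root = Node()
    for j in range(len(text)):
        node, pos = root, j
        while True:
            c = text[pos]
            if c not in node.children:
                # The match stops at a node: hang a new leaf edge here.
                leaf = Node()
                leaf.leaf = j + 1
                node.children[c] = (pos, len(text), leaf)
                break
            start, end, child = node.children[c]
            k = 0
            while start + k < end and text[start + k] == text[pos + k]:
                k += 1
            if start + k == end:
                # Whole edge matched: continue from the child.
                node, pos = child, pos + k
                continue
            # The match stops inside the edge: split it with a new node.
            mid = Node()
            node.children[c] = (start, start + k, mid)
            mid.children[text[start + k]] = (start + k, end, child)
            leaf = Node()
            leaf.leaf = j + 1
            mid.children[text[pos + k]] = (pos + k, len(text), leaf)
            break
    return root, text

def leaves(node):
    """Collect the leaf numbers below a node."""
    if node.leaf is not None:
        return [node.leaf]
    found = []
    for _, _, child in node.children.values():
        found.extend(leaves(child))
    return found
```

Because the terminator occurs only once, a comparison always fails before running past the end of the text, so no explicit bounds check on pos + k is needed.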
Analysis
- Running time: O(m^2)
  - Each suffix requires O(m) time to update the tree
  - But there exists an O(m) time suffix tree algorithm
- Space: O(m)
  - How? Each label has O(m) characters and we have |E| = O(m) labels!
  - Solution: do not explicitly store labels, but store the indices [i, j] of a label

Now what?
- How can we use the suffix tree data structure to identify all maximal matches between two strings x and y?
- Consider first the following problem: given a string x, determine all locations where another string y occurs
- This can be solved efficiently as described next

Find all occurrences of y in x: string matching
Algorithm:
- build a suffix tree T for x: O(m)
- match the characters of y along the unique path in T until (case 1) either y is exhausted or (case 2) no more matches are possible: O(n)
- if (case 2), y does not occur in x: O(1)
- else the k leaves in the sub-tree below the point of the last match give the k locations of y in x (traverse the tree in linear time): O(k)
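The matching step can be illustrated with a short sketch. To keep it self-contained, it uses an uncompressed suffix trie (one node per character, O(m^2) space) in place of the suffix tree, so it shows the walk-then-collect logic but not the linear bounds. The names TrieNode, build_suffix_trie and occurrences are assumptions made for this example, and y is assumed not to contain "$".

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.leaf = None    # 1-based suffix number where a suffix ends

def build_suffix_trie(x):
    """Insert every suffix of x$ character by character."""
    text = x + "$"
    root = TrieNode()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, TrieNode())
        node.leaf = i + 1
    return root

def occurrences(root, y):
    """Return the sorted 1-based positions where y occurs in x."""
    node = root
    for ch in y:
        if ch not in node.children:
            return []  # case 2: no more matches are possible
        node = node.children[ch]
    # Case 1: y is exhausted; every leaf below this point is an occurrence.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        if current.leaf is not None:
            found.append(current.leaf)
        stack.extend(current.children.values())
    return sorted(found)
```

Swapping the trie for a compressed suffix tree changes only how the walk advances along an edge; the two cases and the leaf collection stay the same.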
Correctness
- Why is the string matching algorithm correct?
- If y occurs in x at position i, then the i-th suffix of x must start with y
- Therefore, leaf i must be reached by the path determined by y

Finding maximal matches
- Given x and y, we would like to find all maximal matches between x and y:
  - x_i ... x_{i+l} = y_j ... y_{j+l}
  - Cannot extend x_i ... x_{i+l} and y_j ... y_{j+l} and still obtain a match
- We will find all matches starting at y_j that cannot be extended to the right:
  - Build a suffix tree T for x (do this only once)
  - Find the path in T determined by the longest possible prefix of the suffix y_j ... y_n (it could stop in the middle of an edge e; in that case e is part of the path)
  - Let v_k, k = 1 ... p, be an internal node on this path and T_k be the sub-tree rooted at v_k that excludes v_{k+1}
  - Identify the leaves in each sub-tree

Illustration
[Figure: path from the root through internal nodes v_1, v_2, ..., v_{p-1}, v_p with edge label lengths L_1, L_2, ..., L_p down to the last point of match; sub-trees T_1, T_2, ..., T_{p-1}, T_p hang off the path, T_k having m_k leaves]
- A leaf i in T_k gives the location in x of a match between x_i ... x_{i+L_1+...+L_k-1} and y_j ... y_{j+L_1+...+L_k-1} that cannot be extended to the right
- Running time: O(m) {building T} + O(Σ L_k + Σ m_k) = O(n + m)
What about left?
- Given a leaf i, let left(i) be the character x_{i-1}
- If left(i) ≠ y_{j-1}, then i represents a maximal match
- Therefore, we obtain all maximal matches between x and y in O(mn) time by repeating the previous algorithm for every suffix of y

Algorithm
Build a suffix tree T for x: O(m)
for j = 1 to n
    find the path in T determined by the longest possible prefix of the suffix y_j ... y_n (it could stop in the middle of an edge e; in that case e is part of the path): O(n)
    let v_k, k = 1 ... p, be an internal node on this path and T_k be the sub-tree rooted at v_k that excludes v_{k+1}
    let l(v_k) = length of the match up to node v_k
    identify all leaves i in each sub-tree such that left(i) ≠ y_{j-1}: O(m)
    such a leaf i in sub-tree T_k represents a maximal match of length l(v_k) starting at position i in x and position j in y

Generalized suffix tree for a set of strings
- We can build a suffix tree for a set of strings s_1, s_2, ..., s_n:
  - Append a different end of string marker to each string in the set
  - Concatenate all the strings together
  - Build a suffix tree for the concatenated string
- The resulting suffix tree will have a leaf for each suffix of the concatenated string and is built in time proportional to the sum of all lengths
- The leaf numbers can be easily converted to two numbers, one identifying a string s_i and the other a starting position in s_i
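The conversion of a leaf number in the concatenated string into a (string, position) pair can be sketched as below. This is an illustrative helper, assuming each string is followed by exactly one marker character and all positions are 1-based; the name make_locator is introduced here and is not from the lecture.

```python
from bisect import bisect_right

def make_locator(strings):
    """Build a function mapping a position in the concatenation to (string index, local position)."""
    starts = []
    offset = 1
    for s in strings:
        starts.append(offset)    # global position where this string begins
        offset += len(s) + 1     # +1 for the end of string marker
    def locate(p):
        k = bisect_right(starts, p) - 1  # last string starting at or before p
        return (k + 1, p - starts[k] + 1)
    return locate
```

For example, with two strings of lengths 5 and 6, global position 7 maps to (2, 1), the first character of the second string, and position 6 maps to (1, 6), the marker that ends the first string.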
Example
s_1 = xabxa, s_2 = babxba
s = xabxa$babxba
[Figure: generalized suffix tree of s; each leaf is labeled with a pair (string number, starting position), e.g. 1,1 through 1,6 for s_1]

Fix labels of leaf edges
- One defect is that the tree now represents suffixes that span more than one original string
- Because each string marker occurs only once, the unwanted suffixes are removed by fixing the labels on leaf edges
[Figure: the same tree with each leaf edge label cut at the first end of string marker]

Suffix tree for x and y
- Therefore, given two strings x and y, we can build a suffix tree for both in O(m + n) (i.e. linear) time
- Each leaf in the tree represents
  - either a suffix from x
  - or a suffix from y
- Mark each internal node v with x (y) if there is a leaf in the sub-tree of v representing a suffix from x (y)
- This can be done in linear time by a bottom up traversal of the tree from leaves to the root
- Note that if v is marked x (y), all ancestors of v are marked x (y)
Common substrings
- If αp is a substring of x and αq is a substring of y for p ≠ q, then α corresponds to an internal node v marked with both x and y, and vice-versa
- Proof: α occurs in both x and y such that the character to the right of α in x differs from the character to the right of α in y
[Figure: internal node reached by α and marked x, y, with a leaf for x below one child and a leaf for y below another]
- Conversely, every internal node marked with both x and y has to satisfy the situation depicted above; then αp is a substring of x and αq is a substring of y for p ≠ q

Left diverse node
- An internal node v is left diverse iff it has two children v_1 and v_2 with a leaf i for x in v_1's sub-tree and a leaf j for y in v_2's sub-tree, such that left(i) ≠ left(j) (assume x_0 and y_0 are different and distinct from any other character)
- If uαp is a substring of x and wαq is a substring of y for u ≠ w and p ≠ q, then α corresponds to a left diverse node v, and vice-versa
- Proof: similar to the previous proof
- Call such an α a maximal common substring

Compact representation
- Therefore, we have only O(m + n) maximal common substrings for x and y (but each maximal common substring might appear in multiple locations)
- If we identify left diverse nodes in linear time, we need only O(m + n) time and space to come up with this compact representation of all maximal common substrings
- A maximal match can be represented as (p_1, p_2, l) where p_1 and p_2 are the positions of a maximal common substring of length l in x and y respectively
- We can obtain all maximal matches in O(m + n + k) where k is their number (we will not present the algorithm)
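Since the O(m + n + k) algorithm is not presented, a direct check of the definition can serve as a reference when testing a suffix tree implementation. The sketch below is a brute-force enumeration, not the suffix tree method: it reports every triple (p_1, p_2, l), with 1-based positions, that cannot be extended to the left or to the right. The function name is an assumption made here.

```python
def maximal_matches(x, y):
    """Brute-force list of maximal matches (p1, p2, l) between x and y."""
    found = []
    for i in range(len(x)):
        for j in range(len(y)):
            if x[i] != y[j]:
                continue
            # Extendable to the left: a longer match starting earlier covers this one.
            if i > 0 and j > 0 and x[i - 1] == y[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            l = 0
            while i + l < len(x) and j + l < len(y) and x[i + l] == y[j + l]:
                l += 1
            found.append((i + 1, j + 1, l))
    return found
```

A match starting at the first character of either string counts as not extendable to the left, which mirrors the assumption that x_0 and y_0 differ from every other character. In practice one would filter the output by a minimum length l, since very short maximal matches are rarely significant.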
Identifying left diverse nodes
- For each node v the algorithm records:
  - the character a(v): the left character of every leaf for x in v's sub-tree if they all share it, or the special character ε if no leaf for x exists in v's sub-tree, or the special character & otherwise (the leaves for x have differing left characters)
  - the character b(v): the left character of every leaf for y in v's sub-tree if they all share it, or the special character ε if no leaf for y exists in v's sub-tree, or the special character @ otherwise (the leaves for y have differing left characters)
- Computing a(v) and b(v) can be done with a bottom up approach in linear time
- Note that v is left diverse iff it has two children v_1 and v_2 with:
  - a(v_1) ≠ b(v_2), a(v_1) ≠ ε, b(v_2) ≠ ε, or
  - b(v_1) ≠ a(v_2), b(v_1) ≠ ε, a(v_2) ≠ ε
- It takes O(|Σ|^2) time (constant) to find two such children or none, where Σ is the alphabet (each node has at most |Σ| children)
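The bottom up rule for a(v) and b(v) and the left diverse test can be sketched on their own, independently of how the tree is stored. In this illustration a node is represented only by the list of (a, b) pairs of its children; the constant EPS (an empty string standing in for ε) and both function names are assumptions made here.

```python
EPS = ""  # stands in for ε: no leaf of that string in the sub-tree

def merge_left_chars(child_values, mixed):
    """Combine the children's a-values (mixed='&') or b-values (mixed='@') into the parent's."""
    seen = {v for v in child_values if v != EPS}
    if not seen:
        return EPS            # no leaf of this string below the node
    if len(seen) == 1 and mixed not in seen:
        return seen.pop()     # all such leaves share one left character
    return mixed              # at least two different left characters

def is_left_diverse(children):
    """children: list of (a, b) pairs, one per child of the node."""
    # Ranging over ordered pairs of distinct children covers both conditions
    # (a of one child against b of the other, in either order).
    for p, (a1, _) in enumerate(children):
        for q, (_, b2) in enumerate(children):
            if p != q and a1 != EPS and b2 != EPS and a1 != b2:
                return True
    return False
```

Using two different special characters, & and @, matters: a child whose x leaves have mixed left characters and a child whose y leaves have mixed left characters then compare as different, which is the correct outcome.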