Common intervals of genomes
Mathieu Raffinot, CNRS LIAFA
Context: comparative genomics.
A set of genomes, partially or totally annotated.
Informative groups of genes or domains? Ex: COG database.
Many difficulties!
Biology:
+ What are two similar genes?
+ What about alternative splicing?
+ When are two genes close (notion of distance)?
+ What is an interesting cluster? Basis: selection pressure > genes working together are kept close.
Computer science:
+ How to model clusters? Graphs / strings?
+ How to compute those clusters?
+ How to manage the sets of clusters and extract useful information?
One of the simplest models: genomes as strings of units; common intervals.
Simplest case in this model: 2 genomes!
Common interval:
+ one interval on each chromosome
+ same set of genes in each interval
+ external bounds not in the set of genes
[Figure: example pairs of chromosomes with common intervals marked; gene labels lost in extraction.]
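The three conditions in the definition of a common interval can be checked directly. The following is only a minimal sketch: the function name `is_common_interval` and the 0-based inclusive indices are assumptions made for illustration, not notation from the slides.

```python
def is_common_interval(x, i, j, y, k, l):
    """Check whether x[i..j] and y[k..l] (0-based, inclusive) form a common interval.

    Conditions taken from the definition: same set of genes in both intervals,
    and the genes just outside each interval are not in that set.
    """
    genes = set(x[i:j + 1])
    if genes != set(y[k:l + 1]):
        return False

    def bounded(s, a, b):
        # The neighbours of s[a..b], when they exist, must not belong to the set.
        left_ok = a == 0 or s[a - 1] not in genes
        right_ok = b == len(s) - 1 or s[b + 1] not in genes
        return left_ok and right_ok

    return bounded(x, i, j) and bounded(y, k, l)
```

For instance, with the hypothetical chromosomes "abcde" and "ecbad", the intervals covering "abc" and "cba" satisfy all three conditions.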
How many common intervals?
X first chromosome, X = x_1 x_2 ... x_n
Y second chromosome, Y = y_1 y_2 ... y_m
Common alphabet Σ, |Σ| <= max(|X|, |Y|)
[Figure: for an example Y = y_1 y_2 ... y_m, the lists fo(Y,1), fo(Y,2), fo(Y,3), fo(Y,4), fo(Y,5), each numbering the characters 1, 2, 3, ... from that position, and one entry of Rank(Y,1) equal to 3; the example letters were lost in extraction.]
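Int[k], k = 1, 2, 3
[Figure: the intervals Int[1], Int[2], Int[3] drawn on Y = y_1 y_2 ... y_m, with fo(Y,1) numbering the characters 1, 2, 3 and one entry of Rank(Y,2) equal to 2; the example letters were lost in extraction.]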
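A small sketch may help read the fo, Rank and Int[k] notation. It rests on assumptions that the extracted text does not confirm: that fo(Y,i) lists the distinct characters in order of first occurrence when Y is read from position i, that Rank(Y,i)[c] is the position of c in that order, and that Int[k] is the longest interval starting at i that holds exactly the first k of those characters. Positions are 1-based, as on the slides; the function names are hypothetical.

```python
def first_occurrence_order(y, i):
    """Assumed meaning of fo(Y, i) and Rank(Y, i): map each character to its
    rank (1, 2, 3, ...) in order of first occurrence, reading y from position i."""
    rank = {}
    for c in y[i - 1:]:
        if c not in rank:
            rank[c] = len(rank) + 1
    return rank


def nested_intervals(y, i):
    """Assumed meaning of Int[k]: for each k, the longest interval (i, end)
    whose characters are exactly the first k characters of fo(Y, i)."""
    seen = set()
    ends = {}
    for pos in range(i, len(y) + 1):
        seen.add(y[pos - 1])
        ends[len(seen)] = pos  # overwritten until a new character appears
    return [(i, ends[k]) for k in sorted(ends)]
```

Under these assumptions the intervals returned for a fixed i all share their left end, so each one contains the previous one, which matches the nesting that the next slide relies on.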
The Int[k] are nested! They form a tree!
2n valid Int[k] at most!
2nm common intervals at maximum.
The bound is reached!!
How to identify them all? Two approaches:
Direct computation (Didier)
+ O(nm), but needs lowest common ancestor queries (otherwise O(nm log n))
+ No structure in the output!
+ Complexity does not depend on the input
+ No index
Fingerprint computation on a single string + index + merge afterwards
+ O(n + |L_1| log n + m + |L_2| log m) (can be worse than Didier)
+ Structure in the output and possibility of searching for a fingerprint
+ Complexity does depend on the input
+ Keep the index for further computations
S = s_1..s_n string of length n
Alphabet Σ of size |Σ|, not fixed (possibly O(n))
Fingerprint f: set of character(s) of a substring s_i..s_j
General problem: compute and represent the set of all fingerprints of S
Examples: [two lists of fingerprints, from single characters up to a four-character set; the letters were lost in extraction.]
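The definition of a fingerprint can be illustrated by brute force. This sketch enumerates every substring, so it runs in quadratic time and is not one of the algorithms discussed in the talk; the function name `fingerprints` is an assumption.

```python
def fingerprints(s):
    """Return the set of all fingerprints of s, each as a frozenset of characters."""
    result = set()
    for i in range(len(s)):
        chars = set()
        for j in range(i, len(s)):
            chars.add(s[j])               # characters of s[i..j]
            result.add(frozenset(chars))  # duplicates collapse in the set
    return result
```

On the hypothetical string "aba" this yields three fingerprints: {a}, {b} and {a, b}.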
Maximal location <i,j> of f:
+ f is the fingerprint of s_i..s_j
+ s_{i-1} not in f, s_{j+1} not in f
Number of maximal locations: |L| <= n|Σ|
The bound is easily reached, but |L| is usually much less.
Σ_k = {a_1, a_2, .., a_k}
w_1 = a_1, w_k = w_{k-1} a_k w_{k-1}
w_2 = a_1 (a_2) a_1, w_3 = (a_1 a_2 a_1) a_3 (a_1 a_2 a_1), ...
|w_k|.|Σ_k| = k.(2^k - 1)
|L_k| = 2^{k+1} - (k+2)
|L_k| = o(|w_k|.|Σ_k|)
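The count of maximal locations on the words w_k can be checked numerically for small k. The sketch below counts maximal locations by brute force (quadratic time, for illustration only); the names `build_w` and `count_maximal_locations` are assumptions, and the characters a_1, ..., a_k are represented by the integers 1, ..., k.

```python
def build_w(k):
    """Build w_k with w_1 = a_1 and w_k = w_{k-1} a_k w_{k-1}; a_i is the integer i."""
    w = []
    for a in range(1, k + 1):
        w = w + [a] + w
    return w


def count_maximal_locations(s):
    """Count the pairs <i, j> whose neighbours s[i-1] and s[j+1], when they
    exist, are not in the fingerprint of s[i..j]."""
    n = len(s)
    count = 0
    for i in range(n):
        chars = set()
        for j in range(i, n):
            chars.add(s[j])
            left_ok = i == 0 or s[i - 1] not in chars
            right_ok = j == n - 1 or s[j + 1] not in chars
            if left_ok and right_ok:
                count += 1
    return count
```

For k = 3 the word is 1 2 1 3 1 2 1, of length 7, and it has 11 maximal locations, in agreement with 2^{k+1} - (k+2); the bound n|Σ| would allow 21.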
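Naming technique
Σ = {a,b,c,d,e,f,g,h}, log|Σ| + 1 levels
[Figure: naming tree over the leaves a b c d e f g h, naming three sets that contain e,f / e,f,g / e,g plus two further letters each; those letters were lost in extraction.]
Names = {[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]}
Fingerprints = {[7],[9],[10]}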
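The idea behind the naming technique can be sketched as follows: a set is written as a 0/1 vector over the alphabet, adjacent entries are paired level by level, and every distinct pair of names receives a fresh name, so that two sets are equal exactly when they get the same name at the root. This is only an illustrative sketch under those assumptions. The class `Namer` is hypothetical, it recomputes a name from scratch in O(|Σ|) time rather than updating one path of the tree when a character changes, and its numbering of names need not match the one in the figure.

```python
class Namer:
    """Assign one integer name per distinct set of characters."""

    def __init__(self, alphabet):
        self.letters = sorted(alphabet)
        self.width = 1
        while self.width < len(self.letters):  # pad to a power of two
            self.width *= 2
        self.names = {}      # (left_name, right_name) -> name
        self.next_name = 2   # 0 and 1 are reserved for the leaves

    def name(self, chars):
        # Leaf level: 1 if the letter is in the set, 0 otherwise.
        level = [1 if c in chars else 0 for c in self.letters]
        level += [0] * (self.width - len(level))
        # Pair adjacent names until a single name is left at the root.
        while len(level) > 1:
            upper = []
            for p in range(0, len(level), 2):
                pair = (level[p], level[p + 1])
                if pair not in self.names:
                    self.names[pair] = self.next_name
                    self.next_name += 1
                upper.append(self.names[pair])
            level = upper
        return level[0]
```

With an alphabet of 8 letters the loop runs 3 times, which gives the log|Σ| + 1 levels of the tree once the leaf level is counted.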
Amir, Apostolico, Landau, Satta 2003
k distinct characters
Changing a character: O(log|Σ| log n) (n new names maximum by level)
One iteration: n log|Σ| log n
|Σ| iterations: n |Σ| log|Σ| log n
Important: a different set of names for each iteration
k=2
Tsur 2005
List of fingerprints: [five fingerprints of sizes 1, 2, 3, 2, 3, with the pairs of names ([0],[1]), ([1],[1]), ([1],[0]) attached to the changed tree nodes; the letters were lost in extraction.]
List of changes: {([0],[0]), } {([0],[0]), } {([0],[1]), } {([1],[1]), } {([1],[0]), } {([1],[0]), } {([1],[1]), }
Radix sort on the pairs + unique > new names
Tsur 2005
List of changes: {([0],[0]), } {([0],[0]), } {([0],[1]), } {([1],[1]), } {([1],[0]), } {([1],[0]), } {([1],[1]), }
New names: [2] > ([0],[0]), [3] > ([0],[1]), [4] > ([1],[0]), [5] > ([1],[1])
New list: {[2], } {[2], } {[3], } {[5], } {[4], } {[4], } {[5], }
Next level, new list: {([2],[2]),C} {([2],[3]),C} {([2],[5]),C} {([4],[5]),C} {([4],[4]),C} {([5],[4]),C}
Radix sort, ...
Tsur 2005
Radix sort: O(n) (bounded integers)
One iteration: n log|Σ|
No more name search!
|Σ| iterations: n |Σ| log|Σ|
Problems: does not depend on |L|; distinct names at each iteration
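The step "radix sort on the pairs + unique > new names" can be sketched on the list of changes from the example, keeping only the pairs of names. This is a minimal illustration, not the full algorithm: the function names, the `bound` parameter (all names are assumed to be integers smaller than `bound`) and the `first_new_name` parameter are assumptions.

```python
def counting_sort(items, key, bound):
    """Stable sort of items by an integer key in range(bound)."""
    buckets = [[] for _ in range(bound)]
    for item in items:
        buckets[key(item)].append(item)
    return [item for bucket in buckets for item in bucket]


def rename_pairs(pairs, bound, first_new_name):
    """Radix sort the pairs, then give one new name per distinct pair.

    Returns the new names in the original order of the pairs, plus the
    table new name -> pair.
    """
    order = list(range(len(pairs)))
    # Least significant component first; stability makes the result lexicographic.
    order = counting_sort(order, lambda t: pairs[t][1], bound)
    order = counting_sort(order, lambda t: pairs[t][0], bound)
    names = [None] * len(pairs)
    table = {}
    name = first_new_name - 1
    previous = None
    for t in order:
        if pairs[t] != previous:  # first occurrence of this pair in sorted order
            name += 1
            previous = pairs[t]
            table[name] = previous
        names[t] = name
    return names, table
```

Called on the seven pairs (0,0), (0,0), (0,1), (1,1), (1,0), (1,0), (1,1) with new names starting at 2, it returns the names 2, 2, 3, 5, 4, 4, 5, the same new list as in the example.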
Our approach (2006)
Simple sequence: no repeated character
[Figure: lfo(i) drawn on an example sequence, for instance lfo(4) and lfo(2); the letters were lost in extraction.]
Concatenate # to the sequence
Bijection: L / proper prefixes of the lfo(i)
Compute all lfo(i) of S#
Our approach (2006)
How to calculate all lfo(i)?
[Figure: the lists lfo(i) of an example sequence terminated by #; the letters were lost in extraction.]
Our approach (2006)
Naming all proper prefixes of the lfo(i)
n lists: Tsur algorithm
Common names
Simple sequence: O(|L| log|Σ|)
General sequence: O(n + |L| log|Σ|)
|L| <= n|Σ|
Faster than or as fast as that of Tsur
Our approach (2006)
Properties and operations on our names:
+ a unique set of names
+ compute the LCP of two fingerprints in log|Σ|
+ names sorted by lexicographic order of fingerprints
Fingerprint trie
Chan et al., ESA 2007
O(|F|) space
O(|F| log|Σ|) time
Search in O(|f| log(|Σ|/|f|))
Back to common intervals:
1) Build the tree for the first sequence: O(n + |L_1| log|Σ|)
2) Build the tree for the second sequence: O(m + |L_2| log|Σ|)
3) Merge the two trees!
Complexity: O((n+m) + (|L_1| + |L_2|) log|Σ|) time.
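The three steps (index each sequence, then merge) can be sketched with a much simpler structure than the fingerprint tree. In this illustration, which is an assumption and not the data structure of the talk, each index is a dictionary from fingerprint to its maximal locations, built by brute force in quadratic time, and the merge keeps the fingerprints present in both dictionaries. The function names are hypothetical and positions are 0-based.

```python
def fingerprint_index(s):
    """Map each fingerprint of s to the list of its maximal locations (i, j)."""
    index = {}
    n = len(s)
    for i in range(n):
        chars = set()
        for j in range(i, n):
            chars.add(s[j])
            left_ok = i == 0 or s[i - 1] not in chars
            right_ok = j == n - 1 or s[j + 1] not in chars
            if left_ok and right_ok:
                index.setdefault(frozenset(chars), []).append((i, j))
    return index


def common_intervals(x, y):
    """Steps 1 and 2 build one index per sequence; step 3 merges them by
    keeping the fingerprints found in both."""
    index_x = fingerprint_index(x)
    index_y = fingerprint_index(y)
    shared = index_x.keys() & index_y.keys()
    return {f: (index_x[f], index_y[f]) for f in shared}
```

Keeping the two indexes separate until the merge reflects the design choice stated earlier in the talk: an index built once for a genome can be reused against other genomes.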
Open problems
+ Memory space reduction
+ Order?
+ Approximate fingerprints
+ Distance by fingerprints
+ 2D fingerprints