Exact Matching
exact matching: topics
  exact matching: search pattern P in text T (P, T strings)
  Knuth-Morris-Pratt: preprocessing pattern P
  Aho-Corasick: pattern consisting of several strings $P = \{P_1, \dots, P_r\}$
  Suffix Trees: preprocessing text T, or several texts (database)
(A) preprocessing patterns
  Knuth-Morris-Pratt
  Aho-Corasick
KMP example
  position k:    1 2 3 4 5 6 7 8
  failure link:  0 1 1 2 2 3 4 3
  failure links (suffix = prefix)
  top: as automaton; bottom: as table
  how to determine? how to use?
KMP computing failure links
  failure link ~ new best match (after mismatch)
  Flink[1] = 0;
  for k from 2 to PLen do
      fail = Flink[k-1]
      while ( fail>0 and P[fail] ≠ P[k-1] ) do
          fail = Flink[fail];
      od
      Flink[k] = fail+1;
  od
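A minimal Python sketch of the loop above, keeping the slide's 1-based Flink numbering (index 0 is unused). The names failure_links and kmp_search are illustrative. Two things are assumptions, not taken from the slides: the extra entry Flink[PLen+1], used here to continue after a full match, and the sample pattern abaababa, chosen because it yields the link values 0 1 1 2 2 3 4 3.

```python
def failure_links(P):
    """Return Flink[0..len(P)+1]; Flink[k] is meaningful for k >= 1 (1-based)."""
    m = len(P)
    flink = [0] * (m + 2)          # flink[1] = 0 as on the slide
    for k in range(2, m + 2):      # k = m + 1 is the assumed extra entry
        fail = flink[k - 1]
        # 1-based P[fail] and P[k-1] are P[fail-1] and P[k-2] in Python
        while fail > 0 and P[fail - 1] != P[k - 2]:
            fail = flink[fail]
        flink[k] = fail + 1
    return flink


def kmp_search(P, T):
    """Return the 0-based start positions of all occurrences of P in T."""
    m = len(P)
    if m == 0:
        return []
    flink = failure_links(P)
    hits = []
    k = 1                          # pattern position to compare next (1-based)
    for i, c in enumerate(T):
        while k > 0 and P[k - 1] != c:
            k = flink[k]           # mismatch: fall back along failure links
        k += 1                     # matched position k, or restart at 1 if k was 0
        if k == m + 1:             # whole pattern matched, ending at text index i
            hits.append(i - m + 1)
            k = flink[k]
    return hits
```

For instance, kmp_search("aba", "ababa") reports the two overlapping matches at 0-based positions 0 and 2.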
prefixes via failure links
  Flink[k] = r:  $P_1 \dots P_{r-1} = P_{k-r+1} \dots P_{k-1}$, with $r < k$ maximal
  all such values r:  $r_4 < r_3 < r_2 < r_1 < k$
  $P_1 \dots P_{r_2-1} = P_{k-r_2+1} \dots P_{k-1} = P_{r_1-r_2+1} \dots P_{r_1-1}$, hence Flink[$r_1$] = $r_2$
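The chain of values r can be checked with a small Python sketch. It repeats the failure-link loop so that it is self-contained; the names failure_links, link_chain and brute_force are illustrative, and the brute-force version is only there to compare against the definition.

```python
def failure_links(P):
    """1-based failure links Flink[1..len(P)]; index 0 is unused."""
    m = len(P)
    flink = [0] * (m + 1)
    for k in range(2, m + 1):
        fail = flink[k - 1]
        while fail > 0 and P[fail - 1] != P[k - 2]:
            fail = flink[fail]
        flink[k] = fail + 1
    return flink


def link_chain(P, k):
    """All r < k with P_1..P_{r-1} = P_{k-r+1}..P_{k-1}, largest first."""
    flink = failure_links(P)
    chain = []
    r = flink[k]
    while r > 0:                   # r_1 = Flink[k], r_2 = Flink[r_1], ...
        chain.append(r)
        r = flink[r]
    return chain


def brute_force(P, k):
    """The same set of values r, read directly off the definition."""
    return [r for r in range(k - 1, 0, -1) if P[:r - 1] == P[k - r:k - 1]]
```

With the assumed sample pattern abaababa and k = 8, both functions give [3, 1]: the prefix ab is also a suffix of abaabab, and r = 1 stands for the empty match.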
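other methods
  Boyer-Moore: work backwards
    T = marktkoopman
    P = schoenveter
  Karp-Rabin: fingerprint
    text window $t_i \dots t_{i+n-1}$ against pattern $p_1 \dots p_n$
    hash value of the window at i:    $t_i B^{n-1} + t_{i+1} B^{n-2} + \dots + t_{i+n-1} B^0$
    hash value of the window at i+1:  $t_{i+1} B^{n-1} + \dots + t_{i+n-1} B^1 + t_{i+n} B^0$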
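The step from one fingerprint to the next can be sketched in Python as follows. The slide only gives the polynomial in the base B; the modulus M, the default values of B and M, and the name karp_rabin are assumptions made here to keep the numbers small. Every fingerprint hit is verified by a direct comparison, so a hash collision cannot produce a false match.

```python
def karp_rabin(P, T, B=256, M=(1 << 61) - 1):
    """Return the 0-based start positions of all occurrences of P in T."""
    n, N = len(P), len(T)
    if n == 0 or n > N:
        return []
    top = pow(B, n - 1, M)         # B^(n-1), the weight of the leading symbol
    hp = ht = 0
    for j in range(n):             # fingerprints of P and of the first window
        hp = (hp * B + ord(P[j])) % M
        ht = (ht * B + ord(T[j])) % M
    hits = []
    for i in range(N - n + 1):
        if hp == ht and T[i:i + n] == P:
            hits.append(i)
        if i + n < N:
            # drop T[i] * B^(n-1), shift by B, add the new symbol T[i+n]
            ht = ((ht - ord(T[i]) * top) * B + ord(T[i + n])) % M
    return hits
```

Each window update costs a constant number of arithmetic operations, which is the point of the fingerprint: the text is scanned once, and full comparisons happen only where the fingerprints agree.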
exact matching with set of patterns
  $P = \{P_1, \dots, P_r\}$, total length m
  all occurrences in text T, length n
  AHO-CORASICK generalizes KMP
  failure links: longest suffix that is a prefix (perhaps in another string)
  assumption: no subwords within P (no pattern occurs inside another)
keyword tree - trie
  edges ~ letters
  leaves ~ keywords
  { potato, poetry, pottery, science, school }
  [figure: keyword tree for these five keywords, leaves numbered 1 to 5]
failure links
  { potato, tattoo, theater, other }
  failure links in other branches!
  [figure: keyword tree for these four keywords with failure links]
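algorithm: follow the links
  existing: the failure links computed so far
  new: a node reached by an edge with some incoming letter
  follow the links, starting at the parent, until a node with that outgoing letter is found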
failure links
  { potato, tattoo, theater, other }
  breadth first (level-by-level)
failure links
  { potato, tattoo, theater, other }
  child of root: link to root [single letter]
  shortcuts
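The construction on these slides (keyword tree, then failure links computed breadth first by following the parent's links) can be sketched in Python as below. The representation with three parallel lists and the names build and search are assumptions of this sketch. It also merges the outputs reachable over failure links, so it still reports every match when one pattern occurs inside another, a case the slides set aside.

```python
from collections import deque


def build(patterns):
    """Keyword tree with failure links; nodes are integers, 0 is the root."""
    goto, fail, out = [{}], [0], [[]]
    for idx, p in enumerate(patterns):
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(idx)
    queue = deque(goto[0].values())    # children of the root link to the root
    while queue:                       # breadth first, level by level
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]                # start at the parent's failure link
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
            queue.append(t)
    return goto, fail, out


def search(patterns, text):
    """Return (0-based start, pattern) for every occurrence in text."""
    goto, fail, out = build(patterns)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for idx in out[s]:
            hits.append((i - len(patterns[idx]) + 1, patterns[idx]))
    return hits
```

A failure link always points to a node closer to the root, which is why the level-by-level order works: when a node is handled, the links of all shallower nodes are already final.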
(B) preprocessing text
trie vs. suffix tree
  string + suffixes: trie, suffix tree
  www.cs.helsinki.fi/u/ukkonen/erice2005.ppt
trie vs. tree
  $|Trie(T)| = O(|T|^2)$, quadratic
  bad example: $T = a^n b^n$
  Trie(T) is like a DFA for the suffixes of T
  minimize DFA: directed acyclic word graph
  suffix tree: only branching nodes and leaves represented
  edges labeled by substrings of T
  correspondence of leaves and suffixes
  |T| leaves, hence < |T| internal nodes
  $|Tree(T)| = O(|T| + \text{size(edge labels)})$, linear
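The quadratic size of the trie for the bad example can be checked directly. The sketch below builds the uncompressed suffix trie as nested dictionaries and counts its nodes; the name suffix_trie_size is illustrative. For $T = a^n b^n$ the distinct substrings are $a^i b^j$ with $0 \le i, j \le n$, so the trie has $(n+1)^2$ nodes including the root.

```python
def suffix_trie_size(T):
    """Number of nodes (root included) in the uncompressed suffix trie of T."""
    root = {}
    count = 1                      # the root
    for i in range(len(T)):        # insert the suffix starting at index i
        node = root
        for ch in T[i:]:
            if ch not in node:
                node[ch] = {}
                count += 1
            node = node[ch]
    return count
```

For n = 100 this gives 10201 nodes for a text of only 200 symbols, while a suffix tree of the same text has at most 200 leaves and fewer than 200 internal nodes.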
nittygritty
  suffixes:
     1 nittygritty
     2 ittygritty
     3 ttygritty
     4 tygritty
     5 ygritty
     6 gritty
     7 ritty
     8 itty
     9 tty
    10 ty
    11 y
  [figure: suffix tree of nittygritty; edges labeled by substrings, leaves labeled by the start positions of the suffixes]
nittygritty
  [figure: the same suffix tree, each edge label also given as a range of positions in T]
  edge labels: nittygritty 1-11, gritty 6-11, itty 2-5, ty 4-5, t 3-3, y 5-5
  implementation: refer to positions
linear time construction
  Weiner (1973): 'algorithm of the year'
  McCreight (1976)
  on-line algorithm (Ukkonen 1992)
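suffix trie
  suffix links
  next symbol: from here on it already exists
  [figure: suffix trie of an example string, with suffix links, while the next symbol is added]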
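application: full text index
  P occurs in T if and only if P is a prefix of a suffix of T
  subtree under P ~ locations of P
  [figure: path labeled P from the root of the tree for T; the leaves below it give the positions]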
example: find i in nittygritty
  [figure: suffix tree of nittygritty; the leaves in the subtree under the pattern give the positions]
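The lookup can be tried out with a deliberately simple stand-in: the uncompressed suffix trie as nested dictionaries rather than a suffix tree (quadratic size, but the same "walk down P, then collect the leaves" idea). The names build_suffix_trie, find and the marker key LEAF are assumptions of this sketch; positions are 1-based as on the slides.

```python
LEAF = "$leaf"                     # marker key; cannot clash with a single symbol


def build_suffix_trie(T):
    """Nested-dict trie of all suffixes; each suffix end stores its start position."""
    root = {}
    for i in range(len(T)):
        node = root
        for ch in T[i:]:
            node = node.setdefault(ch, {})
        node[LEAF] = i + 1         # 1-based start of this suffix
    return root


def find(root, P):
    """Sorted 1-based positions where P occurs in the indexed text."""
    node = root
    for ch in P:                   # P must be a prefix of some suffix
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:                   # collect every suffix end below the node
        cur = stack.pop()
        for key, val in cur.items():
            if key == LEAF:
                found.append(val)
            else:
                stack.append(val)
    return sorted(found)
```

On nittygritty, find(root, "i") and find(root, "itt") both return [2, 8], and find(root, "t") returns [3, 4, 9, 10].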
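application: longest common substring
  generalized suffix tree for T and P (mark which suffixes come from T and which from P)
  the deepest node with suffixes of both strings below it spells a longest common substring
  [figure: generalized suffix tree for two example strings]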
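A small Python sketch of the marking idea, again using an uncompressed trie of all suffixes of both strings in place of a real generalized suffix tree (so it is quadratic, not linear). Each node carries a bit mask telling which string passes through it; the deepest node with both bits set spells a longest common substring. The function name and the example strings are assumptions of this sketch.

```python
def longest_common_substring(A, B):
    """Return one longest string that occurs in both A and B."""
    root = [{}, 0]                         # node = [children, mask]
    for bit, S in ((1, A), (2, B)):
        for i in range(len(S)):            # insert every suffix of S
            node = root
            for ch in S[i:]:
                node = node[0].setdefault(ch, [{}, 0])
                node[1] |= bit             # mark: a suffix of S passes here
    best = ""
    stack = [(root, "")]
    while stack:                           # visit only nodes marked by both
        node, label = stack.pop()
        for ch, child in node[0].items():
            if child[1] == 3:
                s = label + ch
                if len(s) > len(best):
                    best = s
                stack.append((child, s))
    return best
```

For example, longest_common_substring("apples", "maple") returns "ple".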
application: counting motifs
  [figure: suffix tree of nittygritty; each internal node is annotated with the number of leaves below it, i.e. the number of occurrences of its substring]
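The counting can be sketched in Python with the same simplification as before: an uncompressed suffix trie in which every node stores how many suffixes pass through it, which equals the number of (possibly overlapping) occurrences of the substring it spells. The names build_counts, count_occurrences and longest_repeat are illustrative.

```python
def build_counts(T):
    """Suffix trie of T; node = [children, number of suffixes through it]."""
    root = [{}, 0]
    for i in range(len(T)):
        node = root
        for ch in T[i:]:
            node = node[0].setdefault(ch, [{}, 0])
            node[1] += 1
    return root


def count_occurrences(root, P):
    """Number of occurrences of the non-empty string P in the indexed text."""
    if not P:
        return 0
    node = root
    for ch in P:
        if ch not in node[0]:
            return 0
        node = node[0][ch]
    return node[1]


def longest_repeat(root):
    """One longest substring that occurs at least twice."""
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node[0].items():
            if child[1] >= 2:              # repeated: keep descending
                s = label + ch
                if len(s) > len(best):
                    best = s
                stack.append((child, s))
    return best
```

On nittygritty the counts are 4 for t, 2 for tt and 2 for itty, and the longest repeat is itty (positions 2 and 8).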
motifs: repeats in DNA
  as reported by Ukkonen:
  human chromosome 3, the first 48 999 930 bases
  31 min cpu time (8 processors, 4 GB)
  human genome: 3x10^9 bases
  suffix tree for Human Genome feasible
longest repeat?
  Occurrences at: 28395980, 28401554r
  Length: 2559
  [DNA sequence of the repeat]
ten occurrences?
  [DNA sequence of the repeat]
  Length: 277
  Occurrences at: 10130003, 11421803, 18695837, 26652515, 42971130, 47398125
  In the reversed complement at: 17858493, 41463059, 42431718, 42580925
finally
  suffix tree: efficient (linear) storage, but with a constant of about 40: large overhead
  suffix array: has a constant of about 5, hence more practical, but has its own complications
  naïve n log(n) algorithm: not bad
suffix array
  nittygritty
  lexicographic order of the suffixes:
     6 gritty
     8 itty
     2 ittygritty
     1 nittygritty
     7 ritty
     9 tty
     3 ttygritty
    10 ty
     4 tygritty
    11 y
     5 ygritty
  suffix array: 6 8 2 1 7 9 3 10 4 11 5
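A naïve construction, sketched in Python: sort the start positions by the suffixes they begin. This uses O(n log n) suffix comparisons, each of which may itself take up to n steps, so it is meant for illustration only. Pattern lookup is two binary searches over the sorted suffixes. The names suffix_array and find are assumptions; positions are 1-based as on the slides.

```python
def suffix_array(T):
    """1-based start positions of the suffixes of T in lexicographic order."""
    return sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])


def find(T, sa, P):
    """Sorted 1-based positions of P in T, using the suffix array sa."""
    def prefix(j):                 # first len(P) symbols of the j-th smallest suffix
        start = sa[j] - 1
        return T[start:start + len(P)]

    lo, hi = 0, len(sa)
    while lo < hi:                 # first suffix whose prefix is >= P
        mid = (lo + hi) // 2
        if prefix(mid) < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                 # first suffix whose prefix is > P
        mid = (lo + hi) // 2
        if prefix(mid) <= P:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])
```

For nittygritty, suffix_array returns [6, 8, 2, 1, 7, 9, 3, 10, 4, 11, 5], and find for the pattern itt returns [2, 8]. All suffixes that start with P are adjacent in the array, which is what makes the binary search work.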
sources
  Dan Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology
    lists many applications for suffix trees (and extended implementation details)
  slides on suffix trees based on/copied from Esko Ukkonen, Univ Helsinki (Erice School, 30 Oct 2005)