Aho-Corasick Automata
String Data Structures
Over the next few days, we're going to be exploring data structures specifically designed for string processing. These data structures and their variants are frequently used in practice.
Looking Forward
Today: Aho-Corasick Automata. A fast data structure for string matching.
Thursday: Suffix Trees. An absurdly versatile string data structure.
Tuesday: Suffix Arrays. Suffix-tree-like performance with array-like space usage.
String Searching
The String Searching Problem
Consider the following problem: Given a string T and k nonempty strings P₁, …, Pₖ, find all occurrences of P₁, …, Pₖ in T.
T is called the text string and P₁, …, Pₖ are called pattern strings.
This problem was originally studied in the context of compiling indexes, but has found applications in computer security and computational genomics.
[Figure: a set of pattern strings and a text string to search.]
Some Terminology
Let m = |T|, the length of the string to be searched.
Let n = |P₁| + |P₂| + … + |Pₖ| be the total length of all the pattern strings.
Let Lmax be the length of the longest pattern string.
Assume that strings are drawn from an alphabet Σ, where |Σ| is some constant.
We'll use these terms when talking about the runtime of the algorithms and data structures we'll explore over the next couple of days.
How quickly can we solve the string searching problem?
Let's start with a naïve approach.
[Figure: the pattern strings checked one at a time against each position of the text.]
For each position in T:
    For each pattern string Pᵢ:
        Check if Pᵢ appears at that position.
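The naïve approach above can be sketched directly. This is a minimal illustration, not code from the lecture; the function name `naive_search` and the choice to report matches as (start index, pattern) pairs are assumptions made for the example.

```python
def naive_search(text, patterns):
    """Return (start_index, pattern) for every occurrence of every pattern."""
    matches = []
    for start in range(len(text)):        # each position in T
        for pattern in patterns:          # each pattern string P_i
            # Compare P_i against the text beginning at this position.
            if text.startswith(pattern, start):
                matches.append((start, pattern))
    return matches
```

Each call to `startswith` can inspect up to the full length of the pattern, so one pass over the inner loop touches up to n pattern characters per text position.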
Analyzing Our Approach
As before, let m be the length of the text and n the total length of the pattern strings.
For each character of the text string T, in the worst case, we scan over all n total characters in the patterns.
Time complexity: O(mn).
Is this a tight bound?
Θ(mn)
[Figure: pattern strings and a text string for a worst-case example.]
Can we do better?
[Figure: the pattern strings and the text string.]
Parallel Searching
Idea: Rather than searching the pattern strings in serial, try searching them in parallel.
Intuitively, this should cut down on a lot of the unnecessary rescanning that we're doing.
Challenge: How exactly do we do this in practice?
[Figure: the pattern strings merged into a single tree, with shared prefixes sharing a path.]
This data structure is called a trie. It comes from the word retrieval. It is not pronounced like retrieval.
Representing Tries
Each trie node needs to store pointers to its children.
There are many different data structures we could use to store these pointers.
For today, we'll assume we have an array of |Σ| pointers, one per possible child.
You'll explore variants on this strategy in the problem set.
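As a rough sketch of the array-of-pointers representation, assuming the alphabet Σ is the 26 lowercase English letters: the names `ALPHABET`, `TrieNode`, `insert`, and `contains` are hypothetical and chosen only for this example.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed alphabet; the lecture only requires constant size


class TrieNode:
    def __init__(self):
        # One child slot per symbol of the alphabet; None means "no such child".
        self.children = [None] * len(ALPHABET)
        self.is_pattern = False  # True if a pattern string ends at this node


def child_index(ch):
    """Map a lowercase letter to its slot in the children array."""
    return ord(ch) - ord("a")


def insert(root, pattern):
    """Add one pattern, doing O(1) work per character."""
    node = root
    for ch in pattern:
        i = child_index(ch)
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.is_pattern = True


def contains(root, word):
    """Return True if word was inserted as a complete pattern."""
    node = root
    for ch in word:
        node = node.children[child_index(ch)]
        if node is None:
            return False
    return node.is_pattern
```

Because the alphabet has constant size, allocating a node costs O(1), so inserting every pattern costs time proportional to the total number of pattern characters.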
[Figure: the trie for the pattern strings, walked from successive positions of the text.]
Analyzing our New Algorithm
Let's suppose we've already constructed the trie. How much work is required to perform the match?
For each character of T, we inspect at most as many characters as exist in the deepest branch of the trie.
Time complexity: O(mLmax), where Lmax is the length of the longest pattern string. (Do you see why?)
In the (reasonable) case where Lmax is much smaller than n, this is a huge win over before.
If Lmax is objectively small, this is a pretty good runtime.
How much time does it take to build the trie?
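A minimal sketch of the trie-based search, restarting a walk from the root at every text position. For brevity it stores children in a dictionary rather than an array of |Σ| pointers; that substitution, and the names `Node`, `build_trie`, and `trie_search`, are assumptions of this example.

```python
class Node:
    def __init__(self):
        self.children = {}    # symbol -> child node (dict used here for brevity)
        self.pattern = None   # the pattern string ending at this node, if any


def build_trie(patterns):
    root = Node()
    for p in patterns:
        node = root
        for ch in p:
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
        node.pattern = p
    return root


def trie_search(text, root):
    """Return (start_index, pattern) pairs by walking the trie from each position."""
    matches = []
    for start in range(len(text)):
        node = root
        i = start
        # Walk down the trie for as long as the text keeps matching some branch.
        while i < len(text) and text[i] in node.children:
            node = node.children[text[i]]
            i += 1
            if node.pattern is not None:
                matches.append((start, node.pattern))
    return matches
```

The inner walk can go no deeper than the deepest branch of the trie, which is where the Lmax factor in the bound comes from.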
Building a Trie
Claim: Given a set of strings P₁, …, Pₖ of total length n, it's possible to build a trie for those strings in time Θ(n).
[Figure: a trie being built one character at a time.]
Our Strategies
Following our foray into RMQ, we'll say that a solution to multi-string matching runs in time ⟨p(m, n), q(m, n)⟩ if the preprocessing time is p(m, n) and the matching time is q(m, n).
We now have two approaches:
    No preprocessing: ⟨O(1), O(mn)⟩.
    Trie searching: ⟨O(n), O(mLmax)⟩.
Can we do better?
[Figure: a trie for a new set of pattern strings, with a red link drawn between two nodes.]
This red link is called a suffix link. We'll talk about them more formally in a minute.
[Figure: following a suffix link in the trie during the search.]
In general, suffix links might jump the red cursor forward more than one step. The number of steps taken is equal to the change of depth in the trie.
Suffix Links
A suffix link (sometimes called a failure link) is a red edge from a trie node corresponding to a string α to the trie node corresponding to a string ω such that ω is the longest proper suffix of α that is still in the trie.
Intuition: When we hit a part of the string where we cannot continue to read characters, we fall back by following suffix links to try to preserve as much context as possible.
Every node in the trie, except the root (which corresponds to the empty string ε), will have a suffix link associated with it.
Why Suffix Links Matter
Suffix links can substantially improve the performance of our string search.
At each step, we either advance the black (end) pointer forward in the trie, or advance the red (start) pointer forward.
Each pointer can advance forward at most O(m) times.
This reduces the amount of time spent scanning characters from O(mLmax) down to Θ(m).
This is only useful if we can compute suffix links quickly... which we'll see how to do later.
A Problem with our Optimization
[Figure: the trie with suffix links for the pattern strings i, in, tin, and sting, searching the text sting.]
What Happened?
Our heavily optimized string searcher no longer starts searching from each position in the string.
As a result, we now might forget to output matches in certain cases.
We need to figure out when this happens, and how to correct for it.
[Figure: the same trie, highlighting the nodes where matches were missed.]
We missed the pattern string i because it's a proper suffix of sti.
We missed both in and tin because each is a proper suffix of stin.
How do we address this?
[Figure: the trie for the pattern strings i, in, tin, and sting, with blue output links added.]
This blue arrow is called an output link. Whenever we visit this gold node, we'll output the string represented by the node at the end of the blue arrow.
By precomputing where we eventually need to end up, we can instantly read off any extra patterns to emit at this point. As you'll see, we can precompute these links really quickly!
Even nodes that themselves correspond to patterns might need output links if other patterns also end at the corresponding string.
Notice that the blue edges here form a linked list. If we visit this node, we need to output everything in the chain, not just the tin node we're immediately pointing at.
The Final Matching Algorithm
Start at the root node in the trie.
For each character c in the string:
    While there is no edge labeled c:
        If you're at the root, break out of this loop.
        Otherwise, follow a suffix link.
    If there is an edge labeled c, follow it.
    If the current node corresponds to a pattern, output that pattern.
    Output all words in the chain of output links originating at this node.
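The matching loop above can be sketched as follows. To keep the example short, each trie node is named by the string it spells, and the suffix and output links are filled in by brute force straight from their definitions; this is a correctness-oriented illustration under those assumptions, not a linear-time implementation. The names `build_links_by_definition` and `find_all` are hypothetical, and patterns are assumed to be nonempty.

```python
def build_links_by_definition(patterns):
    """Brute-force automaton in which every trie node is named by the string it spells."""
    pattern_set = set(patterns)
    # Trie nodes are exactly the prefixes of the patterns; "" is the root.
    nodes = {""} | {p[:i] for p in patterns for i in range(1, len(p) + 1)}
    suffix_link, output_link = {}, {}
    for alpha in nodes:
        # Proper suffixes of alpha, longest first (the last one is the empty string).
        proper_suffixes = [alpha[i:] for i in range(1, len(alpha) + 1)]
        if alpha:  # the root has no suffix link
            suffix_link[alpha] = next(s for s in proper_suffixes if s in nodes)
        output_link[alpha] = next((s for s in proper_suffixes if s in pattern_set), None)
    return nodes, pattern_set, suffix_link, output_link


def find_all(text, patterns):
    """Return (start_index, pattern) pairs in the order the automaton reports them."""
    nodes, pattern_set, suffix_link, output_link = build_links_by_definition(patterns)
    matches = []
    current = ""                                   # start at the root
    for end, c in enumerate(text, start=1):
        while current + c not in nodes:            # no edge labeled c
            if current == "":
                break                              # at the root: give up on this character
            current = suffix_link[current]         # otherwise, follow a suffix link
        if current + c in nodes:
            current += c                           # follow the edge labeled c
        if current in pattern_set:
            matches.append((end - len(current), current))
        out = output_link[current]
        while out is not None:                     # walk the chain of output links
            matches.append((end - len(out), out))
            out = output_link[out]
    return matches
```

For example, `find_all("sting", ["i", "in", "tin", "sting"])` reports i when the automaton reaches the node sti, then tin and in at stin, and finally sting.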
The Runtime Impact
The Runtime
In the worst case, we may have to spend a huge amount of time listing off all the matches in the string.
This isn't the fault of the algorithm; any algorithm that matches strings this way would have to spend the time reporting matches.
To account for this, let z denote the number of matches reported by our algorithm.
The runtime of the match phase is then Θ(m + z), with the m term coming from the string scanning and the z term coming from the matches.
You sometimes hear algorithms whose runtime depends on how much output is generated referred to as output-sensitive algorithms.
Where We Are
Given the matching automaton (which is called an Aho-Corasick automaton or an AC automaton), we can find all occurrences of the pattern strings in any text of length m in time Θ(m + z).
To see whether this is worthwhile, we need to see how quickly we can build the automaton.
Time-Out for Announcements!

Problem Set One
As a friendly reminder, Problem Set One is due this Thursday at 3:00PM.
All solutions must be submitted electronically through GradeScope. We strongly recommend leaving a few hours' buffer time so that you can get everything set up properly.
If you haven't started yet... you probably should go and do that.
We've got office hours throughout the week if you have questions, and you're welcome to ask questions on Piazza.

HackOverflow
Stanford WiCS is hosting HackOverflow, a hackathon for programmers of all skill levels.
It's coming up on Saturday, April 16 from 10AM to 10PM. Everyone is welcome!
Highly recommended! If you've never been to a hackathon before, this is one of the best places to start.
Want to attend? RSVP using this link.
Want to volunteer at the event or serve as a mentor? RSVP at this link.

oSTEM Mixer
Stanford's chapter of oSTEM (Out in STEM) is hosting a mixer event tomorrow, April 6, at 6PM at the LGBT-CRC.
Interested in attending? Want to get involved in oSTEM leadership? Feel free to stop on by! Everyone is welcome.
If you'd like to RSVP, you can use this link.

Back to CS166!
Building the Aho-Corasick Automaton

Building the Automaton
To construct the Aho-Corasick automaton, we need to construct the trie, construct suffix links, and construct output links.
We know we can build the trie in time Θ(n) using our logic from before.
How quickly can we construct suffix and output links?

Constructing Suffix Links
An Initial Algorithm
Here is a simple, brute-force approach for computing suffix links:
For each node in the trie:
    Let α be the string that this particular node corresponds to.
    For each proper suffix ω of α, from longest to shortest:
        Look up ω in the trie.
        If the search ends up at some trie node, point the suffix link there and stop.
This approach is not very efficient; that doubly-nested loop is exactly the sort of thing we're trying to avoid.
Can we do better?
[Figure: a trie for a set of pattern strings, with suffix links drawn in.]
Fast Suffix Link Construction
Constructing Suffix Links
Key insight: Suppose we know the suffix link for a node labeled w, and that it points to a node labeled x. After following a trie edge labeled a, there are two possibilities.
Case 1: xa exists. Then the suffix link for wa points to xa.
Case 2: xa does not exist. Then follow x's suffix link to a node y and check whether ya exists; if it does not, follow y's suffix link to a node z, and so on.
[Figure: nodes w and wa, with suffix links leading to x, y, and z in turn.]
Constructing Suffix Links
To construct the suffix link for a node wa:
    Follow w's suffix link to node x.
    If node xa exists, wa has a suffix link to xa.
    Otherwise, follow x's suffix link and repeat.
    If you need to follow backwards from the root, then wa's suffix link points to the root.
Observation 1: Suffix links point from longer strings to shorter strings.
Observation 2: If we precompute suffix links for nodes in ascending order of string length, all of the information needed for the above approach will be available at the time we need it.
Constructing Suffix Links
Do a breadth-first search of the trie, performing the following operations:
    If the node is the root, it has no suffix link.
    If the node is one hop away from the root, its suffix link points to the root.
    Otherwise, the node corresponds to some string wa. Let x be the node pointed at by w's suffix link. Then, do the following:
        If the node xa exists, wa's suffix link points to xa.
        Otherwise, if x is the root node, wa's suffix link points to the root.
        Otherwise, set x to the node pointed at by x's suffix link and repeat.
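The breadth-first construction above might look like the following sketch. It uses dictionary-based children for brevity, and each node stores the string it spells in a `label` field purely so the result can be inspected; both choices, along with the names `Node`, `build_trie`, and `add_suffix_links`, are assumptions of this example.

```python
from collections import deque


class Node:
    def __init__(self, label=""):
        self.label = label    # string spelled from the root (kept only for inspection)
        self.children = {}    # symbol -> child node
        self.suffix = None    # suffix link; stays None for the root


def build_trie(patterns):
    root = Node()
    for p in patterns:
        node = root
        for ch in p:
            if ch not in node.children:
                node.children[ch] = Node(node.label + ch)
            node = node.children[ch]
    return root


def add_suffix_links(root):
    """Fill in suffix links in breadth-first (ascending depth) order."""
    queue = deque([root])
    while queue:
        w = queue.popleft()
        for a, wa in w.children.items():
            if w is root:
                wa.suffix = root                   # one hop away from the root
            else:
                x = w.suffix
                # Walk back along suffix links until xa exists or we reach the root.
                while a not in x.children and x is not root:
                    x = x.suffix
                # Either xa exists, or x is the root and has no edge labeled a.
                wa.suffix = x.children.get(a, root)
            queue.append(wa)
```

Processing nodes in breadth-first order guarantees that w's suffix link, and the links of every shallower node the walk passes through, are already filled in when wa is handled.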
Analyzing Efficiency
How much time does it take to actually build all the suffix links?
When filling in any individual suffix link, we might have to keep walking backwards in the trie following suffix links repeatedly while searching for a place to extend.
Intuitively, it seems like it should be quadratic in the length of the longest string in the trie.
Is that bound tight?
Analyzing Efficiency
Claim: The previously-described algorithm for computing suffix links takes time O(n).
Intuition: Focus on any one word in the trie. As you add suffix links, keep track of the depth of the node pointed at by the current node's suffix link.
Construction Efficiency
Focus on the time to fill in the suffix links for a single pattern of length h.
The gold node (where the previous suffix link points) begins at the root.
At each step, the gold node takes some number of steps backward, then takes at most one step forward.
The gold node cannot take more steps backward than forward.
Therefore, across the entire construction, the gold node takes at most h steps backward.
Total time required to construct suffix links for a pattern of length h: O(h).
Total time required to construct all suffix links: O(n).
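One way to make the counting argument above explicit is to track depths along the path for a single pattern. The symbols d_i and b_i below are notation introduced only for this sketch: d_i is the depth of the node that the i-th node's suffix link points to, and b_i is the number of suffix links followed (backward steps) while computing that link.

```latex
% Each backward step lowers the gold node's depth by at least 1,
% and each link computation ends with at most one forward step.
\begin{align*}
d_1 &= 0, \\
d_i &\le d_{i-1} - b_i + 1 \qquad (2 \le i \le h), \\
\sum_{i=2}^{h} b_i &\le \sum_{i=2}^{h} \bigl(d_{i-1} - d_i + 1\bigr)
  = d_1 - d_h + (h - 1) \le h - 1 .
\end{align*}
```

The sum telescopes because every intermediate depth appears once with each sign, and the final inequality uses d_1 = 0 and d_h ≥ 0. So the total number of backward steps for one pattern is less than h, giving O(h) per pattern and O(n) overall.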
Computing Output Links
The Idea
Some trie nodes represent strings that have a pattern string as a proper suffix.
Our goal is to introduce output links so that, when these nodes are visited, the automaton outputs all the suffixes that end there.
Output Links, Formally
The output link at a node corresponding to a string w points to the node corresponding to the longest proper suffix of w that is a pattern, or null if no such suffix exists.
By always pointing to the node corresponding to the longest such word, we ensure that we chain together all the patterns using output links.
[Figure: the trie for the pattern strings i, in, tin, and sting, with a gold node and the blue node at the end of its suffix link.]
We want the gold node to point to the first node reachable by suffix links that's also a pattern. The blue node (at the end of the suffix link) isn't a pattern, but it knows where the first pattern is. We set the gold node's output link to equal the blue node's output link.
We have the gold node point to the blue node because the blue node corresponds to a word.
Filling In Output Links
Initially, set every node's output link to be a null pointer.
While doing the BFS to fill in suffix links, set the output link of the current node v as follows:
    Let u be the node pointed at by v's suffix link.
    If u corresponds to a pattern, set v's output link to u itself.
    Otherwise, set v's output link to u's output link.
Time complexity of building all output links: O(n).
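Putting the pieces together, here is a hedged sketch that builds the trie and fills in suffix and output links during a single breadth-first pass. Children are stored in dictionaries for brevity, and the names `Node`, `build_automaton`, and `outputs_at` are hypothetical.

```python
from collections import deque


class Node:
    def __init__(self):
        self.children = {}    # symbol -> child node
        self.suffix = None    # suffix link (None for the root)
        self.output = None    # output link (None means "null pointer")
        self.pattern = None   # pattern ending at this node, if any


def build_automaton(patterns):
    root = Node()
    for p in patterns:                             # build the trie
        node = root
        for ch in p:
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
        node.pattern = p

    queue = deque([root])                          # breadth-first search
    while queue:
        w = queue.popleft()
        for a, v in w.children.items():
            # Suffix link: walk back from w's suffix link until an edge labeled a exists.
            x = w.suffix
            while x is not None and a not in x.children:
                x = x.suffix
            v.suffix = x.children[a] if x is not None else root
            # Output link: u itself if u is a pattern, otherwise whatever u points to.
            u = v.suffix
            v.output = u if u.pattern is not None else u.output
            queue.append(v)
    return root


def outputs_at(node):
    """Patterns the automaton reports when it visits this node."""
    found = [node.pattern] if node.pattern is not None else []
    link = node.output
    while link is not None:                        # follow the chain of output links
        found.append(link.pattern)
        link = link.output
    return found
```

Since u is always shallower than v, its output link has already been set by the time v is processed, so each output link costs O(1) beyond the suffix-link work.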
The Net Complexity
Our preprocessing time is
    Θ(n) work to build the trie,
    O(n) work to fill in suffix links, and
    O(n) work to fill in output links.
Total preprocessing time: Θ(n).
The Final Totals
We now have a multi-string search data structure with time complexity ⟨O(n), O(m + z)⟩.
In other words, this is exceptionally good in the case where there is a fixed set of patterns and a variable string to search.
Where We're Going
A powerful data structure called the suffix tree lets us solve this problem in ⟨O(m), O(n + z)⟩.
In other words, it excels when there's a fixed string to search and a variable set of patterns.
More to Explore
There are a number of other approaches to solving this problem, and there's often a large gap between theory and practice!
The Boyer-Moore algorithm searches for a single pattern in a large text. It can actually run in sublinear time if the string searched for isn't present, but runs in quadratic time in the worst case if a match exists.
The Commentz-Walter algorithm generalizes Boyer-Moore for multiple strings and has similar time guarantees, but is faster in practice.
The Knuth-Morris-Pratt algorithm is a special case of the Aho-Corasick algorithm when there is just one pattern. You'll explore it on the upcoming problem set (after the TAs confirm it's not too difficult to derive it.)
Next Time
Suffix Trees: A highly versatile, flexible, powerful data structure for string processing.
Patricia Tries: Shrinking down trie space usage.
Applications of RMQ: Getting some mileage out of Fischer-Heun.