Aho-Corasick Automata


String Data Structures
Over the next few days, we're going to be exploring data structures specifically designed for string processing. These data structures and their variants are frequently used in practice.

Looking Forward
Today: Aho-Corasick Automata. A fast data structure for string matching.
Thursday: Suffix Trees. An absurdly versatile string data structure.
Tuesday: Suffix Arrays. Suffix-tree-like performance with array-like space usage.

String Searching

The String Searching Problem
Consider the following problem: Given a string T and k nonempty strings P₁, …, Pₖ, find all occurrences of P₁, …, Pₖ in T. T is called the text string and P₁, …, Pₖ are called pattern strings. This problem was originally studied in the context of compiling indexes, but has found applications in computer security and computational genomics.

[Figure: Pattern Strings]

Some Terminology
Let m = |T|, the length of the string to be searched.
Let n = |P₁| + |P₂| + … + |Pₖ| be the total length of all the pattern strings.
Let Lmax be the length of the longest pattern string.
Assume that strings are drawn from an alphabet Σ, where |Σ| is some constant.
We'll use these terms when talking about the runtime of the algorithms and data structures we'll explore over the next couple of days.

How quickly can we solve the string searching problem?

Let's start with a naïve approach.

[Figure: Pattern Strings]
For each position in T:
    For each pattern string Pᵢ:
        Check if Pᵢ appears at that position.

Analyzing Our Approach
As before, let m be the length of the text and n the total length of the pattern strings. For each character of the text string T, in the worst case, we scan over all n total characters in the patterns. Time complexity: O(mn). Is this a tight bound?
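The naive approach can be sketched in a few lines of Python. This is a minimal illustration rather than code from the lecture: the function name `naive_search` and the (start index, pattern) output format are assumptions made for the example.

```python
def naive_search(text, patterns):
    """Return (start_index, pattern) for every occurrence of every pattern."""
    matches = []
    for start in range(len(text)):        # each position in T
        for p in patterns:                # each pattern string P_i
            # The slice comparison inspects at most len(p) characters,
            # so one text position costs at most n character checks.
            if text[start:start + len(p)] == p:
                matches.append((start, p))
    return matches
```

Summed over all m positions, the inner loop gives the O(mn) bound stated above.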

Θ(mn)
[Figure: Pattern Strings]

Can we do better?

[Figure: Pattern Strings]

Parallel Searching
Idea: Rather than searching the pattern strings in serial, try searching them in parallel. Intuitively, this should cut down on a lot of the unnecessary rescanning that we're doing.
Challenge: How exactly do we do this in practice?

[Figure: Pattern Strings]
This data structure is called a trie. It comes from the word retrieval. It is not pronounced like retrieval.

Representing Tries
Each trie node needs to store pointers to its children. There are many different data structures we could use to store these pointers. For today, we'll assume we have an array of |Σ| pointers, one per possible child. You'll explore variants on this strategy in the problem set.
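As a rough sketch of the array-of-|Σ|-pointers representation, assume for illustration that Σ is the 26 lowercase English letters (the lecture does not fix a particular alphabet). The class name `ArrayTrieNode` and its methods are hypothetical.

```python
ALPHABET_SIZE = 26  # assumed |Σ|: lowercase English letters only


class ArrayTrieNode:
    def __init__(self):
        # One slot per possible child; None means "no edge with that label".
        self.children = [None] * ALPHABET_SIZE
        self.is_pattern = False

    def child(self, ch):
        """Return the child along the edge labeled ch, or None."""
        return self.children[ord(ch) - ord("a")]

    def add_child(self, ch):
        """Return the child along the edge labeled ch, creating it if needed."""
        index = ord(ch) - ord("a")
        if self.children[index] is None:
            self.children[index] = ArrayTrieNode()
        return self.children[index]
```

With this layout, following an edge is a single array lookup, at the cost of |Σ| pointers of space per node.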


[Figure: Pattern Strings]

Analyzing our New Algorithm
Let's suppose we've already constructed the trie. How much work is required to perform the match? For each character of T, we inspect at most as many characters as exist in the deepest branch of the trie. Time complexity: O(m · Lmax), where Lmax is the length of the longest pattern string. (Do you see why?) In the (reasonable) case where Lmax is much smaller than n, this is a huge win over before. If Lmax is objectively small, this is a pretty good runtime. How much time does it take to build the trie?

Building a Trie
Claim: Given a set of strings P₁, …, Pₖ of total length n, it's possible to build a trie for those strings in time Θ(n).
[Figure: trie construction example]
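To make the trie-based search concrete, here is a minimal sketch of building the trie and then walking it from every start position. It uses nested dictionaries for the child pointers instead of a fixed-size array, purely for brevity; the names `build_trie`, `trie_search`, and `END` are assumptions for this example.

```python
END = None  # dictionary key marking "a pattern ends at this node"


def build_trie(patterns):
    """Build a nested-dict trie; each pattern character is handled once."""
    root = {}
    for p in patterns:
        node = root
        for ch in p:
            node = node.setdefault(ch, {})   # reuse the child or create it
        node[END] = p
    return root


def trie_search(text, root):
    """Walk the trie from every start position: O(m * Lmax) character steps."""
    matches = []
    for start in range(len(text)):
        node, i = root, start
        while True:
            if END in node:                  # a pattern ends at this node
                matches.append((start, node[END]))
            if i == len(text) or text[i] not in node:
                break                        # cannot extend this branch
            node = node[text[i]]
            i += 1
    return matches
```

Each walk descends at most Lmax levels, which is where the O(m · Lmax) matching time comes from.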

Our Strategies
Following our foray into RMQ, we'll say that a solution to multi-string matching runs in time ⟨p(m, n), q(m, n)⟩ if the preprocessing time is p(m, n) and the matching time is q(m, n). We now have two approaches:
No preprocessing: ⟨O(1), O(mn)⟩.
Trie searching: ⟨O(n), O(m · Lmax)⟩.
Can we do better?

[Figure: Pattern Strings]
This red link is called a suffix link. We'll talk about them more formally in a minute.

[Figure: Pattern Strings]
In general, suffix links might jump the red cursor forward more than one step. The number of steps taken is equal to the change of depth in the trie.

Suffix Links
A suffix link (sometimes called a failure link) is a red edge from a trie node corresponding to a string α to the trie node corresponding to a string ω such that ω is the longest proper suffix of α that is still in the trie.
Intuition: When we hit a part of the string where we cannot continue to read characters, we fall back by following suffix links to try to preserve as much context as possible.
Every node in the trie, except the root (which corresponds to the empty string ε), will have a suffix link associated with it.

Why Suffix Links Matter
Suffix links can substantially improve the performance of our string search. At each step, we either advance the black (end) pointer forward in the trie, or advance the red (start) pointer forward. Each pointer can advance forward at most O(m) times. This reduces the amount of time spent scanning characters from O(m · Lmax) down to Θ(m). This is only useful if we can compute suffix links quickly... which we'll see how to do later.

A Problem with our Optimization

[Figure: Pattern Strings i, in, tin, sting]

What Happened?
Our heavily optimized string searcher no longer starts searching from each position in the string. As a result, we now might forget to output matches in certain cases. We need to figure out when this happens, and how to correct for it.

[Figure: Pattern Strings i, in, tin, sting]
We missed the pattern string i because it's a proper suffix of sti.
We missed both tin and in because each is a proper suffix of stin.

How do we address this?

[Figure: Pattern Strings i, in, tin, sting]
This blue arrow is called an output link. Whenever we visit this gold node, we'll output the string represented by the node at the end of the blue arrow.

By precomputing where we eventually need to end up, we can instantly read off any extra patterns to emit at this point. As you'll see, we can precompute these links really quickly!

Even nodes that themselves correspond to patterns might need output links if other patterns also end at the corresponding string.

Notice that the blue edges here form a linked list. If we visit this node, we need to output everything in the chain, not just the tin node we're immediately pointing at.

The Final Matching Algorithm
Start at the root node in the trie.
For each character c in the string:
    While there is no edge labeled c:
        If you're at the root, break out of this loop.
        Otherwise, follow a suffix link.
    If there is an edge labeled c, follow it.
    If the current node corresponds to a pattern, output that pattern.
    Output all words in the chain of output links originating at this node.
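The matching loop above can be sketched directly in Python. To keep the focus on the loop itself, this sketch makes a simplifying assumption that is not from the lecture: each trie node is represented by the string it spells, and suffix and output links are recomputed from their definitions on demand (slow, but easy to check by hand). The names `suffix_link`, `output_link`, and `match` are hypothetical.

```python
def suffix_link(node, nodes):
    """Longest proper suffix of `node` that is also a trie node."""
    for k in range(1, len(node) + 1):
        if node[k:] in nodes:
            return node[k:]          # k == len(node) gives "", the root


def output_link(node, patterns):
    """Longest proper suffix of `node` that is a pattern, or None."""
    for k in range(1, len(node)):
        if node[k:] in patterns:
            return node[k:]
    return None


def match(text, patterns):
    patterns = set(patterns)
    # The trie nodes are exactly the prefixes of the patterns.
    nodes = {p[:i] for p in patterns for i in range(len(p) + 1)}
    found = []
    node = ""                                  # start at the root
    for end, c in enumerate(text):
        while node + c not in nodes:           # no edge labeled c
            if node == "":
                break                          # at the root: give up on c
            node = suffix_link(node, nodes)
        if node + c in nodes:
            node = node + c                    # follow the edge labeled c
        if node in patterns:
            found.append((end - len(node) + 1, node))
        out = output_link(node, patterns)      # walk the output-link chain
        while out is not None:
            found.append((end - len(out) + 1, out))
            out = output_link(out, patterns)
    return found
```

For the patterns i, in, tin, sting and the text sting, `sorted(match("sting", ["i", "in", "tin", "sting"]))` gives `[(0, 'sting'), (1, 'tin'), (2, 'i'), (2, 'in')]`, so the matches that the suffix-link-only search skipped are recovered through output links.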

The Runtime Impact

[Figure: Pattern Strings]

The Runtime
In the worst case, we may have to spend a huge amount of time listing off all the matches in the string. This isn't the fault of the algorithm: any algorithm that matches strings this way would have to spend the time reporting matches. To account for this, let z denote the number of matches reported by our algorithm. The runtime of the match phase is then Θ(m + z), with the m term coming from the string scanning and the z term coming from the matches. You sometimes hear algorithms whose runtime depends on how much output is generated referred to as output-sensitive algorithms.

Where We Are
Given the matching automaton (which is called an Aho-Corasick automaton or an AC automaton), we can find all occurrences of the pattern strings in any text of length m in time Θ(m + z). To see whether this is worthwhile, we need to see how quickly we can build the automaton.

Time-Out for Announcements!

Problem Set One
As a friendly reminder, Problem Set One is due this Thursday at 3:00PM. All solutions must be submitted electronically through GradeScope. We strongly recommend leaving a few hours' buffer time so that you can get everything set up properly. If you haven't started yet... you probably should go and do that. We've got office hours throughout the week if you have questions, and you're welcome to ask questions on Piazza.

HackOverflow
Stanford WiCS is hosting HackOverflow, a hackathon for programmers of all skill levels. It's coming up on Saturday, April 16 from 10AM to 10PM. Everyone is welcome! Highly recommended! If you've never been to a hackathon before, this is one of the best places to start. Want to attend? RSVP using this link. Want to volunteer at the event or serve as a mentor? RSVP at this link.

oSTEM Mixer
Stanford's chapter of oSTEM (Out in STEM) is hosting a mixer event tomorrow, April 6, at 6PM at the LGBT-CRC. Interested in attending? Want to get involved in oSTEM leadership? Feel free to stop on by! Everyone is welcome. If you'd like to RSVP, you can use this link.

Back to CS166!

Building the Aho-Corasick Automaton

Building the Automaton
To construct the Aho-Corasick automaton, we need to construct the trie, construct suffix links, and construct output links. We know we can build the trie in time Θ(n) using our logic from before. How quickly can we construct suffix and output links?

Constructing Suffix Links

An Initial Algorithm
Here is a simple, brute-force approach for computing suffix links:
For each node in the trie:
    Let α be the string that this particular node corresponds to.
    For each proper suffix ω of α (longest first):
        Look up ω in the trie.
        If the search ends up at some trie node, point the suffix link there and stop.
This approach is not very efficient: that doubly-nested loop is exactly the sort of thing we're trying to avoid. Can we do better?

[Figure: Pattern Strings]

Fast Suffix Link Construction

Constructing Suffix Links
Key insight: Suppose we know the suffix link for a node labeled w, and that it points to a node labeled x. After following a trie edge labeled a, there are two possibilities.
Case 1: xa exists.
[Figure: suffix link from w to x]
Case 2: xa does not exist.
[Figure: suffix links from w to x, then on to y and z]

Constructing Suffix Links
To construct the suffix link for a node wa:
    Follow w's suffix link to node x.
    If node xa exists, wa has a suffix link to xa.
    Otherwise, follow x's suffix link and repeat.
    If you need to follow backwards from the root, then wa's suffix link points to the root.
Observation 1: Suffix links point from longer strings to shorter strings.
Observation 2: If we precompute suffix links for nodes in ascending order of string length, all of the information needed for the above approach will be available at the time we need it.

Constructing Suffix Links
Do a breadth-first search of the trie, performing the following operations:
If the node is the root, it has no suffix link.
If the node is one hop away from the root, its suffix link points to the root.
Otherwise, the node corresponds to some string wa. Let x be the node pointed at by w's suffix link. Then, do the following:
    If the node xa exists, wa's suffix link points to xa.
    Otherwise, if x is the root node, wa's suffix link points to the root.
    Otherwise, set x to the node pointed at by x's suffix link and repeat.
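The breadth-first construction can be sketched as follows. This is an illustrative version under assumed names (`Node`, `build_trie_with_suffix_links`, and the helper `walk` are not from the lecture), with child pointers stored in a dictionary rather than an array.

```python
from collections import deque


class Node:
    def __init__(self):
        self.children = {}   # edge label -> child Node
        self.suffix = None   # suffix link; stays None only for the root


def build_trie_with_suffix_links(patterns):
    root = Node()
    for p in patterns:                       # ordinary trie insertion
        node = root
        for ch in p:
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]

    queue = deque()
    for child in root.children.values():     # one hop from the root
        child.suffix = root
        queue.append(child)
    while queue:                              # BFS: shorter strings first
        w = queue.popleft()
        for a, wa in w.children.items():
            x = w.suffix
            while a not in x.children and x is not root:
                x = x.suffix                  # xa missing: fall back again
            # Either xa exists, or we fell off at the root.
            wa.suffix = x.children.get(a, root)
            queue.append(wa)
    return root


def walk(root, s):
    """Return the node spelling s (assumes s is a prefix of some pattern)."""
    node = root
    for ch in s:
        node = node.children[ch]
    return node
```

Because nodes are dequeued in order of increasing depth, every suffix link the inner loop follows has already been filled in, which is Observation 2 in action.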

Analyzing Efficiency
How much time does it take to actually build all the suffix links? When filling in any individual suffix link, we might have to keep walking backwards in the trie, following suffix links repeatedly while searching for a place to extend. Intuitively, it seems like it should be quadratic in the length of the longest string in the trie. Is that bound tight?

Analyzing Efficiency
Claim: The previously-described algorithm for computing suffix links takes time O(n).
Intuition: Focus on any one word in the trie. As you add suffix links, keep track of the depth of the node pointed at by the current node's suffix link.


Construction Efficiency
Focus on the time to fill in the suffix links for a single pattern of length h. The gold node (where the previous suffix link points) begins at the root. At each step, the gold node takes some number of steps backward, then takes at most one step forward. The gold node cannot take more steps backward than forward. Therefore, across the entire construction, the gold node takes at most h steps backward.
Total time required to construct suffix links for a pattern of length h: O(h).
Total time required to construct all suffix links: O(n).

Computing Output Links

The Idea
Some trie nodes represent strings that have a pattern string as a proper suffix. Our goal is to introduce output links so that, when these nodes are visited, the automaton outputs all the suffixes that end there.

Output Links, Formally
The output link at a node corresponding to a string w points to the node corresponding to the longest proper suffix of w that is a pattern, or null if no such suffix exists. By always pointing to the node corresponding to the longest such word, we ensure that we chain together all the patterns using output links.

[Figure: Pattern Strings i, in, tin, sting]
We want the gold node to point to the first node reachable by suffix links that's also a pattern. The blue node (at the end of the suffix link) isn't a pattern, but it knows where the first pattern is. We set the gold node's output link to equal the blue node's output link.

[Figure: Pattern Strings i, in, tin, sting]
We have the gold node point to the blue node because the blue node corresponds to a word.

Filling In Output Links
Initially, set every node's output link to be a null pointer.
While doing the BFS to fill in suffix links, set the output link of the current node v as follows:
    Let u be the node pointed at by v's suffix link.
    If u corresponds to a pattern, set v's output link to u itself.
    Otherwise, set v's output link to u's output link.
Time complexity of building all output links: O(n).
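Putting the pieces together, the following sketch builds the trie, fills in suffix and output links in a single BFS, and then runs the matching algorithm. It is one possible arrangement under assumed names (`ACNode`, `build_automaton`, `find_all`), not the lecture's reference implementation, and it stores children in dictionaries rather than arrays.

```python
from collections import deque


class ACNode:
    def __init__(self):
        self.children = {}    # edge label -> child
        self.suffix = None    # suffix (failure) link
        self.output = None    # output link, or None
        self.pattern = None   # the pattern ending here, if any


def build_automaton(patterns):
    root = ACNode()
    for p in patterns:                            # build the trie
        node = root
        for ch in p:
            if ch not in node.children:
                node.children[ch] = ACNode()
            node = node.children[ch]
        node.pattern = p

    queue = deque([root])
    while queue:                                  # BFS over the trie
        w = queue.popleft()
        for a, v in w.children.items():
            if w is root:
                v.suffix = root                   # one hop from the root
            else:
                x = w.suffix
                while a not in x.children and x is not root:
                    x = x.suffix
                v.suffix = x.children.get(a, root)
            u = v.suffix                          # fill in the output link
            v.output = u if u.pattern is not None else u.output
            queue.append(v)
    return root


def find_all(text, root):
    """Return (start_index, pattern) for every match in text."""
    found = []
    node = root
    for end, c in enumerate(text):
        while c not in node.children and node is not root:
            node = node.suffix
        node = node.children.get(c, node)         # stays at root if no edge
        if node.pattern is not None:
            found.append((end - len(node.pattern) + 1, node.pattern))
        out = node.output
        while out is not None:                    # follow the output chain
            found.append((end - len(out.pattern) + 1, out.pattern))
            out = out.output
    return found
```

The output link of v only reads fields of its suffix-link target u, which is shallower than v and so was already processed by the BFS; that is why one pass suffices.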

The Net Complexity
Our preprocessing time is Θ(n) work to build the trie, O(n) work to fill in suffix links, and O(n) work to fill in output links.
Total preprocessing time: Θ(n).

The Final Totals
We now have a multi-string search data structure with time complexity ⟨O(n), O(m + z)⟩. In other words, this is exceptionally good in the case where there is a fixed set of patterns and a variable string to search.

Where We're Going
A powerful data structure called the suffix tree lets us solve this problem in ⟨O(m), O(n + z)⟩. In other words, it excels when there's a fixed string to search and a variable set of patterns.

More to Explore
There are a number of other approaches to solving this problem, and there's often a large gap between theory and practice!
The Boyer-Moore algorithm searches for a single pattern in a large text. It can actually run in sublinear time if the string searched for isn't present, but runs in quadratic time in the worst case if a match exists.
The Commentz-Walter algorithm generalizes Boyer-Moore for multiple strings and has similar time guarantees, but is faster in practice.
The Knuth-Morris-Pratt algorithm is a special case of the Aho-Corasick algorithm when there is just one pattern. You'll explore it on the upcoming problem set (after the TAs confirm it's not too difficult to derive it).

Next Time
Suffix Trees: A highly versatile, flexible, powerful data structure for string processing.
Patricia Tries: Shrinking down trie space usage.
Applications of RMQ: Getting some mileage out of Fischer-Heun.