Faster Regular Expression Matching. Philip Bille Mikkel Thorup

Outline
Definition
Applications
History tour of regular expression matching
Thompson's algorithm
Myers' algorithm
New algorithm
Results and extensions

Regular Expressions
A character α is a regular expression.
If S and T are regular expressions, then so is:
The union S|T
The concatenation ST (S·T)
The Kleene star S*

Languages
The language L(R) of a regular expression R is:
L(α) = {α}
L(S|T) = L(S) ∪ L(T)
L(ST) = L(S)L(T)
L(S*) = {ε} ∪ L(S) ∪ L(S)^2 ∪ L(S)^3 ∪ ...

An example
R = (a*)(b|c)
L(R) = {b, c, ab, ac, aab, aac, ...}
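
The example language can be checked by brute force. The following is a minimal sketch, assuming the example expression is (a*)(b|c) and using Python's `re` module as a stand-in matcher; the helper name `language_up_to` is hypothetical and not part of the slides.

```python
import re
from itertools import product

# Assumption: the slide's example is R = (a*)(b|c); Python's `re` syntax
# coincides with the slide notation for this particular expression.
R = re.compile(r"(a*)(b|c)")

def language_up_to(max_len, alphabet="abc"):
    """List every string of length <= max_len over `alphabet` that is in L(R)."""
    members = []
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            word = "".join(letters)
            if R.fullmatch(word):  # whole-string match, i.e. word in L(R)
                members.append(word)
    return members

print(language_up_to(3))  # ['b', 'c', 'ab', 'ac', 'aab', 'aac']
```

Enumeration is exponential in the string length, so this only serves to confirm the first few members of L(R).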

Regular Expression Matching
Given a regular expression R and a string Q, the regular expression matching problem is to decide if Q ∈ L(R).
How fast can we solve regular expression matching for |R| = m and |Q| = n?

Applications
Primitive in large-scale data processing: Internet traffic analysis, protein searching, XML queries.
Standard utilities and tools: grep and sed, Perl.

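Thompson's Algorithm 1968
[Figure: TNFA construction rules: (a) a character α; (b) concatenation of N(S) and N(T); (c) union of N(S) and N(T), joined by ε-transitions; (d) Kleene star of N(S), with ε-transitions.]
Construct a non-deterministic finite automaton (NFA) from R.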

Thompson's Algorithm 1968
[Figure: the Thompson NFA N(R) for the example expression R, with states 1-10.]
The Thompson NFA (TNFA) N(R) has O(|R|) = O(m) states and transitions.
N(R) accepts L(R): any path from the start state to the accept state corresponds to a string in L(R), and vice versa.
Traverse the TNFA on Q one character at a time.
O(m) per character => O(|Q|m) = O(nm) time algorithm.
Whether this bound can be improved was on the top-ten list of open problems in stringology in 1985 [Galil1985].
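
The O(nm) traversal can be sketched as follows. This is a minimal illustration, not the authors' code: the class name `TNFA` and its methods are hypothetical, the parser supports only single characters, `|`, `*`, juxtaposition and parentheses, and concatenation links the two sub-automata with an ε-transition instead of merging states.

```python
class TNFA:
    """Thompson-style NFA built by recursive descent (illustrative sketch)."""

    def __init__(self, pattern):
        self.eps = []    # eps[s]: states reachable from s by one ε-transition
        self.char = []   # char[s]: (symbol, target) or None
        self.pattern, self.pos = pattern, 0
        self.start, self.accept = self._union()
        if self.pos != len(pattern):
            raise ValueError("unexpected character at position %d" % self.pos)

    def _new(self):
        self.eps.append([])
        self.char.append(None)
        return len(self.eps) - 1

    def _peek(self):
        return self.pattern[self.pos] if self.pos < len(self.pattern) else None

    def _union(self):
        s, t = self._concat()
        while self._peek() == "|":
            self.pos += 1
            s2, t2 = self._concat()
            start, accept = self._new(), self._new()
            self.eps[start] += [s, s2]
            self.eps[t].append(accept)
            self.eps[t2].append(accept)
            s, t = start, accept
        return s, t

    def _concat(self):
        s, t = self._star()
        while self._peek() not in (None, "|", ")"):
            s2, t2 = self._star()
            self.eps[t].append(s2)  # ε-link instead of merging the two states
            t = t2
        return s, t

    def _star(self):
        s, t = self._atom()
        while self._peek() == "*":
            self.pos += 1
            start, accept = self._new(), self._new()
            self.eps[start] += [s, accept]
            self.eps[t] += [s, accept]
            s, t = start, accept
        return s, t

    def _atom(self):
        c = self._peek()
        if c is None or c in "|)*":
            raise ValueError("expected a character or '(' at position %d" % self.pos)
        self.pos += 1
        if c == "(":
            s, t = self._union()
            if self._peek() != ")":
                raise ValueError("missing ')'")
            self.pos += 1
            return s, t
        s, t = self._new(), self._new()
        self.char[s] = (c, t)
        return s, t

    def _closure(self, states):
        """Add everything reachable through ε-transitions."""
        stack, seen = list(states), set(states)
        while stack:
            for nxt in self.eps[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def matches(self, text):
        """State-set simulation: O(m) work per character of text."""
        current = self._closure({self.start})
        for ch in text:
            moved = {self.char[s][1] for s in current
                     if self.char[s] is not None and self.char[s][0] == ch}
            current = self._closure(moved)
        return self.accept in current


nfa = TNFA("(a*)(b|c)")
print(nfa.matches("aab"), nfa.matches("a"))  # True False
```

Each state has at most two outgoing ε-transitions and at most one character transition, so one state-set transition touches O(m) states and edges.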

Myers' Algorithm 1992: 1-D decomposition
[Figure: N(R) with states 1-10, decomposed into micro TNFAs A1, A2, A3.]
Decompose N(R) into a tree of O(m/x) micro TNFAs with at most x = Θ(log n) states.
State-sets and micro TNFAs are encoded in O(x) bits ([BFC2008]).
Tabulate state-set transitions for micro TNFAs. Table size: 2^O(x) = O(n^ε).
=> constant time for a micro TNFA state-set transition.
O(m/x) = O(m/log n) state-set transition algorithm for N(R).
O(nm/log n) algorithm.
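
The tabulation idea can be illustrated on a single small automaton. The sketch below is an assumption-laden toy, not Myers' algorithm: it hard-codes one 10-state TNFA for (a*)(b|c) (state numbering chosen here, not the slide's), encodes a state-set as a bitmask, and precomputes every state-set transition so that each input character costs one table lookup. The decomposition into a tree of micro TNFAs and the propagation of ε-transitions between them are not shown.

```python
# Hypothetical hard-coded TNFA for (a*)(b|c); numbering is illustrative only.
EPS = {1: [0, 3], 2: [0, 3], 3: [8], 5: [9], 7: [9], 8: [4, 6]}
CHAR = {0: ("a", 1), 4: ("b", 5), 6: ("c", 7)}
NUM_STATES, START, ACCEPT = 10, 2, 9
ALPHABET = "abc"

def closure(mask):
    """ε-closure of a state-set given as a bitmask (bit s set = state s active)."""
    stack = [s for s in range(NUM_STATES) if (mask >> s) & 1]
    while stack:
        for nxt in EPS.get(stack.pop(), []):
            if not (mask >> nxt) & 1:
                mask |= 1 << nxt
                stack.append(nxt)
    return mask

def step(mask, ch):
    """Direct (untabulated) state-set transition on character ch."""
    moved = 0
    for s, (symbol, target) in CHAR.items():
        if (mask >> s) & 1 and symbol == ch:
            moved |= 1 << target
    return closure(moved)

# Tabulate all 2^x state-sets for every character: 2^O(x) entries in total.
TABLE = {ch: [step(mask, ch) for mask in range(1 << NUM_STATES)] for ch in ALPHABET}

def matches(text):
    """Match using one table lookup per character (assumes text over ALPHABET)."""
    mask = closure(1 << START)
    for ch in text:
        mask = TABLE[ch][mask]
    return bool((mask >> ACCEPT) & 1)

print(matches("aab"), matches("a"))  # True False
```

With x = Θ(log n) states per micro TNFA, a table of this kind has 2^O(x) entries, which is where the O(n^ε) table size on the slide comes from.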

How can we improve Myers' algorithm?
[Figure: N(R) with states 1-10, decomposed into micro TNFAs A1, A2, A3.]
Fast algorithms [Myers1992, BFC2008, Bille2006] all aim to speed up the state-set transition for 1 character.
To read/write a state-set we need Ω(m/log n) time (we assume log n word length).
To improve we need to handle multiple characters quickly.
The main challenge is dependencies from ε-transitions.

New Algorithm: 2-D decomposition
[Figure: N(R) with states 1-10, decomposed into micro TNFAs A1, A2, A3.]
Decompose N(R) into O(m/x) micro TNFAs with at most x = Θ(log n) states, as in Myers' algorithm.
Partition Q into segments of length y = Θ(log^(1/2) n).
State-set transition on segments in O(m/x) time.
=> algorithm using O(nm/xy) = O(nm/log^1.5 n) time.
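
The stated bound follows from a short calculation, written out here with the slide's parameters x = Θ(log n) and y = Θ(log^(1/2) n); the underbrace labels are added for illustration only.

```latex
\[
\underbrace{\frac{n}{y}}_{\text{segments of } Q}
\cdot
\underbrace{O\!\left(\frac{m}{x}\right)}_{\text{time per segment}}
= O\!\left(\frac{nm}{xy}\right)
= O\!\left(\frac{nm}{\log n \cdot \log^{1/2} n}\right)
= O\!\left(\frac{nm}{\log^{1.5} n}\right).
\]
```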

Overview
Goal: Do a state-set transition on y = Θ(log^(1/2) n) characters in O(m/x) = O(m/log n) time.
Simplifying assumption: constant-size alphabet.
Algorithm: 4 traversals on the tree of micro TNFAs. Traversals 1-3 iteratively build information; traversal 4 computes the actual state-set transition.
Tabulation to do each traversal in constant time per micro TNFA => O(m/x) time algorithm.

Computing Accepted Substrings
[Figure: N(R) with states 1-10, decomposed into micro TNFAs A1, A2, A3; q = b.]
Goal: For micro TNFA A compute the substrings of q that are accepted by Ā.
We have A1 : {ε,,}, A2 : {b}, A3 : {b,b}.
Bottom-up traversal using tabulation in constant time per micro TNFA.
Encode the set of substrings in O(y^2) = O(log n) bits.
Table input: micro TNFA, substrings of children, q.
Table size 2^O(x + y^2 + y) = 2^O(x + y^2) = O(n^ε).
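
One way to see why O(y^2) bits suffice is to give each pair of positions (i, j) in the segment its own bit. The sketch below assumes a hypothetical segment q = "aab" and uses Python's `re` module in place of a micro TNFA Ā; the function names and the bit layout are illustrative choices, not taken from the slides.

```python
import re

def encode_accepted_substrings(q, pattern):
    """Bitmask with bit i*(y+1)+j set iff the substring q[i:j] matches pattern."""
    y = len(q)
    bits = 0
    for i in range(y + 1):
        for j in range(i, y + 1):
            if re.fullmatch(pattern, q[i:j]):
                bits |= 1 << (i * (y + 1) + j)
    return bits

def decode_substrings(q, bits):
    """Recover the set of accepted substrings (as strings) from the bitmask."""
    y = len(q)
    return {q[i:j]
            for i in range(y + 1)
            for j in range(i, y + 1)
            if (bits >> (i * (y + 1) + j)) & 1}

q = "aab"                                   # hypothetical segment of length y = 3
bits = encode_accepted_substrings(q, "a*")  # stand-in for one micro TNFA
print(sorted(decode_substrings(q, bits)))   # ['', 'a', 'aa']
print(bits.bit_length() <= (len(q) + 1) ** 2)  # True: at most (y+1)^2 bits
```

Because the encoding fits in O(y^2) = O(log n) bits, it can serve as part of a table index, which is what keeps the table size at 2^O(x + y^2).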

Computing Path Prefixes to Accepting States
[Figure: N(R) with states 1-10, decomposed into micro TNFAs A1, A2, A3; q = b, S = {1,3}.]
Goal: For micro TNFA A compute the prefixes of q matching a path from S to the accepting state in Ā.
We have A1 : {, }, A2 :, A3 : {b}.
Bottom-up traversal using tabulation in constant time per micro TNFA.
Encode prefixes in O(y) = O(log^(1/2) n) bits.
Table input: micro TNFA, substrings and path prefixes of children, q, state-set for A.
Table size 2^O(x + y^2) = O(n^ε).

Computing Path Prefixes to Start States
[Figure: N(R) with states 1-10, decomposed into micro TNFAs A1, A2, A3; q = b, S = {1,3}.]
Goal: For micro TNFA A compute the prefixes of q matching a path from S to the start state in N(R).
We have A1 : {}, A2 : {, }, A3 : {ε}.
Top-down traversal using tabulation in constant time per micro TNFA.
Tabulation: Similar to the previous traversal.

Updating State-Sets
[Figure: N(R) with states 1-10, decomposed into micro TNFAs A1, A2, A3; q = b, S = {1,3}.]
Goal: For micro TNFA A compute the next state-set.
We have A1 :, A2 : {7,10}, A3 : {10}.
Traversal using tabulation in constant time per micro TNFA.
Tabulation: Similar to the previous traversal.

Algorithm Summary
Tabulation in 2^O(x + y^2) = O(n^ε) time and space.
4 traversals, each using O(m/x) time, to process a length-y segment of Q.
=> algorithm using O(nm/xy) = O(nm/log^1.5 n) time and O(n^ε) space.
Hence, we have improved Myers' result by a factor of √(log n).
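
The table-size claim can be checked by choosing the hidden constants explicitly. The derivation below is a sketch under assumed notation: c is the constant hidden in the exponent O(x + y^2), δ is a tuning constant introduced here, and logarithms are base 2.

```latex
\[
x \le \delta \log n, \qquad y \le \sqrt{\delta \log n}
\;\Longrightarrow\;
2^{c\,(x + y^{2})} \le 2^{2c\delta \log n} = n^{2c\delta} = n^{\varepsilon}
\quad\text{for } \delta = \frac{\varepsilon}{2c}.
\]
```

Since δ is a constant, x = Θ(log n) and y = Θ(log^(1/2) n) still hold, so the time bound is unaffected by the choice of ε.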

Extensions
Unbounded alphabets cost an additional log log n factor in speed.
Unbounded alphabets for free if m n 1-ε.
I/O bounds: 1-D decomposition gives O(nm/B) I/Os; we get O(nm/(M^(1/2) B)) I/Os.
Other features: independent tabulation, streaming.