Regular Expressions and Finite-State Automata. L545 Spring 2008

Similar documents
Some useful tasks involving language. Finite-State Machines and Regular Languages. More useful tasks involving language. Regular expressions

Overview. Regular Expressions and Finite-State. Motivation. Regular expressions. RE syntax Additional functions. Regular languages Properties

Regular expressions and automata

UNIT-I. Strings, Alphabets, Language and Operations

Finite-state machines (FSMs)

Uses of finite automata

CS 154. Finite Automata vs Regular Expressions, Non-Regular Languages

Theory of Computation

CMSC 330: Organization of Programming Languages. Regular Expressions and Finite Automata

CMSC 330: Organization of Programming Languages

MA/CSSE 474 Theory of Computation

Finite State Automata Design

CS 121, Section 2. Week of September 16, 2013

Chapter 5. Finite Automata

CS:4330 Theory of Computation Spring Regular Languages. Finite Automata and Regular Expressions. Haniel Barbosa

Deterministic Finite Automata (DFAs)

What Is a Language? Grammars, Languages, and Machines. Strings: the Building Blocks of Languages

Nondeterministic Finite Automata and Regular Expressions

Deterministic Finite Automata (DFAs)

UNIT-III REGULAR LANGUAGES

Languages. A language is a set of strings. String: A sequence of letters. Examples: cat, dog, house, Defined over an alphabet:

cse303 ELEMENTS OF THE THEORY OF COMPUTATION Professor Anita Wasilewska

CS 154, Lecture 3: DFA NFA, Regular Expressions

UNIT-II. NONDETERMINISTIC FINITE AUTOMATA WITH ε TRANSITIONS: SIGNIFICANCE. Use of ε-transitions. s t a r t. ε r. e g u l a r

CSCI 340: Computational Models. Regular Expressions. Department of Computer Science

How do regular expressions work? CMSC 330: Organization of Programming Languages

Finite State Transducers

Closure under the Regular Operations

Finite Automata. Finite Automata

Theory of Computation

Automata: a short introduction

CSE 105 THEORY OF COMPUTATION

CS 154, Lecture 2: Finite Automata, Closure Properties Nondeterminism,

CS 133 : Automata Theory and Computability

Deterministic Finite Automata (DFAs)

CS 154. Finite Automata, Nondeterminism, Regular Expressions

Theory of computation: initial remarks (Chapter 11)

Deterministic Finite Automaton (DFA)

Automata Theory. Lecture on Discussion Course of CS120. Runzhe SJTU ACM CLASS

Introduction to Language Theory and Compilation: Exercises. Session 2: Regular expressions

Sri vidya college of engineering and technology

UNIT II REGULAR LANGUAGES

FORMAL LANGUAGES, AUTOMATA AND COMPUTABILITY

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007.

Finite-state Machines: Theory and Applications

Theory of Computation p.1/?? Theory of Computation p.2/?? Unknown: Implicitly a Boolean variable: true if a word is

Formal Languages. We ll use the English language as a running example.

Author: Vivek Kulkarni ( )

Part I: Definitions and Properties

Theory of computation: initial remarks (Chapter 11)

Computational Theory

GEETANJALI INSTITUTE OF TECHNICAL STUDIES, UDAIPUR I

Clarifications from last time. This Lecture. Last Lecture. CMSC 330: Organization of Programming Languages. Finite Automata.

CHAPTER 1 Regular Languages. Contents

TAFL 1 (ECS-403) Unit- III. 3.1 Definition of CFG (Context Free Grammar) and problems. 3.2 Derivation. 3.3 Ambiguity in Grammar

Author: Vivek Kulkarni ( )

Foundations of

Lexical Analysis. Reinhard Wilhelm, Sebastian Hack, Mooly Sagiv Saarland University, Tel Aviv University.

CSE 105 THEORY OF COMPUTATION

jflap demo Regular expressions Pumping lemma Turing Machines Sections 12.4 and 12.5 in the text

FORMAL LANGUAGES, AUTOMATA AND COMPUTABILITY

CSC173 Workshop: 13 Sept. Notes

Great Theoretical Ideas in Computer Science. Lecture 4: Deterministic Finite Automaton (DFA), Part 2

Finite Automata Part One

CISC 4090: Theory of Computation Chapter 1 Regular Languages. Section 1.1: Finite Automata. What is a computer? Finite automata

Automata & languages. A primer on the Theory of Computation. Laurent Vanbever. ETH Zürich (D-ITET) September,

Chapter 3. Regular grammars

CSE 105 THEORY OF COMPUTATION

TAFL 1 (ECS-403) Unit- II. 2.1 Regular Expression: The Operators of Regular Expressions: Building Regular Expressions

T (s, xa) = T (T (s, x), a). The language recognized by M, denoted L(M), is the set of strings accepted by M. That is,

NODIA AND COMPANY. GATE SOLVED PAPER Computer Science Engineering Theory of Computation. Copyright By NODIA & COMPANY

CS 455/555: Finite automata

Embedded systems specification and design

CS311 Computational Structures Regular Languages and Regular Expressions. Lecture 4. Andrew P. Black Andrew Tolmach

A Universal Turing Machine

FINITE STATE AUTOMATA

CMSC 330: Organization of Programming Languages. Theory of Regular Expressions Finite Automata

Finite Automata. BİL405 - Automata Theory and Formal Languages 1

Tasks of lexer. CISC 5920: Compiler Construction Chapter 2 Lexical Analysis. Tokens and lexemes. Buffering

Inf2A: Converting from NFAs to DFAs and Closure Properties

CS 154 Introduction to Automata and Complexity Theory

acs-04: Regular Languages Regular Languages Andreas Karwath & Malte Helmert Informatik Theorie II (A) WS2009/10

Lecture Notes On THEORY OF COMPUTATION MODULE -1 UNIT - 2

Finite Automata Theory and Formal Languages TMV027/DIT321 LP4 2018

Formal Definition of a Finite Automaton. August 26, 2013

What is this course about?

Nondeterministic Finite Automata

1. (a) Explain the procedure to convert Context Free Grammar to Push Down Automata.

Griffith University 3130CIT Theory of Computation (Based on slides by Harald Søndergaard of The University of Melbourne) Turing Machines 9-0

Kleene Algebras and Algebraic Path Problems

Deterministic Finite Automata (DFAs)

Computer Sciences Department

Finite State Automata

CS Automata, Computability and Formal Languages

Theory Bridge Exam Example Questions

COSE312: Compilers. Lecture 2 Lexical Analysis (1)

Formal Languages. We ll use the English language as a running example.

Text Search and Closure Properties

Automata & languages. A primer on the Theory of Computation. Laurent Vanbever. ETH Zürich (D-ITET) September,

Theory of Computation (I) Yijia Chen Fudan University

Transcription:

Regular Expressions and Finite-State Automata L545 Spring 2008

Overview Finite-state technology is: Fast and efficient Useful for a variety of language tasks Three main topics we ll discuss: Regular Expressions (REs) Finite-State Automata (FSAs) Properties of Regular Languages Well find that REs and FSAs are mathematically equivalent, but help us approach problems in different ways 2

Some useful tasks involving language Find all phone numbers in a text, e.g., occurrences such as When you call (614) 292-8833, you reach the fax machine. Find multiple adjacent occurrences of the same word in a text, as in I read the the book. Determine the language of the following utterance: French or Polish? Czy pasazer jadacy do Warszawy moze jechac przez Londyn? 3

More useful tasks involving language Look up the following words in a dictionary: laughs, became, unidentifiable, Thatcherization Determine the part-of-speech of words like the following, even if you can t find them in the dictionary: conurbation, cadence, disproportionality, lyricism, parlance Such tasks can be addressed using so-called finite-state machines. How can such machines be specified? 4

Regular expressions A regular expression is a description of a set of strings, i.e., a language. They can be used to search for occurrences of these strings A variety of unix tools (grep, sed), editors (emacs), and programming languages (perl, python) incorporate regular expressions. Just like any other formalism, regular expressions as such have no linguistic contents, but they can be used to refer to linguistic units. 5

The syntax of regular expressions (1) Regular expressions consist of strings of characters: c, A100, natural language, 30 years! disjunction: ordinary disjunction: devoured ate, famil(y ies) character classes: [Tt]he, bec[oa]me ranges: [A-Z] (a capital letter) negation:[ˆa] (any symbol but a) [ˆA-Z0-9] (not an uppercase letter or number) 6

The syntax of regular expressions (2) counters optionality:? colou?r any number of occurrences: * (Kleene star) [0-9]* years at least one occurrence: + [0-9]+ dollars wildcard for any character:. beg.n for any character in between beg and n Parentheses to group items together ant(farm)? Escaped characters to specify characters with special meanings: \*, \+, \?, \(, \), \, \[, \] 7

The syntax of regular expressions (3) Operator precedence, from highest to lowest: parentheses () counters * +? character sequences disjunction fire ing = fire or ing fir(e ing) = fir followed by either e or ing Note: The various unix tools and languages differ w.r.t. the exact syntax of the regular expressions they allow. 8

Additional functionality for some RE uses (1) Although not a part of our discussion about regular languages, some tools (e.g., Perl) allow for more functionality Anchors: anchor expressions to various parts of the string ˆ = start of line do not confuse with [ˆ...] used to express negation $ = end of line \b non-word character word characters are digits, underscores, or letters, i.e., [0-9A-Za-z ] 9

Additional functionality for some RE uses (2) Use aliases to designate particular recurrent sets of characters \d = [0-9]: digit \D = [ˆ\d]: non-digit \w = [a-za-z0-9 ]: alphanumeric \W = [ˆ\w]: non-alphanumeric \s = [\r\t\n\f]: whitespace character \r: space, \t: tab, \n: newline, \f: formfeed \S [ˆ\s]: non-whitespace 10

Some RE practice What does \$[0-9]+(\.[0-9][0-9]) signify? Write a RE to capture the times on a digital watch (hours and minutes). Think about: the (im)possible values for the hours the (im)possible values for the minutes 11

Formal language theory We will view any formal language as a set of strings The language uses a finite vocabulary Σ (called an alphabet), and a set of string-combining operations Regular languages are the simplest class of formal languages = class of languages definable by REs = class of languages characterizable by FSAs 12

Regular languages How can the class of regular languages which is specified by regular expressions be characterized? Let Σ be the set of all symbols of the language, the alphabet, then: 1. {} is a regular language 2. a Σ: {a} is a regular language 3. If L 1 and L 2 are regular languages, so are: (a) the concatenation of L 1 and L 2 : L 1 L 2 = {xy x L 1, y L 2 } (b) the union of L 1 and L 2 : L 1 L 2 (c) the Kleene closure of L: L = L 0 L 1 L 2... where L i is the language of all strings of length i. 13

Properties of regular languages (1) The regular languages are closed under (L 1 and L 2 regular languages): concatenation: L 1 L 2 set of strings with beginning in L 1 and continuation in L 2 Kleene closure: L 1 set of repeated concatenation of a string in L 1 union: L 1 L 2 set of strings in L 1 or in L 2 complementation: Σ L 1 set of all possible strings that are not in L 1 14

Properties of regular languages (2) The regular languages are closed under (L 1 and L 2 regular languages): difference: L 1 L 2 set of strings which are in L 1 but not in L 2 intersection: L 1 L 2 set of strings in both L 1 and L 2 reversal: L R 1 set of the reversal of all strings in L 1 15

What sorts of expressions aren t regular? In natural language, examples include center-embedding constructions. These dependencies are not regular: (1) a. The cat loves Mozart. b. The cat the dog chased loves Mozart. c. The cat the dog the rat bit chased loves Mozart. d. The cat the dog the rat the elephant admired bit chased loves Mozart. (2) (the noun) n (transitive-verb) n 1 loves Mozart Similar ones would be regular: (3) A*B* loves Mozart 16

Finite state machines Finite state machines (or automata) (FSM, FSA) recognize or generate regular languages, exactly those specified by regular expressions. Example: Regular expression: colou?r Finite state machine: 0 c 6 o 5 l 4 o 2 r u r 1 3 17

Accepting/Rejecting strings The behavior of an FSA is completely determined by its transition table. The assumption is that there is a tape, with the input symbols read off consecutive cells of the tape. The machine starts in the start (initial) state, about to read the contents of the first cell on the input tape. The FSA uses the transition table to decide where to go at each step A string is rejected in exactly two cases: 1. a transition on an input symbol takes you nowhere 2. the state you re in after processing the entire input is not an accept (final) state Otherwise. the string is accepted. 18

Defining finite state automata A finite state automaton is a quintuple (Q, Σ,E, S, F) with Q a finite set of states Σ a finite set of symbols, the alphabet S Q the set of start states F Q the set of final states E a set of edges Q (Σ {ǫ}) Q The transition function d can be defined as d(q,a) = {q Q (q,a,q ) E} 19

Example FSA FSA to recognize strings of the form: [ab]+ i.e., L = { a, b, ab, ba, aab, bab, aba, bba,... } FSA is defined as: Q = {0, 1} Σ = {a, b} S = {0} F = {1} E = {(0, a,1), (0, b,1), (1,a, 1),(1, b,1)} 20

FSA: set of zero or more a s L = { ǫ, a, aa, aaa, aaaa,... } Q = {0} Σ = {a} S = {0} F = {0} E = {(0, a,0)} 21

FSA: set of all lowercase alphabetic strings ending in b L captured by [a-z]*b = {b, ab, tb,..., aab, abb,... } Q = {0, 1} Σ = {a, b, c,...,z} S = {0} F = {1} E = {(0, a,0), (0, c,0),(0, d,0),...,(0,z,0) (0,b, 1),(1, b,1), (1,a, 0),(1, c,0),(1, d,0),...(1, z,0)} How would we change this to make it: \b[a-z]*b\b 22

FSA: the set of all strings in [ab]* with exactly 2 a s Do this yourself It might help to first rewrite a more precise regular expression for this First, be clear what the domain is (all string in [ab]*) And then figure out how to narrow it down 23

Language accepted by an FSA The extended set of edges Ê Q Σ Q is the smallest set such that (q,σ,q ) E : (q, σ,q ) Ê (q 0,σ 1, q 1 ), (q 1, σ 2, q 2 ) Ê : (q 0,σ 1 σ 2,q 2 ) Ê The language L(A) of a finite state automaton A is defined as L(A) = {w q s S, q f F,(q s, w,q f ) Ê} 24

FSA for simple NPs Where d is an alias for determiners, a for adjectives, and n for nouns: Q = {0, 1, 2} Σ = {d, a,n} S = {0} F = {2} E = {(0, d,1), (0,ǫ, 1)(1,a, 1),(1, n, 2),(2, n, 2)} 25

Finite state transition networks (FSTN) Finite state transition networks are graphical descriptions of finite state machines: nodes represent the states start states are marked with a short arrow final states are indicated by a double circle arcs represent the transitions 26

Example for a finite state transition network a S1 b S0 c S2 b Regular expression specifying the language generated or accepted by the corresponding FSM: ab cb+ b S3 27

Finite state transition tables Finite state transition tables are an alternative, textual way of describing finite state machines: the rows represent the states start states are marked with a dot after their name final states with a colon the columns represent the alphabet the fields in the table encode the transitions 28

The example specified as finite state transition table a b c d S0. S1 S2 S1 S3: S2 S2,S3: S3: 29

Some properties of finite state machines Recognition problem can be solved in linear time (independent of the size of the automaton). There is an algorithm to transform each automaton into a unique equivalent automaton with the least number of states. 30

Deterministic Finite State Automata A finite state automaton is deterministic iff it has no ǫ transitions and for each state and each symbol there is at most one applicable transition. Every non-deterministic automaton can be transformed into a deterministic one: Define new states representing a disjunction of old states for each non-determinacy which arises. Define arcs for these states corresponding to each transition which is defined in the non-deterministic automaton for one of the disjuncts in the new state names. 31

Example: Determinization of FSA c 2 3 d c a 7 e a c 1 e 4 5 6 b a e a d 2 c 3 4 {5,6} 5 e a {4,5} c a 7 c, a 3 6 1 {3,5} a b a 32

Why finite-state? Finite number of states Number of states bounded in advance determined by its transition table Therefore, the machine has a limit to the amount of memory it uses. Its behavior at each stage is based on the transition table, and depends just on the state its in, and the input. So, the current state reflects the history of the processing so far. Classes of formal languages which are not regular require additional memory to keep track of previous information, e.g., center-embedding constructions 33

From Automata to Transducers Needed: mechanism to keep track of path taken A finite state transducer is a 6-tuple (Q, Σ 1, Σ 2,E, S,F) with Q a finite set of states Σ 1 a finite set of symbols, the input alphabet Σ 2 a finite set of symbols, the output alphabet S Q the set of start states F Q the set of final states E a set of edges Q (Σ 1 {ǫ}) Q (Σ 2 {ǫ}) 34

Transducers and determinization A finite state transducer understood as consuming an input and producing an output cannot generally be determinized. Example: a:b 3 a:c a:b b:b c:c a:c 35

Summary Notations for characterizing regular languages: Regular expressions Finite state transition networks Finite state transition tables Finite state machines and regular languages: Definitions and some properties Finite state transducers 36