How do regular expressions work? CMSC 330: Organization of Programming Languages

Similar documents
CMSC 330: Organization of Programming Languages

CMSC 330: Organization of Programming Languages. Regular Expressions and Finite Automata

Clarifications from last time. This Lecture. Last Lecture. CMSC 330: Organization of Programming Languages. Finite Automata.

CMSC 330: Organization of Programming Languages. Theory of Regular Expressions Finite Automata

CMSC 330: Organization of Programming Languages

Sri vidya college of engineering and technology

Regular languages, regular expressions, & finite automata (intro) CS 350 Fall 2018 gilray.org/classes/fall2018/cs350/

Closure under the Regular Operations

Deterministic Finite Automaton (DFA)

Recap from Last Time

CSE 105 Homework 1 Due: Monday October 9, Instructions. should be on each page of the submission.

Theory of computation: initial remarks (Chapter 11)

CS 455/555: Finite automata

TAFL 1 (ECS-403) Unit- II. 2.1 Regular Expression: The Operators of Regular Expressions: Building Regular Expressions

Automata: a short introduction

CS21 Decidability and Tractability

COM364 Automata Theory Lecture Note 2 - Nondeterminism

What Is a Language? Grammars, Languages, and Machines. Strings: the Building Blocks of Languages

CISC 4090 Theory of Computation

Computer Sciences Department

UNIT II REGULAR LANGUAGES

1 Alphabets and Languages

Automata & languages. A primer on the Theory of Computation. Laurent Vanbever. ETH Zürich (D-ITET) September,

3515ICT: Theory of Computation. Regular languages

CS:4330 Theory of Computation Spring Regular Languages. Finite Automata and Regular Expressions. Haniel Barbosa

Author: Vivek Kulkarni ( )

CS21 Decidability and Tractability

Computational Theory

CS5371 Theory of Computation. Lecture 5: Automata Theory III (Non-regular Language, Pumping Lemma, Regular Expression)

Concatenation. The concatenation of two languages L 1 and L 2

CS 154. Finite Automata, Nondeterminism, Regular Expressions

CS 154, Lecture 3: DFA NFA, Regular Expressions

CPS 220 Theory of Computation REGULAR LANGUAGES

Nondeterministic Finite Automata

CS311 Computational Structures Regular Languages and Regular Expressions. Lecture 4. Andrew P. Black Andrew Tolmach

T (s, xa) = T (T (s, x), a). The language recognized by M, denoted L(M), is the set of strings accepted by M. That is,

Languages, regular languages, finite automata

GEETANJALI INSTITUTE OF TECHNICAL STUDIES, UDAIPUR I

Formal Languages. We ll use the English language as a running example.

Regular Expressions. Definitions Equivalence to Finite Automata

CS243, Logic and Computation Nondeterministic finite automata

Uses of finite automata

1. (a) Explain the procedure to convert Context Free Grammar to Push Down Automata.

CISC 4090: Theory of Computation Chapter 1 Regular Languages. Section 1.1: Finite Automata. What is a computer? Finite automata

CS 154, Lecture 2: Finite Automata, Closure Properties Nondeterminism,

Regular expressions and Kleene s theorem

Computational Models Lecture 2 1

UNIT-II. NONDETERMINISTIC FINITE AUTOMATA WITH ε TRANSITIONS: SIGNIFICANCE. Use of ε-transitions. s t a r t. ε r. e g u l a r

Computational Models Lecture 2 1

Finite Automata Theory and Formal Languages TMV026/TMV027/DIT321 Responsible: Ana Bove

Lecture 17: Language Recognition

UNIT-III REGULAR LANGUAGES

Formal Languages. We ll use the English language as a running example.

Definition: A grammar G = (V, T, P,S) is a context free grammar (cfg) if all productions in P have the form A x where

FABER Formal Languages, Automata. Lecture 2. Mälardalen University

Automata Theory and Formal Grammars: Lecture 1

Unit 6. Non Regular Languages The Pumping Lemma. Reading: Sipser, chapter 1

Regular expressions and Kleene s theorem

HKN CS/ECE 374 Midterm 1 Review. Nathan Bleier and Mahir Morshed

What we have done so far

Finite Automata and Regular Languages

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007.

Foundations of

Introduction to Theoretical Computer Science. Motivation. Automata = abstract computing devices

Course 4 Finite Automata/Finite State Machines

Lecture 3: Nondeterministic Finite Automata

Harvard CS 121 and CSCI E-207 Lecture 10: CFLs: PDAs, Closure Properties, and Non-CFLs

CS 154. Finite Automata vs Regular Expressions, Non-Regular Languages

Closure under the Regular Operations

Fooling Sets and. Lecture 5

CS 133 : Automata Theory and Computability

CMP 309: Automata Theory, Computability and Formal Languages. Adapted from the work of Andrej Bogdanov

cse303 ELEMENTS OF THE THEORY OF COMPUTATION Professor Anita Wasilewska

acs-04: Regular Languages Regular Languages Andreas Karwath & Malte Helmert Informatik Theorie II (A) WS2009/10

Regular Expressions and Language Properties

Compiler Design. Spring Lexical Analysis. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Theory of computation: initial remarks (Chapter 11)

CSE 311 Lecture 23: Finite State Machines. Emina Torlak and Kevin Zatloukal

FORMAL LANGUAGES, AUTOMATA AND COMPUTABILITY

CS 121, Section 2. Week of September 16, 2013

UNIT-I. Strings, Alphabets, Language and Operations

September 11, Second Part of Regular Expressions Equivalence with Finite Aut

PS2 - Comments. University of Virginia - cs3102: Theory of Computation Spring 2010

Automata & languages. A primer on the Theory of Computation. Laurent Vanbever. ETH Zürich (D-ITET) September,

Context Free Languages. Automata Theory and Formal Grammars: Lecture 6. Languages That Are Not Regular. Non-Regular Languages

Critical CS Questions

Theory of Computation

Decision, Computation and Language

Computational Models - Lecture 3

Text Search and Closure Properties

Intro to Theory of Computation

Theory of Computation (I) Yijia Chen Fudan University

Introduction to Language Theory and Compilation: Exercises. Session 2: Regular expressions

Text Search and Closure Properties

Chapter Five: Nondeterministic Finite Automata

CSC236 Week 10. Larry Zhang

Chapter 6. Properties of Regular Languages

The Pumping Lemma and Closure Properties

Compiler Design. Spring Lexical Analysis. Sample Exercises and Solutions. Prof. Pedro Diniz

Outline. Nondetermistic Finite Automata. Transition diagrams. A finite automaton is a 5-tuple (Q, Σ,δ,q 0,F)

Transcription:

How do regular expressions work? CMSC 330: Organization of Programming Languages Regular Expressions and Finite Automata What we ve learned What regular expressions are What they can express, and cannot Programming with them What s next: how they work Mixture of a very practical tool (string matching with REs) and some nice theory A great computer science result CMSC 330 1 CMSC 330 2 A Few Questions About REs How are REs implemented? Implementing a one-off RE is not so hard Ø How to do it in general? We ll see how to build a structure to parse REs What are the basic components of REs? Can implement some features in terms of others Ø E.g., we saw that e+ is the same as ee* What does a regular expression represent? Just a set of strings Ø This observation provides insight on how we go about our implementation Chapter 1-3 CMSC 330 4 1

Definition: Alphabet An alphabet is a finite set of symbols Usually denoted Σ Example alphabets: Binary: Σ = {0,1} Decimal: Σ = {0,1,2,3,4,5,6,7,8,9} Alphanumeric: Σ = {0-9,a-z,A-Z} CMSC 330 5 Definition: String A string is a finite sequence of symbols from Σ ε is the empty string ("" in Ruby) s is the length of string s Ø Hello = 5, ε = 0 Note Ø Ø is the empty set (with 0 elements) Ø Ø { ε } ε Example strings: 0101 Σ = {0,1} (binary) 0101 Σ = decimal 0101 Σ = alphanumeric CMSC 330 6 Definition: String concatenation String concatenation is indicated by juxtaposition If s 1 = super and s 2 = hero, then s 1s 2 = superhero Sometimes also written s 1 s 2 For any string s, we have sε = εs = s ou can concatenate strings from different alphabets; then the new alphabet is the union of the originals: Ø If s 1 = super Σ 1 = {s,u,p,e,r} and s 2 = hero Σ 2 = {h,e,r,o}, then s 1s 2 = superhero Σ 3 = {e,h,o,p,r,s,u} Definition: Language A language L is a set of strings over an alphabet Example: The set of phone numbers over the alphabet Σ = {0, 1, 2, 3, 4, 5, 6, 7, 9, (, ), -} Give an example element of this language Are all strings over the alphabet in the language? No Is there a Ruby regular expression for this language? /\(\d{3,3}\) \d{3,3}-\d{4,4}/ Example: The set of all strings over Σ Often written Σ* (123) 456-7890 CMSC 330 7 CMSC 330 8 2

Definition: Language (cont.) Example: The set of strings of length 0 over the alphabet Σ = {a, b, c} L = { s s Σ* and s = 0} = {ε} Ø Example: The set of all valid Ruby programs Is there a Ruby regular expression for this language? No. Matching (an arbitrary number of) brackets so that they are balanced is impossible using REs { { { } } } REs represent languages, but not all of them The languages represented by regular expressions are called, appropriately, the regular languages Definition: Regular Expressions Given an alphabet Σ, the regular expressions over Σ are defined inductively as regular expression Ø ε each element σ Σ denotes language Ø {ε} {σ} Constants CMSC 330 9 CMSC 330 10 Definition: Regular Expressions (cont.) Let A and B be regular expressions denoting languages L A and L B, respectively regular expression AB (A B) denotes language L AL B A* L A* L A L B Operations There are no other regular expressions over Σ Operations on Languages Let Σ be an alphabet and let L, L 1, L 2 be languages over Σ Concatenation L 1 L 2 is defined as L 1 L 2 = { xy x L 1 and y L 2 } Union is defined as L 1 L 2 = { x x L 1 or x L 2 } Kleene closure is defined as L* = { x x = ε or x L or x LL or x LLL or } CMSC 330 11 CMSC 330 12 3

Regular Expressions Denote Languages By applying operations on constants Generates a set of strings (i.e., a language) Examples Ø a { a } Ø a b { a } { b } = { a, b } Ø a* {ε} { a } { aa } = {ε, a, aa, } If s language generated by a RE r, we say that r accepts, describes, or recognizes string s Precedence Order in which operators are applied In arithmetic Ø Multiplication > addition + Ø 2 3 + 4 = (2 3) + 4 = 10 In regular expressions Ø Kleene closure * > concatenation > union Ø ab c = ( a b ) c = { ab, c } Ø ab* = a ( b* ) = { a, ab, abb } Ø a b* = a ( b* ) = { a,, b, bb, bbb } Can change order using parentheses ( ) Ø E.g., a(b c), (ab)*, (a b)* CMSC 330 13 CMSC 330 14 Regular Languages The languages that can be described using regular expressions are the regular languages or regular sets Not all languages are regular Examples (without proof): Ø The set of palindromes over Σ reads the same backward or forward Ø {a n b n n > 0 } (a n = sequence of n a s) Almost all programming languages are not regular But aspects of them sometimes are (e.g., identifiers) Regular expressions are commonly used in parsing tools CMSC 330 15 Ruby Regular Expressions Almost all of the features we ve seen for Ruby REs can be reduced to this formal definition /Ruby/ concatenation of single-character REs /(Ruby Regular)/ union /(Ruby)*/ Kleene closure /(Ruby)+/ same as (Ruby)(Ruby)* /(Ruby)?/ same as (ε (Ruby)) (// is ε) /[a-z]/ same as (a b c... z) / [^0-9]/ same as (a b c...) for a,b,c,... Σ - {0..9} ^, $ correspond to extra characters in alphabet CMSC 330 16 4

Implementing Regular Expressions We can implement a regular expression by turning it into a finite automaton A machine for recognizing a regular language Finite Automata Start state Transition on 1 Final state es No CMSC 330 17 States Machine starts in start or initial state Repeat until the end of the string is reached Scan the next symbol s of the string Take transition edge labeled with s String is accepted if automaton is in final state when end of string reached CMSC 330 18 Finite Automata: States Finite Automaton: Example 1 Start state State with incoming transition from no other state Can have only 1 start state Final states States with double circle Can have 0 or more final states S1 S2 Any state, including the start state, can be final 0 0 1 0 1 1 accepted CMSC 330 19 CMSC 330 20 5

Finite Automaton: Example 2 What Language is This? 0 0 1 0 1 0 not accepted All strings over {0, 1} that end in 1 What is a regular expression for this language? (0 1)*1 CMSC 330 21 CMSC 330 22 Finite Automaton: Example 3 string aabcc (a,b,c notation shorthand for three self loops) CMSC 330 23 acc bbc aabbb aa ε acba state at end S2 S2 S2 S1 S0 S0 S3 accepts? N Finite Automaton: Example 3 (cont.) What language does this FA accept? a*b*c* S3 is a dead state a nonfinal state with no transition to another state CMSC 330 24 6

Dead State: Shorthand Notation Finite Automaton: Example 4 If a transition is omitted, assume it goes to a dead state that is not shown is short for Language? Strings over {0,1,2,3} with alternating even and odd digits, beginning with odd digit CMSC 330 25 a*b*c* again, so DFAs are not unique CMSC 330 26 Finite Automaton: Example 5 Language? (a b)*abb Description for each state S0 = Haven t seen anything yet OR seen zero or more b s OR Last symbol seen was a b S1 = Last symbol seen was an a S2 = Last two symbols seen were ab S3 = Last three symbols seen were abb The Questions (for next time) Are FAs equivalent to regular expressions? Every FA can be translated into an RE Every RE can be translated into an FA es! How can we generate an FA for a given RE? If we can do this, we can implement RE matching How can we optimize an FA? Many FAs can implement the same language Some might be more efficient than others CMSC 330 28 7

Practice Give the English descriptions and the DFA or regular expression of the following languages: ((0 1)(0 1)(0 1)(0 1)(0 1))* All strings with length a multiple of 5 (01)* (10)* (01)*0 (10)*1 All alternating binary strings All binary strings containing the substring 11 Practice Give the regular expressions and finite automata for the following languages ou and your neighbors names All protein-coding DNA strings (including only ATCG and appearing in multiples of 3) All binary strings containing an even length substring of all 1 s All binary strings containing exactly two 1 s All binary strings that start and end with the same number CMSC 330 29 CMSC 330 30 Review Languages Sets of strings Operations on languages Regular expressions Constants Operators Precedence Finite automata States Transitions Accept strings CMSC 330 31 8