EDUCATIONAL PEARL. Proof-directed debugging revisited for a first-order version

Similar documents
CS:4330 Theory of Computation Spring Regular Languages. Finite Automata and Regular Expressions. Haniel Barbosa

Sri vidya college of engineering and technology

arxiv: v1 [cs.ds] 9 Apr 2018

Theory of Computation

Formal Languages. We ll use the English language as a running example.

Tasks of lexer. CISC 5920: Compiler Construction Chapter 2 Lexical Analysis. Tokens and lexemes. Buffering

Lecture Notes on Inductive Definitions

Peter Wood. Department of Computer Science and Information Systems Birkbeck, University of London Automata and Formal Languages

Theory of computation: initial remarks (Chapter 11)

Supplementary Notes on Inductive Definitions

Automata Theory and Formal Grammars: Lecture 1

Introduction to Automata

CSC173 Workshop: 13 Sept. Notes

Theory of computation: initial remarks (Chapter 11)

Regular Languages. Problem Characterize those Languages recognized by Finite Automata.

CS 455/555: Finite automata

CS243, Logic and Computation Nondeterministic finite automata

Theory of Computation

Lecture Notes on Inductive Definitions

Lecture 2: Connecting the Three Models

Parikh s theorem. Håkan Lindqvist

Johns Hopkins Math Tournament Proof Round: Automata

HKN CS/ECE 374 Midterm 1 Review. Nathan Bleier and Mahir Morshed

CS 121, Section 2. Week of September 16, 2013

What we have done so far

Proofs, Strings, and Finite Automata. CS154 Chris Pollett Feb 5, 2007.

CS 154. Finite Automata, Nondeterminism, Regular Expressions

September 11, Second Part of Regular Expressions Equivalence with Finite Aut

1 Alphabets and Languages

1 More finite deterministic automata

Theory of Computation

NOTES ON AUTOMATA. Date: April 29,

Introduction to the Theory of Computation. Automata 1VO + 1PS. Lecturer: Dr. Ana Sokolova.

Models of Computation. by Costas Busch, LSU

CPSC 421: Tutorial #1

PS2 - Comments. University of Virginia - cs3102: Theory of Computation Spring 2010

Kleene Algebras and Algebraic Path Problems

Theory of Computation p.1/?? Theory of Computation p.2/?? Unknown: Implicitly a Boolean variable: true if a word is

Non-deterministic Finite Automata (NFAs)

Introduction to the Theory of Computation. Automata 1VO + 1PS. Lecturer: Dr. Ana Sokolova.

COMP4141 Theory of Computation

Automata and Formal Languages - CM0081 Finite Automata and Regular Expressions

E.g. The set RE of regular expressions is given by the following rules:

Automata Theory. Lecture on Discussion Course of CS120. Runzhe SJTU ACM CLASS

Chaos, Complexity, and Inference (36-462)

State Complexity of Two Combined Operations: Catenation-Union and Catenation-Intersection

cse303 ELEMENTS OF THE THEORY OF COMPUTATION Professor Anita Wasilewska

Before We Start. The Pumping Lemma. Languages. Context Free Languages. Plan for today. Now our picture looks like. Any questions?

Mathematical Preliminaries. Sipser pages 1-28

Comment: The induction is always on some parameter, and the basis case is always an integer or set of integers.

Closure under the Regular Operations

Lecture 23 : Nondeterministic Finite Automata DRAFT Connection between Regular Expressions and Finite Automata

Finite Automata and Formal Languages TMV026/DIT321 LP4 2012

Einführung in die Computerlinguistik

Regular Expressions [1] Regular Expressions. Regular expressions can be seen as a system of notations for denoting ɛ-nfa

CS256/Spring 2008 Lecture #11 Zohar Manna. Beyond Temporal Logics

FORMAL LANGUAGES, AUTOMATA AND COMPUTABILITY

Equivalence of Regular Expressions and FSMs

CS 275 Automata and Formal Language Theory

Nondeterministic finite automata

Any Wizard of Oz fans? Discrete Math Basics. Outline. Sets. Set Operations. Sets. Dorothy: How does one get to the Emerald City?

3515ICT: Theory of Computation. Regular languages

Final exam study sheet for CS3719 Turing machines and decidability.

On the Average Complexity of Brzozowski s Algorithm for Deterministic Automata with a Small Number of Final States

Advanced Automata Theory 7 Automatic Functions

T (s, xa) = T (T (s, x), a). The language recognized by M, denoted L(M), is the set of strings accepted by M. That is,

This lecture covers Chapter 5 of HMU: Context-free Grammars

Formal Languages. We ll use the English language as a running example.

From EBNF to PEG. Roman R. Redziejowski

Takeaway Notes: Finite State Automata

(Refer Slide Time: 0:21)

Midterm Exam 2 CS 341: Foundations of Computer Science II Fall 2018, face-to-face day section Prof. Marvin K. Nakayama

September 7, Formal Definition of a Nondeterministic Finite Automaton

Concatenation. The concatenation of two languages L 1 and L 2

Languages. A language is a set of strings. String: A sequence of letters. Examples: cat, dog, house, Defined over an alphabet:

Chapter 0 Introduction. Fourth Academic Year/ Elective Course Electrical Engineering Department College of Engineering University of Salahaddin

Introduction to Finite-State Automata

Before we show how languages can be proven not regular, first, how would we show a language is regular?

Learning k-edge Deterministic Finite Automata in the Framework of Active Learning

FORMAL LANGUAGES, AUTOMATA AND COMPUTATION

Formal Models in NLP

Introduction to Formal Languages, Automata and Computability p.1/42

Finite Automata and Formal Languages

UNIT-III REGULAR LANGUAGES

Foundations of Informatics: a Bridging Course

Structural Induction

CS 154, Lecture 3: DFA NFA, Regular Expressions

Finite Automata Theory and Formal Languages TMV027/DIT321 LP4 2018

Partitions and Covers

ECS 120: Theory of Computation UC Davis Phillip Rogaway February 16, Midterm Exam

Recap DFA,NFA, DTM. Slides by Prof. Debasis Mitra, FIT.

Examples of Regular Expressions. Finite Automata vs. Regular Expressions. Example of Using flex. Application

Closure Properties of Regular Languages. Union, Intersection, Difference, Concatenation, Kleene Closure, Reversal, Homomorphism, Inverse Homomorphism

Computability Theory

Pushdown Automata: Introduction (2)

CS481F01 Solutions 8

Finite Universes. L is a fixed-length language if it has length n for some

CSE 105 Homework 1 Due: Monday October 9, Instructions. should be on each page of the submission.

UNIT-II. NONDETERMINISTIC FINITE AUTOMATA WITH ε TRANSITIONS: SIGNIFICANCE. Use of ε-transitions. s t a r t. ε r. e g u l a r

Theory of Computation p.1/?? Theory of Computation p.2/?? We develop examples of languages that are decidable

Transcription:

JFP 16 (6): 663 670, 2006. c 2006 Cambridge University Press doi:10.1017/s0956796806006149 First published online 14 September 2006 Printed in the United Kingdom 663 EDUCATIONAL PEARL Proof-directed debugging revisited for a first-order version KWANGKEUN YI School of Computer Science and Engineering, Seoul National University, Seoul, Korea (e-mail: kwang@ropas.snu.ac.kr) Abstract Some 10 years ago, Harper illustrated the powerful method of proof-directed debugging for developing programs with an article in this journal. Unfortunately, his example uses both higher-order functions and continuation-passing style, which is too difficult for students in an introductory programming course. In this pearl, we present a first-order version of Harper s example and demonstrate that it is easy to transform the final version into an efficient state machine. Our new version convinces students that the approach is useful, even essential, in developing both correct and efficient programs. 1 Introduction Harper (Harper, 1999) demonstrated the power of inductive reasoning in developing correct programs. To illustrate the principle, he used the example of regularexpression matching. Unfortunately, his example used functions in higher-order continuation-passing style (CPS), making it inaccessible to beginning students. If we wish to teach the principles and usefulness of inductive reasoning to a wide audience early in the curriculum, we must develop a simple version of the example. Ideally, the new version should rely only on first-order functions. Even more importantly, we must also address the efficiency of the functions. For beginning students, the gap between Harper s example with higher-order functions and the conventional regular expression matcher based on finite-state machines (Aho et al., 1986) is just too large. In this education pearl, we show how to overcome both problems. The resulting functions are first-order and the development uses nothing but inductive reasoning. The students in our first-year undergraduate programming course can readily follow the program and its development via proof-directed debugging. Furthermore, we also show how to reformulate the resulting functions as finite-state machines whose time complexity is linear in the input string size. In short, our new version can

664 K. Yi convince students that the idea is useful, even essential, in developing both correct and efficient programs. 1 The rest of the pearl proceeds as follows. Section 2 introduces the necessary background, section 3 the problem statement and a first solution. Following Harper, the correctness proof fails and exposes errors. We fix the errors with a program refinement that repeatedly applies inductive reasoning to the inputs. Finally, section 4 shows that it is straightforward to transform the proven first-order program into a finite-state machine that matches the string input in time linear in the string size. 2 Background Let Σ be an alphabet, that is, a finite set of letters. Weusecto denote a letter. Σ is the set of finite strings over the alphabet Σ. We use s to denote a string. The null string is written ɛ. ThesetΣ is inductively defined as s ɛ c s (c Σ) String concatenation of s and s is written s s. The empty string is the identity element for the concatenation operator, that is, ɛ s = s = s ɛ. A language L is a subset of Σ.Thesize s of string s is defined as ɛ =0and c s =1+ s. We use the following operations on languages: LL = {s s s L,s L } L 0 = {ɛ} L i+1 = LL i L = i 0 L i Regular expressions are notation for languages. The set of regular expressions is inductively defined as r ɛ c rr r +r r Each regular expression r denotes language L(r) inductively as follows: L(ɛ) = {ɛ} L(c) = {c} L(rr ) = L(r)L(r ) L(r + r ) = L(r) L(r ) L(r ) = L(r) We use L(R) also for a set R of regular expressions to denote r R L(r). L( ) is defined as. Thesize r of regular expression r is defined as: ɛ = c =1, rr = r + r = r + r +1,and r = r +1. 1 This pearl is not about teaching the transformation (Wand, 1980) of CPS into accumulator-style programs. Our point is to illustrate how to directly arrive at conventional first-order programs by means of only inductive reasoning.

Proof-directed debugging revisited for a first-order version 665 3 A regular expression matching algorithm 3.1 Problem and specification The regular expression matching problem is, for a string s Σ and a regular expression r, to determine whether s L(r). The inductive specification for r!s, which is true iff string s matches regular expression r, consists of five cases, following the definition of the regular-expression grammar: ɛ!s = s = ɛ c!s = s = c r 1 + r 2!s = r 1!s r 2!s (1) r 1 r 2!s = s 1 s 2 : s = s 1 s 2 r 1!s 1 r 2!s 2 (2) r!s = s = ɛ ( s 1 s 2 : s = s 1 s 2 r!s 1 r!s 2 ) (3) The problem is to find s s substrings s 1 and s 2 that satisfy the conditions for the last two cases. 3.2 Inductive refinement Our first step toward the implementation of (2) and (3) uses an inductive analysis of the string argument s ɛ c s (for c Σ): r!ɛ = True r!c s = r r!s for some r r c = False {r r!s r r c } (4) where L(r c )={s c s L(r)}. That is, r c denotes the set of regular expressions for the strings in r whose leading letter c has been removed. Analyzing r 1 r 2!s proceeds along the same lines: r 1 r 2!ɛ = r 1!ɛ r 2!ɛ (5) r 1 r 2!c s = r 1r 2!s for some r 1 r 1 c = False {r 1r 2!s r 1 r 1 c } (6) The definition of the function r c again follows the definition of the regularexpression grammar: ɛ c = c c = (c c ) c c = {ɛ} r 1 +r 2 c = r 1 c r 2 c (7) r 1 r 2 c = {r 1r 2 r 1 r 1 c } (8) r c = {r r r r c } (9) Are the definitions of r!s and r c correct? The natural way to approach the proofs is to proceed by induction. We first prove and debug the definition of r c and then that of r!s.

666 K. Yi 3.2.1 Inductive proof for r c First, the termination of r c is clear, because arguments to the recursive calls follow a well-founded order: every regular expression argument to recursive callees is smaller than that of the caller. From a finite regular expression there is no infinitely decreasing chain. Our proof proceeds by induction on the same order, that is on the size r of r. In proving the correctness of r c, the inductive hypothesis is that r c is correct for every r such that r < r. Our proof goal is to show that our definition of r c satisfy the specification: L(r c )={s c s L(r)}. Base cases: ɛ c and c c when c c must be because L(ɛ) orl(c ) has no string starting with c. Casec c must be {ɛ} because L(c) ={c}. Inductive cases: Byr 1 + r 2 c s definition (Eq. (7)), L(r 1 + r 2 c )isequaltol(r 1 c ) L(r 2 c ). It follows from the hypothesis (applied to r 1 and r 2 ) that L(r 1 c ) L(r 2 c ) = {s c s L(r 1 )} {s c s L(r 2 )} = {s c s L(r 1 + r 2 )}. Applying the r c s definition (Eq. (9)), L(r c )isl(r c )L(r ). By inductive hypothesis to r yields: L(r c )L(r ) = {s c s L(r)}L(r ) = {s c s L(r)L(r )} = {s c s L(r )}. What about r 1 r 2 c? By its definition (Eq. (8)), L(r 1 r 2 c )isl(r 1 c )L(r 2 ). By inductive hypothesis applied to r 1,itis{s c s L(r 1 )}L(r 2 ). Is this set equal to {s c s L(r 1 )L(r 2 )} so that we can conclude our proof? Unfortunately it is not; for example, in the case L(r 1 )={ɛ} and c s L(r 2 ), the latter set is nonempty, while the former set is empty. What to do? Following Harper (Harper, 1999), we may leave the definition as it is and assume a pre-processed input r, such that for every r 1 r 2 occurring in r, ɛ L(r 1 ). Another way to fix the problem is to inductively refine the definition one step further. Because the proof failure naturally suggests the consideration of the cases for r 1, we refine the definition of r 1 r 2 c by the inductive sub-cases for r 1 : cr 2 c = {r 2 } c r 2 c = (c c ) ɛr 2 c = r 2 c (10) (r 11 r 12 )r 2 c = r 11 (r 12 r 2 ) c (11) (r 11 + r 12 )r 2 c = r 11 r 2 c r 12 r 2 c (12) r 1r 2 c = r 2 c {r r 1r 2 r r 1 c } (13)

Proof-directed debugging revisited for a first-order version 667 The cases where L(r 1 ) can have ɛ are handled either by case analysis or by recursive calls. Now a new problem arises with Eq. (11) because we can no longer induct solely on the size of r. We have to find a different well-founded order for the recursive calls. Because, for the recursive call r 11 (r 12 r 2 ) c from (r 11 r 12 )r 2 c, the regular expression s left-hand side (Left(rr ) = let r) is decreasing, the arguments to recursive calls follow the order (r,c) > (r,c) iff r > r (for Eq. (7),(9),(10),(12),(13)) or ( r = r Left(r) > Left(r ) ) (for Eq. (11)) The order is well-founded; there is no infinitely decreasing chain from finite (r,c). Hence, our correctness proof proceeds by induction on the new order: the induction hypothesis in proving r c is that r c is correct for every (r,c) < (r,c). The proof proceeds smoothly, following the same pattern of reasoning as before. 3.2.2 Inductive proof for r!s First, the termination of r!s is clear, because the arguments to recursive calls follow a well-founded order: (r,s) > (r,s ) iff s > s (for Eq. (4),(6)) or ( s = s r > r ) (for Eq. (1),(5)) There is no infinitely decreasing chain from finite (r,s). Our proof proceeds by induction on this order. In proving r!s, the induction hypothesis is that r!s is correct for every (r,s ) < (r,s). Our proof goal is to show two things: r!s = True = s L(r) and r!s = False = s L(r). The base cases are trivial. Consider the inductive cases. If r!c s returns true, r r c : r r!s = True (by Eq. (4)). By inductive hypothesis applied to (r r,s), the assertion is equal to r r c : s L(r r )=True. Thus, c s c L(r r ) c L(r )L(r ) L(r)L(r ) since r r c L(r ). If r!c s is false, then by its definition (Eq. (4)), r c = or r r c : r r!s = False. If r c =, then by its correctness, = L(r c )={s c s L(r)}, i.e., c s L(r) hence c s L(r ). If r r c : r r!s = False, then by inductive hypothesis applied to s, r r c : s L(r r ). This means by the correctness of r c that s L(r c )L(r ). Hence c s {c}l(r c )L(r ). This means that c s has no common prefix with L(r) s strings with leading c. Thatis,c s L(r)L(r ). Thus c s L(r ).

668 K. Yi What about r 1 r 2!c s (Eq. (6))? When it returns true, the proof proceeds by a similar pattern of reasoning without a problem. When it returns false, it means that, by its definition, r 1 c = or r 1 r 1 c : r 1 r 2!s = False. Consider the case r 1 c =, which means that c s L(r 1 ). Can we conclude, from this, that c s L(r 1 r 2 )? No, because if ɛ L(r 1 )andc s L(r 2 ), it is possible that c s L(r 1 r 2 ). We also fix this problem with a refinement of the inductive definition with one more step, instead of transforming the given regular expression into its (equivalent) standard form. We refine the definition of r 1 r 2!c s by analyzing the five inductive sub-cases for r 1 : ɛr 2!c s = r 2!c s (14) cr 2!c s = r 2!s (15) (r 11 r 12 )r 2!c s = r 11 (r 12 r 2 )!c s (16) (r 11 + r 12 )r 2!c s = r 11 r 2!c s r 12 r 2!c s (17) r 1r 2!c s = r 2!c s {r (r 1r 2 )!s r r 1 c } (18) The termination is easy to see, because arguments to recursive calls follow the well-founded order : (r,s) > (r,s ) iff s > s (for Eq. (4),(15),(18)) or ( s = s r > r ) (for Eq. (1),(5),(14),(17),(18)) or ( s = s r = r Left(r) > Left(r ) ) (for Eq. (16)) Our correctness proof proceeds by induction on the order. The induction hypothesis in proving r!s is that r!s is correct for every (r,s ) < (r,s). The induction proof proceeds without a problem, following the same pattern of reasoning as before. Figure 1 displays the complete and provably correct definition of our pattern matching program. 4 Performance improvement by automaton construction Note that the worst-case number of recursive calls during r!s is exponential. Its recursive calls span a tree, because it sometimes invokes two or more recursive calls. The depth of the call tree is the length of the decreasing chain from the initial input (r,s). By the definition of the order (r,s ) < (r,s) of recursive call arguments, the length of the chain is proportional to the sum of s and a quantity proportional to r. Hence the number of recursive calls (i.e., nodes in the call tree) is exponential to this length. From the proven code, how can we achieve an efficient version like the automatonbased matcher (Aho et al., 1986)? By using the r c function. Given r, we can easily build a finite state machine that decides on s r in time linear in s. The automaton s states are regular expressions, one expression per state, denoting that the machine at state r expects strings in L(r). The machine s starting state is thus the input regular expression r, and its accepting states are those regular expressions whose languages contain ɛ. From each state r, we compute r c for each c Σ and draw an edge

Proof-directed debugging revisited for a first-order version 669 ɛ!s = s = ɛ c!s = s = c r 1 + r 2!s = r 1!s r 2!s r!ɛ = True r!c s = False {r r!s r r c } r 1 r 2!ɛ = r 1!ɛ r 2!ɛ ɛr 2!c s = r 2!c s cr 2!c s = r 2!s (r 11 r 12 )r 2!c s = r 11 (r 12 r 2 )!c s (r 11 + r 12 )r 2!c s = r 1 r 2!c s = r 11 r 2!c s r 12 r 2!c s r 2!c s {r (r 1 r 2)!s r r 1 c } ɛ c = c c = (c c ) c c = {ɛ} r 1 + r 2 c = r 1 c r 2 c r c = {r r r r c } cr 2 c = {r 2 } c r 2 c = (c c ) ɛr 2 c = r 2 c (r 11 r 12 )r 2 c = r 11 (r 12 r 2 ) c (r 11 + r 12 )r 2 c = r 11 r 2 c r 12 r 2 c r 1 r 2 c = r 2 c {r r 1 r 2 r r 1 c } Fig. 1. Correct implementation of r!s and r c. labeled c to r whenever r r c. The rationale behind this edge construction is obvious from the definition of r c : each regular expression in r c expects c-prefixed strings of L(r), but with the c being removed. We apply this procedure until no longer possible, transitively closing the states and edges. (This construction in a more general setting that covers the intersection and the complement operators is also presented in (Brzozowski, 1964) and later in (Berry & Sethi, 1986) with an efficiency improvement.) This automaton construction is finite. There are only finitely many states (regular expressions). A formal proof in a more general setting is in (Brzozowski, 1964), yet from our definition it is easy to see this finiteness. By r c s definition, the generated regular expression is always either ɛ, a sub-part of r, or an expanded one from r. But, because the expansion occurs only when r has a leading r 1 and the expansion is to prefix the input r 1 r 2 by r r 1 c, the number of the transitive expansions is bounded by the number of nested stars in r 1. The newly expanded one r r 1 r 2 cannot keep having a leading forever, because the prefix r is from r 1,onewith the outermost star of r 1 removed. For example, for regular expression a ab with alphabet {a, b}, the following states and edges are constructed (Figure 2): a ab with a goes to ɛa ab and b, because a ab a = ab a {ra ab r a a } = {b, ɛa ab}.

670 K. Yi a ɛa ab a start a ab a a b ɛ b Fig. 2. Non-deterministic automaton constructed from a ab by the daggar (r c ) operator. ɛa ab with a goes to b and itself, because b with b goes to ɛ, because b b = {ɛ}. ɛa ab a = a ab a = {b, ɛa ab}. A non-deterministic automaton can always be transformed into a deterministic one by the standard subset construction (Aho et al., 1986). 5 Conclusion Our educational pearl reformulates Harper s example of proof-directed debugging so that it becomes accessible to first-year students: We introduce a first-order version of the regular-expression matcher. We use inductive reasoning throughout program development: from the sketch of the algorithms to their correct completion. That is, refining the algorithms by repeatedly applying inductive analysis to a function s input structure fixes the errors. We are left with a small gap between theory and practice because it is easy to derive an efficient finite-state machine from the proven code. Based on our experience, this close connection to practice helps convince students of the practicality of the approach. Acknowledgements We thank especially Matthias Felleisen and anonymous referees for their very kind and detailed comments on our submission, and also Bob McKay, who gave helpful suggestionsonanearlierdraft. References Aho, A. V., Sethi, R. & Ullman, J. D. (1986) Compilers: Principles, techniques, and tools. Addison-Wesley. Berry, G. & Sethi, R. (1986) From regular expressions to deterministic automata. Theor. Comput. Sci. 48(1), 117 126. Brzozowski, J. A. (1964) Derivatives of regular expressions. J. ACM, 11(4), 481 494. Harper, R. (1999) Proof-directed debugging. J. Funct. Program. 9(4), 463 469. Wand, M. (1980) Continuation-based program transformation strategies. J. ACM, 27(1), 164 180.