Noncanonical LALR(1) Parsing


Sylvain Schmitz
Laboratoire I3S, Université de Nice - Sophia Antipolis & CNRS, France
schmitz@i3s.unice.fr

Abstract

This paper addresses the longstanding problem of the recognition limitations of classical LALR(1) parser generators by proposing the usage of noncanonical parsers. To this end, we present a definition of noncanonical LALR(1) parsers, NLALR(1). The class of grammars accepted by NLALR(1) parsers is a proper superclass of the NSLR(1) and LALR(1) grammar classes. Among the recognized languages are some nondeterministic languages. The proposed parsers retain many of the qualities of canonical LALR(1) parsers: they are deterministic, easy to construct, and run in linear time. We argue that they could provide the basis for a range of powerful noncanonical parsers.

Key words: Noncanonical parser, deterministic parser, LALR, two-stack automaton

ACM categories: D.3.1 [Programming Languages]: Formal Definitions and Theory - Syntax; D.3.4 [Programming Languages]: Processors - Parsing; F.4.2 [Mathematical Logic and Formal Languages]: Grammars and Other Rewriting Systems - Parsing

1 Introduction

Testimonies abound on the shortcomings of classical LALR(1) parser generators like YACC [9]. The problem lies in the large expressivity gap between what can be specified using the context-free grammar they are fed with, and what can actually be parsed by the LALR(1) automaton they produce. Transforming a grammar until its LALR(1) parser becomes deterministic is arduous, and can obfuscate the attached semantics; moreover, some languages are simply not deterministic.

The expressivity gap vanishes when general parsers [6, 15] are preferred. Such a choice is however made at the expense of the detection of ambiguities. While this might seem acceptable for well established languages, for which the scrutiny of many implementors has pinpointed all ambiguous constructs, there always remains a risk of runtime problems if an unexpected ambiguity appears. The avoidance of such problems is clearly a desirable guarantee, thus motivating our option of restricting to some subclass of the unambiguous grammars.
Also in shorter form in Zhe Dang and Oscar H. Ibarra, editors, DLT'06, volume 4036 of Lecture Notes in Computer Science, pages 95–107. © Springer, 2006. doi: 10.1007/11779148_10

This paper advocates an almost forgotten way of diminishing the expressivity gap: the usage of noncanonical parsers. We apply it to LALR(1) parsing by means of a generic construction. Therefore, we also allow immediate application to other LR-based parsing methods.

Noncanonical parsers have been thoroughly investigated on a theoretical level [12]. Surprisingly, there are very few practical noncanonical parsing methods, and their formal study remains largely unexplored. Indeed, the only one of clear practical interest is an extension to SLR(1) parsing [13]. Noncanonical parsers are however a powerful means of reducing the expressivity gap, while still rejecting any ambiguous syntax. In this they can be compared to LALR(k) parsers with k > 1 [3], or, to a larger extent, to parsers allowing unbounded regular lookaheads [4, 2, 7]. Like the latter, noncanonical parsers can recognize nondeterministic languages. The classes of grammars accepted by both methods are incomparable in general, but the class of languages accepted by noncanonical parsers is strictly wider than the one accepted by regular lookahead parsers [12].

And there is a winning argument in favor of noncanonical parsers: they can also increase the size of their lookahead window, possibly to an unbounded length [8]. This point motivates our study of noncanonical LALR(1) parsers, since NSLR(1) parsers are unfit for such extensions: their lookahead computation is not contextual. Also in contrast with NSLR(1), our definitions rely on a prefix equivalence relation: we use the LR(0) equivalence so that the resulting parsers are LALR(1), but finer equivalences could just as easily be used.

Our specific choice of LALR(1) parsers can be explained by their wide adoption, their practical relevance, and the existence of efficient and broadly used algorithms for their generation [5]. We express our computations in the same framework and obtain a simple and efficient practical construction.
The additional complexity of generating a NLALR(1) parser instead of a LALR(1) or NSLR(1) one, as well as the increase of the parser size and the overhead on parsing performance, are all quite small. Therefore, the improved parsing power comes at a fairly reasonable price.

The paper is organized as follows: Section 2 briefly introduces noncanonical parsing; Section 3 recalls the formal details of the canonical LALR(1) definition, which will be extended for its noncanonical counterpart in Section 4. We refer the interested reader to a separate research report [10] for a complete study, including grammar classes comparisons, alternative definitions for noncanonical LALR-based parsers, a concrete example of application, and omitted proofs.

Notation

The basic terminology, definitions, and notational conventions used in this paper are classical [1, 11]. Our context-free grammars are reduced and augmented to G′ = ⟨N′, T′, P′, S′⟩ = ⟨N ∪ {S′}, T ∪ {$}, P ∪ {S′ → S$}, S′⟩. As usual, A, B, C, ... denote nonterminals in N′; a, b, c, ... denote terminals in T′; u, v, w, ... denote strings in T′*; X, Y, Z denote symbols in V′; α, β, γ, ... denote strings in V′*; ε is the empty string or empty sequence; k : α is the prefix of length k of string α. Rightmost derivations are denoted by ⇒rm, whereas leftmost derivations are denoted by ⇒lm.

[Figure 1: The conflict position in state q1 for G1. (a) LALR(1) automaton; (b) Derivation trees.]

2 Noncanonical Parsing

A bottom-up parser reverses the derivation steps which led to the terminal string it parses. For most bottom-up parsers, including LALR ones, these derivations are rightmost, and therefore the reduced phrase is the leftmost one, called the handle of the sentential form. Noncanonical parsers allow the reduction of phrases which may not be handles [1]. A noncanonical parser is able to suspend a reduction decision where its canonical counterpart would not be deterministic, explore the remaining input, perform some reductions, return to the conflict point and use nonterminals resulting from the reduction of a possibly unbounded amount of input in its lookahead window to infer its parsing decisions.

2.1 Parsing Example

Consider for instance grammar G1 with rules

S → BC | AD, A → a, B → a, C → CA | A, D → aD | b,

generating the language L(G1) = aa⁺ | aa*b. The state q1 in the automaton of Figure 1 is inadequate: the parser is unable to decide between reductions A → a and B → a when the lookahead is a. We see on the derivation trees of Figure 1b that, in order to choose between the two reductions, the parser has to know if there is a b at the very end of the input. This need for an unbounded lookahead makes G1 non-LR. A parser using a regular lookahead would solve the conflict by associating the distinct regular lookaheads a*b and a⁺$ with the reductions to A and B respectively. However, we notice that a single lookahead symbol (D or C) is enough: if the parser is able to explore the context on the right of the conflict, and to reduce some other phrases, then it will reduce this context to D or C. When coming back to the conflict point, it will see D or C in the lookahead window.

Table 1: The parse of the string aaa by the NLALR(1) parser for G1.

parsing stack | input stack | actions
q0 | aaa$ | shift
q0 q1 | aa$ | shift
  The inadequate state q1 is reached with lookahead a. The decision of reducing to A or B can be restated as the decision of reducing the right context to D or C. In order to perform the latter decision, we shift a and reach state s1 where we now expect a*b and a*$. We are pretty much in the same situation as before: s1 is also inadequate. But we know that in front of b or $ a decision can be made:
q0 q1 s1 | a$ | shift
  There is a new conflict between the reduction A → a and the shift of a to a position D → a·D. We also shift this a. The expected right contexts are still a*b and a*$, so the shift brings us again to s1:
q0 q1 s1 s1 | $ | reduce using A → a
  The decision is made in front of $. We reduce the a represented by s1 on top of the parsing stack, and push the reduced symbol A on top of the input stack:
q0 q1 s1 | A$ | reduce using A → a
  Using this new lookahead, the parser is able to decide another reduction to A:
q0 q1 | AA$ | reduce using B → a
  We are now back in state q1. Clearly, there is no need to wait until we see a completely reduced symbol C in the lookahead window: A is already a symbol specific to the reduction to B:
q0 | BAA$ | shift
q0 q3 | AA$ | shift
q0 q3 q7 | A$ | reduce using C → A
q0 q3 | CA$ | shift
q0 q3 q6 | A$ | shift
q0 q3 q6 q11 | $ | reduce using C → CA
q0 q3 | C$ | shift
q0 q3 q6 | $ | reduce using S → BC
q0 | S$ | shift, and then accept

Table 1 presents a noncanonical parse for a string in L(G1). The noncanonical machine is not very different from the canonical one, except that it uses two stacks. The additional stack, the input stack, contains the (possibly reduced) right context, whereas the other stack is the classical parsing stack. Reductions push the reduced nonterminal on top of the input stack.
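
The two-stack behaviour traced in Table 1 can be replayed with a short script. The Python sketch below is only an illustration under stated assumptions: the tables `SHIFT` and `REDUCE` and the function `parse` are hypothetical names, and the tables are transcribed by hand from the states shown in Figure 1 and Figure 2 rather than produced by a generator. States that are adequate in the LR(0) sense reduce on any lookahead.

```python
# Shift and reduce tables for G1, transcribed by hand (an assumption of this sketch).
# SHIFT[state][symbol] -> next state; terminals and nonterminals are shifted alike.
SHIFT = {
    "q0": {"a": "q1", "S": "q2", "B": "q3", "A": "q4"},
    "q1": {"a": "s1"},                       # noncanonical transition to s1 = {q5, q8}
    "s1": {"a": "s1", "b": "q9", "D": "q12"},
    "q3": {"C": "q6", "A": "q7", "a": "q5"},
    "q4": {"D": "q10", "a": "q8", "b": "q9"},
    "q6": {"A": "q11", "a": "q5"},
    "q8": {"D": "q12", "a": "q8", "b": "q9"},
}
# REDUCE[state] -> (lhs, rhs, lookaheads); None means "any lookahead".
REDUCE = {
    "q1": [("A", "a", {"D", "b"}), ("B", "a", {"C", "A"})],
    "s1": [("A", "a", {"A", "$"})],
    "q5": [("A", "a", None)],
    "q6": [("S", "BC", {"$"})],
    "q7": [("C", "A", None)],
    "q9": [("D", "b", None)],
    "q10": [("S", "AD", None)],
    "q11": [("C", "CA", None)],
    "q12": [("D", "aD", None)],
}

def parse(word):
    """Return the list of parsing steps, or raise ValueError on rejection."""
    stack, inp, steps = ["q0"], list(word) + ["$"], []
    # Stop once S has been shifted and only the end marker remains.
    while not (stack == ["q0", "q2"] and inp == ["$"]):
        state, look = stack[-1], inp[0]
        candidates = [(lhs, rhs) for lhs, rhs, las in REDUCE.get(state, [])
                      if las is None or look in las]
        if look in SHIFT.get(state, {}):
            stack.append(SHIFT[state][look])
            inp.pop(0)
            steps.append("shift " + look)
        elif len(candidates) == 1:
            lhs, rhs = candidates[0]
            del stack[-len(rhs):]            # pop one state per right-part symbol
            inp.insert(0, lhs)               # push the nonterminal on the input stack
            steps.append("reduce " + lhs + " -> " + rhs)
        else:
            raise ValueError("no action in %s on %s" % (state, look))
    return steps

for step in parse("aaa"):
    print(step)
```

Note that there is no separate goto table: a reduced nonterminal is read back from the input stack like any other symbol, which is why nonterminals appear as keys in `SHIFT`.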
There is no goto operation per se: the nonterminal on top of the input stack either allows a parsing decision which had been delayed, or is simply shifted. We will now see how to transform and extend the canonical LALR(1) parser of Figure 1 to perform these parsing steps.

2.2 Construction Principles

The LALR(1) construction relies heavily on the LR(0) automaton. This automaton provides a nice explanation for LALR lookahead sets: the symbols in the lookahead set for some reduction are the symbols expected next by the LR(0) parser, should it really perform this reduction.

[Figure 2: State q1 extended for noncanonical parsing. State q1 holds A → a· with lookaheads {D, b} and B → a· with lookaheads {C, A}, and has a transition on a to s1 = {q5, q8}. State s1 holds A → a· with lookaheads {A, $} and the items D → a·D, D → ·aD and D → ·b, with transitions on a to s1, on b to q9 and on D to q12.]

Let us compute the lookahead set for the reduction A → a in state q1. Should the LR(0) parser decide to reduce A → a, it would pop q1 from the parsing stack (thus be in state q0), and then push q4. We read directly on Figure 1 that three symbols are acceptable in q4: D, a and b. Similarly, the reduction B → a in q1 has {C, A, a} for lookahead set, read directly from state q3.

The intersection of the lookahead sets for the reductions in q1 is not empty: a appears in both, which means a conflict. Luckily enough, a is not a totally reduced symbol: D and C are reduced symbols, read from kernel items in q4 and q3. The conflicting lookahead symbol a could be reduced, and later we might see a symbol on which we can make a decision instead. Thus, we shift the lookahead symbol a in order to reduce it and solve the conflict later. All the other symbols in the computed lookaheads allow a decision to be made, so we leave them in the lookahead sets, but we remove a from both sets.

Shifting a puts us in the same situation we would have been in if we had followed the transitions on a from both q3 and q4, since the noncanonical generation simulates both reductions in q1. We create a noncanonical transition from q1 on a to a noncanonical state s1 = {q5, q8}, which will behave as the union of states q5 and q8. State s1 will thus allow a reduction using A → a inherited from q5, and the shifts of a, b and D inherited from q8. We therefore need to compute the lookaheads for reduction using A → a in q5. Using again the LR(0) simulation technique, we see on Figure 1 that this reduction would lead us to either q7 or to q11. In both cases, the LR(0) automaton would perform a reduction to C that would lead next to q6. At this point, the LR(0) automaton expects either the end of file symbol $, should a reduction to S occur, or an A or an a.
The complete lookahead set for the reduction A → a in q5 is thus {A, a, $}.

The new state s1 is also inadequate: with an a in the lookahead window, we cannot choose between the shift of a and the reduction A → a. As before, we create a new transition on a from s1 to a noncanonical state s1′ = {q5, q8}. State q5 is the state accessed on a from q6. State q8 is the state accessed from q8 if we simulate a shift of symbol a. State s1′ is the same as state s1, and we merge them. The noncanonical computation is now finished. Figure 2 sums up how state q1 has been transformed and extended. Note that we just use the set {q5, q8} in a noncanonical LALR(1) automaton; items represented in Figure 2 are only there to ease understanding.

3 LALR(1) Parsers

LALR parsers were introduced as practical parsers for deterministic languages. Rather than building an exponential number of LR(k) states, LALR(k) parsers add lookahead sets to the actions of the small LR(0) parser. We briefly recall some important definitions and results on LR(0) and LALR(1) parsers.

Valid Items and Prefixes

A dotted production A → α·β of G is a valid LR(0) item for string γ in V′* if

\[ S' \Rightarrow^{*}_{rm} \delta A z \Rightarrow_{rm} \delta\alpha\beta z = \gamma\beta z. \tag{1} \]

If such a derivation holds in G, then γ in V′* is a valid prefix. The set of valid items for a given string γ in V′* is denoted by Valid(γ). Two strings δ and γ are equivalent if and only if they have the same valid items. The valid item sets are obtained through the following computations:

\[ \mathrm{Kernel}(\varepsilon) = \{ S' \to {\cdot}S\$ \}, \tag{2} \]
\[ \mathrm{Kernel}(\gamma X) = \{ A \to \alpha X{\cdot}\beta \mid A \to \alpha{\cdot}X\beta \in \mathrm{Valid}(\gamma) \}, \tag{3} \]
\[ \mathrm{Valid}(\gamma) = \mathrm{Kernel}(\gamma) \cup \{ B \to {\cdot}\omega \mid A \to \alpha{\cdot}B\beta \in \mathrm{Valid}(\gamma) \}. \tag{4} \]

LR(0) States

LR automata are pushdown automata that use equivalence classes on valid prefixes as their stack alphabet Q. We therefore denote explicitly states of a LR parser as q = [δ], where δ is some valid prefix in q, the state reached upon reading this prefix. For instance, in the automaton of Figure 1, state q2 is the equivalence class {S}, while state q8 is the equivalence class described by the regular language Aaa*. A pair ([δ], X) in Q × V′ is a transition if and only if δX is a valid prefix. If this is the case, then [δX] is the state accessed upon reading δX, thus the notation [δX] also implies¹ a transition from [δ] on X, and [δα] a path on α.

LALR(1) Automata

The LALR(1) lookahead set of a reduction using A → α in state q is

\[ \mathrm{LA}(q, A \to \alpha) = \{ 1 : z \mid S' \Rightarrow^{*}_{rm} \delta A z \text{ and } q = [\delta\alpha] \}. \tag{5} \]

4 NLALR(1) Parsers

There are a number of differences between the LALR(1) and NLALR(1) definitions. The most visible one is that we accept nonterminals in our lookahead sets. We also want to know which lookahead symbols are totally reduced. Finally, we are adding new states, which are sets of LR(0) states. Therefore, the objects in most of our computations will be LR(0) states.

4.1 Valid Covers

We have recalled in the previous section that LR(0) states can be viewed as collections of valid prefixes. A similar definition for NLALR(1) states would be nice. However, due to the suspended parsing actions, the language of all prefixes accepted by a noncanonical parser is no longer a regular language.
This means the parser will only have a regular approximation of the exact parsing stack language. The noncanonical states, being sets of LR(0) states (i.e., sets of equivalence classes on valid prefixes), provide this approximation. We therefore define valid covers as valid prefixes covering the parsing stack language.

¹ We always assume when writing [δX] that Valid(δX) is not the empty set.

Definition 1 String γ is a valid cover in G for string δ if and only if γ is a valid prefix and γ ⇒* δ. We write δ̂ to denote some cover of δ and Cover(L) to denote the set of all valid covers for the set of strings L.

Remember for instance configuration q0 q1 ‖ aa$ from Table 1. This configuration leads to pushing state s1 = {q5, q8}, where both valid prefixes (B | BC)a and Aa*a of q5 and q8 are valid covers for the actual parsing stack prefix aa. Thus in s1 we cover the parsing stack prefix by (B | BC | Aa*)a.

4.2 Noncanonical Lookaheads

Noncanonical lookaheads are symbols in V′. Adapting the computation of the LALR(1) lookahead sets is simple, but a few points deserve some explanations. First of all, noncanonical lookahead symbols have to be non null, i.e. X is non null if X ⇒* ax. Indeed, null symbols do not provide any additional right context information; worse, they can hide it. If we consider that we always perform a reduction at the earliest parsing stage possible, then they will never appear in a lookahead window.

Totally Reduced Lookaheads

Totally reduced lookaheads form a subset of the noncanonical lookahead set such that none of its elements can be further reduced. A conflict with a totally reduced symbol as lookahead of a reduction cannot be solved by a noncanonical exploration of the right context, since there is no hope of ever reducing it any further. We define here totally reduced lookaheads as non null symbols which can follow the right part of the offending rule in a leftmost derivation.

Definition 2 The set of totally reduced lookaheads for a reduction A → α in LR(0) state q is defined by

\[ \mathrm{RLA}(q, A \to \alpha) = \{ X \mid S' \Rightarrow^{*}_{lm} zA\gamma X\omega,\ \gamma \Rightarrow^{*} \varepsilon,\ X \Rightarrow^{*} ax, \text{ and } q = [\hat{z}\alpha] \}. \]

Derived Lookaheads

The derived lookahead symbols are simply defined by extending (5) to the set of all non null symbols in V′.

Definition 3 The set of derived lookaheads for a reduction A → α in LR(0) state q is defined by

\[ \mathrm{DLA}(q, A \to \alpha) = \{ X \mid S' \Rightarrow^{*} \delta AX\omega,\ X \Rightarrow^{*} ax, \text{ and } q = [\hat{\delta}\alpha] \}. \]

We obviously have that

\[ \mathrm{LA}(q, A \to \alpha) = \mathrm{DLA}(q, A \to \alpha) \cap T'. \tag{6} \]

Conflicting Lookahead Symbols

Last, we need to compute which lookahead symbols would make the state inadequate.
A noncanonical exploration of the right context is required for these symbols. They appear in the derived lookahead sets of several reductions and/or are transition labels. However, the totally reduced lookaheads of a reduction are not part of this lookahead set, for if they are involved in a conflict, then there is no hope of being able to solve it.

Definition 4 The conflicts lookahead set for a reduction using A → α in set s of LR(0) states is defined as

\[ \mathrm{CLA}(s, A \to \alpha) = \{ X \in \mathrm{DLA}(q, A \to \alpha) \mid q \in s,\ X \notin \mathrm{RLA}(q, A \to \alpha),\ (q, X) \text{ or } (\exists p \in s,\ \exists B \to \beta \neq A \to \alpha \in P,\ X \in \mathrm{DLA}(p, B \to \beta)) \}. \]

We then define the noncanonical lookahead set for a reduction using A → α in set s of LR(0) states as

\[ \mathrm{NLA}(s, A \to \alpha) = \Big( \bigcup_{q \in s} \mathrm{DLA}(q, A \to \alpha) \Big) \setminus \mathrm{CLA}(s, A \to \alpha). \]

We illustrate these definitions by computing the lookahead sets for the reduction using A → a in state s1 = {q5, q8} as in Section 2.2: RLA(q5, A → a) = {A, $}, DLA(q5, A → a) = {A, a, $}, CLA(s1, A → a) = {a} and NLA(s1, A → a) = {A, $}.

4.3 Noncanonical States

We said at the beginning of this section that states in the NLALR(1) automaton were in fact sets of LR(0) states. We denote by ⟦δ⟧ the noncanonical state accessed upon reading string δ in V′*.

Definition 5 Noncanonical state ⟦δ⟧ is the set of LR(0) states defined by

\[ [\![\varepsilon]\!] = \{ [\varepsilon] \} \quad\text{and}\quad [\![\delta X]\!] = \{ [\hat{\gamma}AX] \mid X \in \mathrm{CLA}([\![\delta]\!], A \to \alpha),\ [\hat{\gamma}\alpha] \in [\![\delta]\!] \} \cup \{ [\varphi X] \mid [\varphi] \in [\![\delta]\!] \}. \]

Noncanonical transition from ⟦δ⟧ to ⟦δX⟧ on symbol X, denoted by (⟦δ⟧, X), exists if and only if ⟦δX⟧ ≠ ∅. Reduction (⟦δ⟧, A → α) exists if and only if there exists a reduction (q, A → α) and q is in ⟦δ⟧.

Note that these definitions remain valid for plain LALR(1) states since, in absence of a conflict, a noncanonical state is a singleton set containing the corresponding LR(0) state.

A simple induction on the length of δ shows that the LR(0) states considered in the noncanonical state ⟦δ⟧ provide a valid cover for any accessing string of the noncanonical state. It basically means that the actions decided in a given noncanonical state make sense at least for a cover of the real sentential form prefix that is read. The approximations done when covering the actual sentential form prefix are made on top of the previous approximations: with each new conflict, we need to find a new set of LR(0) states covering the parsing stack contents. This stacking is made obvious in the above definition when we write γ̂AX. It means that NLALR(1) parsers are not prefix valid, but prefix cover valid.

Throughout this paper, we use the LR(0) automaton to approximate the prefix read so far.
We could use more powerful methods but it would not really be in the spirit of LALR parsing any longer; see [10] for alternative methods.

4.4 NLALR(1) Automata

Here we formalize noncanonical LALR(1) parsing machines. They are a special case of two-stack pushdown automata (2PDA). As said before, the additional stack serves as an input for the parser, and reductions push the reduced nonterminal on top of this stack. This behavior of reductions excepted, the definition of a NLALR(1) automaton is similar to the LALR(1) one.

Definition 6 Let M = (Q ∪ V′ ∪ {$, ‖}, R) be a rewriting system. A configuration of M is a string of the form

\[ [\![\varepsilon]\!]\,[\![X_1]\!] \cdots [\![X_1 \cdots X_n]\!] \parallel \omega\$ \]

where X1...Xn and ω are strings in V′*. We say that M is a NLALR(1) automaton if its initial configuration is ⟦ε⟧ ‖ w$ with w the input string in T*, its final configuration is ⟦ε⟧⟦S⟧ ‖ $, and if each rewriting rule in R is of the form

- shift X in state ⟦δ⟧, defined if there is a transition (⟦δ⟧, X):
\[ [\![\delta]\!] \parallel X \ \vdash_{\text{shift}}\ [\![\delta]\!]\,[\![\delta X]\!] \parallel, \]
or
- reduce by rule A → X1...Xn of P in state ⟦δX1...Xn⟧ with lookahead X, defined if A → X1...Xn is a reduction in ⟦δX1...Xn⟧ and lookahead X is in NLA(⟦δX1...Xn⟧, A → X1...Xn):
\[ [\![\delta X_1]\!] \cdots [\![\delta X_1 \cdots X_n]\!] \parallel X \ \vdash_{A \to X_1 \cdots X_n}\ \parallel AX. \]

The following rules illustrate Definition 6 on state s1 of the NLALR(1) automaton for G1: s1 ‖ a ⊢shift s1 s1 ‖, s1 ‖ b ⊢shift s1 {q9} ‖, s1 ‖ D ⊢shift s1 {q12} ‖, s1 ‖ A ⊢A→a ‖ AA and s1 ‖ $ ⊢A→a ‖ A$.

According to Definition 6, NLALR(1) automata are able to backtrack by a limited amount, corresponding to the length of their window, at reduction time only. We know that noncanonical parsers using a bounded lookahead window operate in linear time [12]; the following theorem precisely shows that the total number of rules involved in the parsing of an input string is linear with respect to the number of reductions performed, which itself is linear in the input string length. This theorem uses an output effect τ which outputs the rules used for each reduction performed by M; we then call (M, τ) a NLALR(1) parser.

Theorem 1 Let G be a grammar and (M, τ) its NLALR(1) parser. If π is a parse of w in M, then the number of parsing steps |π| is related to the number |τ(π)| of derivations producing w in G and to the length |w| of w by |π| = 2|τ(π)| + |w|.
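
As a quick sanity check of Theorem 1, consider the parse of w = aaa shown in Table 1. This is only a worked instance, assuming that the final "shift, and then accept" row counts as a single step (the shift of S). The parse performs six reductions (A → a twice, B → a, C → A, C → CA and S → BC), so

\[ |\pi| = 2\,|\tau(\pi)| + |w| = 2 \cdot 6 + 3 = 15, \]

which matches the 15 rows of Table 1: 3 shifts of terminals, 6 reductions, and 6 shifts of the nonterminals that those reductions pushed on the input stack. Informally, each reduction costs two steps because the nonterminal it produces must later be shifted exactly once.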
Since all the conflict lookahead symbols are removed from the noncanonical lookahead sets NLA, the only possibility for the noncanonical automaton to be nondeterministic would be to have a totally reduced symbol causing a conflict. A context-free grammar G is NLALR(1) if its NLALR(1) automaton is deterministic, and thus if no totally reduced symbol can cause a conflict.

4.5 Computing the Lookaheads and Covers

The LALR(1) lookahead sets that are defined in Equation (5) can be expressed using the following definitions [5], where lookback is a relation between reductions and nonterminal LR(0) transitions, includes and reads are relations between nonterminal LR(0) transitions, and DR, standing for directly reads, is a function from nonterminal LR(0) transitions to sets of lookahead symbols.

\[ ([\delta\alpha], A \to \alpha)\ \mathit{lookback}\ ([\delta], A), \tag{7} \]
\[ ([\delta\beta], A)\ \mathit{includes}\ ([\delta], B) \text{ iff } B \to \beta A\gamma \text{ and } \gamma \Rightarrow^{*} \varepsilon, \tag{8} \]
\[ ([\delta], A)\ \mathit{reads}\ ([\delta A], C) \text{ iff } ([\delta A], C) \text{ and } C \Rightarrow^{*} \varepsilon, \tag{9} \]
\[ \mathrm{DR}([\delta], A) = \{ a \mid ([\delta A], a) \}. \tag{10} \]

Using the above definitions, we can rewrite Equation (5) as

\[ \mathrm{LA}(q, A \to \alpha) = \bigcup_{(q, A \to \alpha)\ \mathit{lookback}\ \circ\ \mathit{includes}^{*}\ \circ\ \mathit{reads}^{*}\ (r, C)} \mathrm{DR}(r, C). \tag{11} \]

This computation for LALR(1) lookahead sets is highly efficient. It can entirely be performed on the LR(0) automaton, and the union can be interleaved with a fast transitive closure algorithm [14] on the includes and reads relations.

Since we have a very efficient and widely adopted computation for the canonical LALR(1) lookahead sets, why not try to use it for the noncanonical ones?

Theorem 2

\[ \mathrm{RLA}(q, A \to \alpha) = \{ X \mid X \Rightarrow^{*} ax,\ \psi \Rightarrow^{*} \varepsilon,\ C \to \rho B{\cdot}\psi X\sigma \in \mathrm{Kernel}(\delta\rho B) \text{ and } (q, A \to \alpha)\ \mathit{lookback}\ \circ\ \mathit{includes}^{*}\ ([\delta\rho], B) \}. \]

This theorem is consistent with the description of Section 2.2, where we said that C was a totally reduced lookahead for reduction B → a in q1: item S → B·C is in the kernel of state q3 accessed by (q0, B), and (q1, B → a) lookback (q0, B).

Theorem 3 Let us extend the directly reads function of (10) to DR([δ], A) = {X | ([δA], X) and X ⇒* ax}; then

\[ \mathrm{DLA}(q, A \to \alpha) = \bigcup_{(q, A \to \alpha)\ \mathit{lookback}\ \circ\ \mathit{includes}^{*}\ \circ\ \mathit{reads}^{*}\ (r, C)} \mathrm{DR}(r, C). \]

We are still consistent with the description of Section 2.2 since, using this new definition of the DR function, DR(q0, B) is {a, C, A}.

To find the valid covers that approximate a sentential form prefix using the LR(0) automaton and to find the LALR lookahead sets wind up being very similar operations. This allows us to reuse our relational computations for the automaton construction itself, as illustrated by the following theorem.

Theorem 4 Noncanonical state ⟦δ⟧ is the set of LR(0) states defined by ⟦ε⟧ = {[ε]} and

\[ [\![\delta X]\!] = \{ [\gamma CX] \mid X \in \mathrm{CLA}([\![\delta]\!], A \to \alpha),\ q \in [\![\delta]\!] \text{ and } (q, A \to \alpha)\ \mathit{lookback}\ \circ\ \mathit{includes}^{*}\ \circ\ \mathit{reads}^{*}\ ([\gamma], C) \} \cup \{ [\varphi X] \mid [\varphi] \in [\![\delta]\!] \}. \]
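
The relational computations of this section can be tried out on G1. The Python sketch below is an illustration under several assumptions, not the paper's implementation: all function names are hypothetical; it relies on G1 having no empty rules, so that reads is empty, γ in (8) is always ε, and every symbol is non null; and it reads the transition test of Definition 4 as "some LR(0) state of s has a transition on X", which is the reading that reproduces CLA(s1, A → a) = {a}. It builds the LR(0) automaton with Equations (3) and (4), then derives DLA and RLA from lookback and includes as in Theorems 2 and 3.

```python
from collections import defaultdict

# G1 (Section 2.1) augmented with S' -> S $ as rule 0; symbols are single
# characters, uppercase for nonterminals. Assumes a grammar without empty rules.
RULES = [("S'", "S$"), ("S", "BC"), ("S", "AD"), ("A", "a"), ("B", "a"),
         ("C", "CA"), ("C", "A"), ("D", "aD"), ("D", "b")]

def closure(kernel):
    """Valid item set of Equation (4); an item is a (rule index, dot) pair."""
    items, todo = set(kernel), list(kernel)
    while todo:
        rule, dot = todo.pop()
        for i, (lhs, _) in enumerate(RULES):
            if RULES[rule][1][dot:dot + 1] == lhs and (i, 0) not in items:
                items.add((i, 0))
                todo.append((i, 0))
    return items

def build_lr0():
    """LR(0) automaton: a state is its frozen kernel; goto maps (state, X)."""
    start = frozenset({(0, 0)})
    goto, todo = {}, [start]
    while todo:
        state = todo.pop()
        moves = defaultdict(set)
        for rule, dot in closure(state):
            for symbol in RULES[rule][1][dot:dot + 1]:
                moves[symbol].add((rule, dot + 1))         # Equation (3)
        for symbol, kernel in moves.items():
            target = frozenset(kernel)
            if target not in goto.values():
                todo.append(target)
            goto[state, symbol] = target
    return start, goto

def walk(goto, state, symbols):
    for symbol in symbols:
        state = goto[state, symbol]
    return state

def relations(start, goto):
    """lookback (7) and includes (8), the latter with an empty gamma."""
    lookback, includes = defaultdict(set), defaultdict(set)
    for p in {start} | set(goto.values()):
        for rule, dot in closure(p):
            lhs, rhs = RULES[rule]
            if dot == 0 and (p, lhs) in goto:
                lookback[walk(goto, p, rhs), rule].add((p, lhs))
                if rhs[-1].isupper():                      # rule ends with a nonterminal
                    includes[walk(goto, p, rhs[:-1]), rhs[-1]].add((p, lhs))
    return lookback, includes

def dla_rla(goto, lookback, includes, q, rule):
    """DLA (Theorem 3) and RLA (Theorem 2) of the reduction by rule in state q."""
    seen, todo = set(lookback[q, rule]), list(lookback[q, rule])
    while todo:                                            # lookback . includes*
        for nxt in includes[todo.pop()] - seen:
            seen.add(nxt)
            todo.append(nxt)
    dla, rla = set(), set()
    for r, c in seen:
        target = goto[r, c]
        dla |= {x for (s, x) in goto if s == target}       # extended DR(r, C)
        rla |= {RULES[i][1][d] for i, d in target if d < len(RULES[i][1])}
    return dla, rla

def nla(goto, lookback, includes, s, rule):
    """CLA and NLA (Definition 4) for rule in the set s of LR(0) states.
    Assumption: a transition on X from any member of s counts as a conflict."""
    def reductions(q):
        return [i for i, d in q if d == len(RULES[i][1])]
    shifts = {x for (q, x) in goto if q in s}
    union, cla = set(), set()
    for q in s:
        if rule in reductions(q):
            dla, rla = dla_rla(goto, lookback, includes, q, rule)
            others = set()
            for p in s:
                for other in reductions(p):
                    if other != rule:
                        others |= dla_rla(goto, lookback, includes, p, other)[0]
            union |= dla
            cla |= (dla - rla) & (shifts | others)
    return cla, union - cla

start, goto = build_lr0()
lookback, includes = relations(start, goto)
q1 = goto[start, "a"]                                      # A -> a. and B -> a.
q5 = goto[goto[start, "B"], "a"]                           # A -> a.
q8 = goto[goto[start, "A"], "a"]                           # D -> a.D
print(dla_rla(goto, lookback, includes, q5, 3))            # DLA, RLA of A -> a in q5
print(nla(goto, lookback, includes, {q1}, 3))              # CLA, NLA of A -> a in q1
print(nla(goto, lookback, includes, {q5, q8}, 3))          # CLA, NLA of A -> a in s1
```

A grammar with empty rules would additionally need a nullable-symbol computation, the reads relation of (9), and the non-null filter on lookahead symbols; these are deliberately left out of the sketch.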

4.6 Practical Construction Steps

We present here a more informal construction, with the main steps leading to the construction of a NLALR(1) parser, given the LR(0) automaton.

1. Associate a noncanonical state s = {q} with each LR(0) state q.
2. Iterate while there exists an inadequate² state s:
   (a) if it has not been done before, compute the RLA and DLA lookahead sets for the reductions involved in the conflict; save their values for the reduction and LR(0) state involved;
   (b) compute the CLA and NLA lookahead sets for s;
   (c) set the lookaheads to NLA for the reduction actions in s;
   (d) if the NLA lookahead sets leave the state inadequate, meaning there is a conflict on a totally reduced lookahead, then report the conflict, and use a conflict resolution policy or terminate with an error;
   (e) if CLA is not empty, create transitions on its symbols and create new states if no fusion occurs. New states get new transition and reduction sets computed from the LR(0) states they contain. If these new states result from shift/reduce conflicts, the transitions from s on the conflicting lookahead symbol now lead to the new states.

This process always terminates since there is a bounded number of LR(0) states and thus a bounded number of noncanonical states.

Let us conclude this section with a few words on the size of the generated parsers. Since NLALR(1) states are sets of LR(0) states, we find an exponential function of the size of the LR(0) automaton as an upper bound on the size of the NLALR(1) automaton. This bound seems however pretty irrelevant in practice. The NLALR(1) parser generator needs to create a new state for each lookahead causing a conflict, which does not happen so often. All the grammars we studied created transitions to canonical states very quickly afterwards. Experimental results with NSLR(1) parsers show that the increase in size is negligible in practice [13].

5 Conclusion

We have presented a construction for noncanonical LALR(1) parsers. Such parsers are practical for some difficult syntax problems.
They improve on both noncanonical SLR(1) parsers and canonical LALR(1) parsers, and their generation is only slightly more complex while their size and their performances are comparable.

For practical uses, we feel we would need an unbounded lookahead version of NLALR parsers. Though the cost to pay might be a quadratic parsing time in the worst case, the freedom offered to the grammar writer would probably still be worth it. The ability to specify finer equivalence relations instead of the LR(0) one would prove its usefulness in this setting where precision becomes critical.

In complement to previous theoretical work on noncanonical parsing [12], it would be interesting to formally study practical noncanonical parsers. To this end, we expect the concept of valid covers modulo an equivalence relation to be a good starting point.

² We mean here inadequate in the LR(0) sense, thus no lookaheads need to be computed.

Acknowledgements

The author is highly grateful to Jacques Farré and Ana Almeida Matos for their invaluable help in the preparation of this paper.

References

[1] Alfred V. Aho and Jeffrey D. Ullman. The Theory of Parsing, Translation, and Compiling. Volume I: Parsing. Series in Automatic Computation. Prentice Hall, 1972. ISBN 0-13-914556-7. URL http://portal.acm.org/citation.cfm?id=series11430.578789.

[2] Manuel E. Bermudez and Karl M. Schimpf. Practical arbitrary lookahead LR parsing. Journal of Computer and System Sciences, 41(2):230–250, 1990. ISSN 0022-0000. doi: 10.1016/0022-0000(90)90037-L.

[3] Philippe Charles. A Practical method for Constructing Efficient LALR(k) Parsers with Automatic Error Recovery. PhD thesis, New York University, May 1991. URL http://jikes.sourceforge.net/documents/thesis.pdf.

[4] Karel Čulik and Rina Cohen. LR-Regular grammars - an extension of LR(k) grammars. Journal of Computer and System Sciences, 7(1):66–96, 1973. ISSN 0022-0000. doi: 10.1016/S0022-0000(73)80050-9.

[5] Frank DeRemer and Thomas Pennello. Efficient computation of LALR(1) look-ahead sets. ACM Transactions on Programming Languages and Systems, 4(4):615–649, 1982. ISSN 0164-0925. doi: 10.1145/69622.357187.

[6] Jay Earley. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94–102, 1970. ISSN 0001-0782. doi: 10.1145/362007.362035.

[7] Jacques Farré and José Fortes Gálvez. A bounded-connect construction for LR-Regular parsers. In Reinhard Wilhelm, editor, CC'01, volume 2027 of Lecture Notes in Computer Science, pages 244–258. Springer, 2001. URL http://springerlink.com/content/e3e8g77kxevkyjfd.
[8] Jacques Farré and José Fortes Gálvez. Bounded-connect noncanonical discriminating-reverse parsers. Theoretical Computer Science, 313(1):73–91, 2004. ISSN 0304-3975. doi: 10.1016/j.tcs.2003.10.006.

[9] Stephen C. Johnson. YACC - yet another compiler compiler. Computing science technical report 32, AT&T Bell Laboratories, Murray Hill, New Jersey, July 1975.

[10] Sylvain Schmitz. Noncanonical LALR(1) parsing. Technical Report I3S/RR-2005-21-FR, Laboratoire I3S, November 2005. URL http://www.i3s.unice.fr/ mh/rr/2005/rr-05.21-s.schmitz.pdf.

[11] Seppo Sippu and Eljas Soisalon-Soininen. Parsing Theory, Vol. II: LR(k) and LL(k) Parsing, volume 20 of EATCS Monographs on Theoretical Computer Science. Springer, 1990. ISBN 3-540-51732-4.

[12] Thomas G. Szymanski and John H. Williams. Noncanonical extensions of bottom-up parsing techniques. SIAM Journal on Computing, 5(2):231–250, 1976. ISSN 0097-5397. doi: 10.1137/0205019.

[13] Kuo-Chung Tai. Noncanonical SLR(1) grammars. ACM Transactions on Programming Languages and Systems, 1(2):295–320, 1979. ISSN 0164-0925. doi: 10.1145/357073.357083.

[14] Robert E. Tarjan. Depth first search and linear graph algorithms. SIAM Journal on Computing, 1(2):146–160, 1972. ISSN 0097-5397. doi: 10.1137/0201010.

[15] Masaru Tomita. Efficient Parsing for Natural Language. Kluwer Academic Publishers, 1986. ISBN 0-89838-202-5.