PARSING AND TRANSLATION November 2010 prof. Ing. Bořivoj Melichar, DrSc. doc. Ing. Jan Janoušek, Ph.D. Ing. Ladislav Vagner, Ph.D.



Preface

More than 40 years of development in the area of compiler construction provide us with the following main results:

- The basic principles of translation process decomposition are now well understood. Moreover, the decomposition corresponds closely to the organization of the translator modules.
- There exist formal methods for describing languages, as well as formal methods for describing translations. In addition to their exactness, these methods allow a translation to be described without direct reference to its algorithmic implementation.
- There are methods that can be used to create a parser or a translator. These methods construct the parser or compiler program from the formal description of the language or translation, respectively.
- There are several utility programs that allow automated construction of compilers or their parts.

The theory of formal languages, grammars, and automata plays an important role in the development of compiler construction methods. One of the most important results of this theory is the parsing theory. It provides us with several parsing algorithms, the most important of which are the parsing algorithms for LL and LR grammars. These algorithms are in common use in today's compilers.

A milestone in the development of compiler construction methods is the concept of syntax-directed translation. The basis of this concept is the fact that the parser can take over the entire translation process. This idea led to the theory of formal translation, translation grammars, attribute grammars, and translation automata. The most important practical results of all the above-mentioned theories are algorithms that construct an algorithmic implementation of a parsing or translation method from its non-procedural description. Such algorithms are the basis of program tools for automated construction of compiler parts.
Even though the development of the above-mentioned theories is far from finished, our current knowledge in the area of compiler construction represents a large theoretical basis that is not matched in any other area of computer software development. A consequence of this fact are approaches that employ compiler construction principles in other areas, such as text editors, syntax-directed editors, information and database systems, pattern matching systems, text formatting systems, image drawing systems, and many other program products.

Prague, November 2010
Authors

Contents

1 Notions used in this textbook
2 LR grammars and languages
   Strong LR grammars
   Weak LR grammars
   LR(0) grammars
   Simple LR(k) grammars
   LALR grammars
   LR(k) grammars
   Properties of LR grammars and languages
3 Formal translation and bottom-up parsing
   Pushdown translation automata and postfix grammars
   Formal translation directed by LR parsing
   Postfix translation grammars with LR(k) input grammars
   LR(k) translation grammars
   Translation grammars with LR(k) input grammars
4 Attributed translation directed by LR parser
   S attributed grammars
   LR attributed translation grammars
5 Parallel parsing
   Fundamental parallel algorithms
      Parallel algorithm performance evaluation
      Parallel reduction
      Parallel prefix sum
      Parallel parentheses matching
      Parallel finite automaton
   Parallel LL parsing
      Parallel parser structure
      Nondeterministic parallel LL parsing
      Deterministic parallel LL parsing
      LLP(1,k) grammars
      LLP(q,k) grammars
      Performance analysis
      An optimal EREW PRAM algorithm
      LLP grammars and languages
   Parallel LR parsing
      Sequential LR parsing
      Ideal parallel LR parsing
      Deterministic parallel LR parsing
      Gluing processes
6 Parsing with Reduced Pushdown Store Activity
   Reductions between Shifts of Two Adjacent Symbols
   Faster GLR Parsing
   Some Empirical Results
   Reconstructing Derivations
Bibliography

List of Figures

2.1 LR automaton goto function as a graph
LR automaton from Example
LR automaton from Example
LR automaton for grammar from Example
LR automaton for LALR(1) grammar from Example
LR automaton for LR(1) grammar from Example
Transitions of translation automaton
LR parsing directed translation of string a+a*a from Example
Formal translation of an input string by LR(1) grammar from Example
Translation trees for pair (aabb, xxyyyy) from Examples 3.21 and
GOTO graph for grammar from Example
Attribute translation tree for string id;k[25]
Derivation tree
Parallel reduction algorithm
Parallel prefix sum algorithm
Parallel parentheses matching algorithm
Parallel finite automaton algorithm
Parsing table for the grammar from Example
Parallel LL parser structure
Processor network for parallel parsing of string a+a*a
Possible leaf processes for the grammar from Example
Parallel parsing of input string a+a*a
Possible processes for given lookahead and lookback strings
Deterministic parallel LL parsing for input string a+a*a
Deterministic parallel LL parsing with time optimal gluing
Relation between LL, LLP and regular languages
Parallel parsing and gluing for input string a+a*a
Partial parsing and some gluing for input string a]
Parsing table of standard LR(1) parser for expression grammar, generated by SLR technique
Our pushdown automaton for the expression grammar. The subroutine recognizing G_E is enclosed in the dashed box, and the grayed circle is the start state
Trace of standard LR parser (left), and our automaton (right)
6.4 The optimized pushdown automaton. Each edge is labelled by a triple a, x, y, where a is the symbol to be read, x is the change of the pushdown store, and y is the output
Trace of optimized pushdown automaton
Timing results for some ambiguous grammars
Timing results for the expression grammar from Example
A sample reduction log and derivations. Squares are reduction nodes and circles are fan-in nodes

Chapter 1

Notions used in this textbook

In this textbook, many terms from the theory of formal translation, attributed grammars, and the theory of translation automata will be introduced. We will use terms from logic, set theory, relation theory, graph theory, grammar theory, and automata theory to define the above-mentioned terms. These terms will be recalled in this section.

Logic

A statement is a sentence that can be unambiguously decided to be either true or false. To build complex assertions, we use the following operators: logical conjunction ∧, logical disjunction ∨, logical implication ⇒, and equivalence ⇔. Let P and Q be assertions. Then:

P ∧ Q is true if both P and Q are true simultaneously,
P ∨ Q is true if at least one of P and Q is true,
P ⇒ Q (P implies Q) is not true if and only if P is true and Q is false,
P ⇔ Q is true if and only if both P and Q are simultaneously true or both are simultaneously false.

Let us present an example statement: (A = B) ⇔ (A ⊆ B ∧ B ⊆ A). The statement says that two sets A and B are equal if and only if each set is a subset of the other. We will use the following equivalence in some proofs: let P and Q be arbitrary assertions. Then it holds that (P ⇔ Q) ⇔ [(P ⇒ Q) ∧ (Q ⇒ P)]. In other words, two statements P and Q are equivalent if and only if P implies Q and Q implies P.

The symbol ∀ denotes the universal quantifier, the symbol ∃ the existential quantifier. The notion ∀x P means that P holds for all x; the notion ∃x P means that there exists x such that P holds.
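The proof equivalence above can be checked mechanically by enumerating all truth assignments. A small sketch in Python (the function names are illustrative, not from the text):

```python
from itertools import product

def implies(p, q):
    # Material implication: P => Q is false only when P is true and Q is false.
    return (not p) or q

def equivalence_law_holds():
    # Checks (P <=> Q) <=> ((P => Q) and (Q => P)) over all truth assignments.
    return all((p == q) == (implies(p, q) and implies(q, p))
               for p, q in product((True, False), repeat=2))
```

Since each operator is a function of the truth values alone, a four-row truth table suffices as a complete proof of the equivalence.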

Sets

We will use the term set in the usual intuitive meaning. The notion x ∈ X denotes that x is an element of the set X; x ∉ X denotes the fact that x is not an element of the set X. Inclusion between two sets is denoted X ⊆ Y, which means that every element of the set X is also an element of the set Y, i.e. X is a subset of Y. The notion X = {x : P(x)} means that the set X contains exactly the elements x that satisfy the condition P(x). Some examples from set theory follow:

union: A ∪ B = {x : x ∈ A ∨ x ∈ B},
intersection: A ∩ B = {x : x ∈ A ∧ x ∈ B},
difference: A − B = {x : x ∈ A ∧ x ∉ B},
cartesian product: A × B = {(x,y) : x ∈ A ∧ y ∈ B},
power set: 2^A = {B : B ⊆ A}.

A set that contains only a finite number of elements is called a finite set. Finite sets may be specified by exhaustive enumeration of their elements. The notion X = {a,b,c} means that the set X contains just the three elements a, b, and c (it does not contain any other elements). The empty set is denoted by the symbol ∅. If A ∩ B = ∅ holds for sets A and B, then A and B are disjoint sets.

Relations

A binary relation between elements of set A and elements of set B is every set R where R ⊆ A × B. The fact (x,y) ∈ R is often denoted xRy. A relation over set A is every relation R ⊆ A × A. Such a relation can be:

reflexive: if for all x ∈ A it holds that xRx,
symmetric: if for all x,y ∈ A it holds that xRy ⇒ yRx,
antisymmetric: if xRy and yRx implies that x = y,
transitive: if xRy and yRz implies that xRz.

The product of a relation R ⊆ A × B and a relation S ⊆ B × C is the relation R ∘ S = {(x,y) : ∃z ∈ B : (x,z) ∈ R ∧ (z,y) ∈ S}. The k-th power of a relation R over set A is defined for k ≥ 0: R^0 = {(x,x) : x ∈ A}, R^k = R^(k−1) ∘ R for k > 0. The transitive closure of a relation R over set A is the relation R^+ = ⋃_{k≥1} R^k. The transitive and reflexive closure of a relation R over set A is the relation R^* = ⋃_{k≥0} R^k.

Mapping

A mapping is a special case of a relation. A mapping from set A to set B is any relation F ⊆ A × B such that for every x ∈ A there exists at most one y ∈ B such that xFy. For mappings, we use the notion F(x) = y instead of xFy. The element y denotes the value of F for x. A mapping F from set A to set B is denoted F : A → B. If there is some x ∈ A for which the value F(x) is not defined, then F is a partial mapping. The opposite case is a complete mapping, which is defined for all x ∈ A.

Graphs

An oriented graph is a pair (V,H), where V is a finite set of nodes and H ⊆ V × V is a set of edges.

Edges are denoted (x,y) ∈ H, where x stands for the node where the edge starts and y denotes the node where the edge ends (leads to). A finite sequence of edges (x_0,x_1), (x_1,x_2), ..., (x_{n−1},x_n), in which no node occurs more than once, is called a path of length n from node x_0 to node x_n. A graph (V,H) is a tree if it contains exactly one node into which no edge leads (this node is the root node), and for every node y different from the root node there exists a path from the root node to y. The nodes of a tree from which no edge starts are the leaves.

Formal languages

An alphabet is an arbitrary finite nonempty set of elements called symbols. A string over an alphabet is any finite sequence of symbols from the alphabet. The empty sequence is also a string, the empty string, which is denoted ε. The set of all strings over alphabet T is denoted T^*. The set of all nonempty strings over T is T^+. It holds that T^* = T^+ ∪ {ε}. If strings x and y are from T^*, then z = xy is the concatenation of strings x and y. The length of a string is the number of symbols the string consists of. The length of a string x is denoted |x|.

A formal language L over an alphabet T is an arbitrary subset of T^*. The complement of a language L_1 over alphabet T is the language L_2 = T^* − L_1. The product of languages L_1 and L_2 is the language L = L_1.L_2 = {xy : x ∈ L_1 ∧ y ∈ L_2}. The k-th power of a language L over T is defined for k ≥ 0 as follows: L^0 = {ε}, L^k = L^(k−1).L for k > 0. The iteration of a language L is the language L^* = ⋃_{n≥0} L^n. The positive iteration of a language L is the language L^+ = ⋃_{n≥1} L^n.

Grammars

A grammar is a four-tuple G = (N,T,P,S), where N is a finite set of nonterminal symbols (nonterminals for short), T is a finite set of terminal symbols, P ⊆ (N ∪ T)^*.N.(N ∪ T)^* × (N ∪ T)^* is a finite set of rules (a rule (α,β) from P is usually denoted α → β), and S ∈ N is the starting symbol. A context-free grammar is a grammar whose rules are of the form A → α, A ∈ N, α ∈ (N ∪ T)^*.

A regular grammar is a grammar whose rules are of the form A → a or A → aB, where A,B ∈ N, a ∈ T. The relation α ⇒ β, a subset of (N ∪ T)^* × (N ∪ T)^*, is a derivation in a grammar G if α = γXδ, β = γωδ, γ,δ,ω ∈ (N ∪ T)^*, X ∈ N, X → ω ∈ P. The k-th power, transitive closure, and reflexive and transitive closure of the derivation relation are denoted ⇒^k, ⇒^+, and ⇒^*, respectively. The language L generated by a grammar G is the set L(G) = {x : x ∈ T^* ∧ S ⇒^* x}.

Finite automata

A (nondeterministic) finite automaton is a quintuple A = (Q,T,δ,q_0,F), where Q is a finite set of states, T is an input alphabet, δ is a mapping from Q × T to 2^Q, q_0 ∈ Q is the initial state, and F ⊆ Q is a set of final states. A pair (q,w) ∈ Q × T^* is called a configuration of the finite automaton; (q_0,w) is the initial configuration, and (q,ε), where q ∈ F, is a final (accepting) configuration. The relation (q,aw) ⊢ (p,w), a subset of (Q × T^*) × (Q × T^*), is a transition of the automaton A if p ∈ δ(q,a). The k-th power, transitive closure, and transitive and reflexive closure of the relation ⊢ are denoted ⊢^k, ⊢^+, and ⊢^*, respectively. A finite automaton A is deterministic if it holds that ∀q ∈ Q, a ∈ T : |δ(q,a)| ≤ 1.
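The transition mapping and the determinism condition can be simulated directly by tracking the set of states reachable from q_0 while reading the input. A sketch, using a small hypothetical automaton (not from the text) that accepts strings over {a, b} ending in "ab":

```python
def accepts(delta, q0, final, word):
    # delta maps (state, symbol) to the set of successor states.
    states = {q0}                      # states reachable after the prefix read so far
    for a in word:
        states = {p for q in states for p in delta.get((q, a), set())}
    return bool(states & final)        # some reachable state is final

def is_deterministic(delta):
    # |delta(q, a)| <= 1 for every state q and input symbol a.
    return all(len(s) <= 1 for s in delta.values())

# Hypothetical automaton accepting strings over {a, b} that end in "ab".
DELTA = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'b'): {2}}
```

The set-of-states simulation works for deterministic and nondeterministic automata alike; for the automaton above, `is_deterministic` reports nondeterminism because δ(0, a) = {0, 1}.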

The language L accepted by a finite automaton A is the set L(A) = {x : x ∈ T^* ∧ (q_0,x) ⊢^* (q,ε), q ∈ F}.

Pushdown automata

A (nondeterministic) pushdown automaton is a seven-tuple M = (Q,T,G,δ,q_0,Z_0,F), where Q is a finite set of states, T is an input alphabet, G is a pushdown store alphabet, δ is a mapping from Q × (T ∪ {ε}) × G^* into the set of finite subsets of Q × G^*, q_0 ∈ Q is the initial state, Z_0 ∈ G is the initial contents of the pushdown store, and F ⊆ Q is the set of final (accepting) states.

A triplet (q,w,x) ∈ Q × T^* × G^* denotes a configuration of the pushdown automaton. The initial configuration of the pushdown automaton is the triplet (q_0,w,Z_0) for an input word w ∈ T^*. The relation (q,aw,αβ) ⊢ (p,w,γβ), a subset of (Q × T^* × G^*) × (Q × T^* × G^*), is a transition of the pushdown automaton M if (p,γ) ∈ δ(q,a,α). The k-th power, transitive closure, and transitive and reflexive closure of the relation ⊢ are denoted ⊢^k, ⊢^+, and ⊢^*, respectively. A pushdown automaton M is deterministic if it holds that:

1. |δ(q,a,γ)| ≤ 1 for all q ∈ Q, a ∈ T ∪ {ε}, γ ∈ G^*.
2. If δ(q,a,α) ≠ ∅, δ(q,a,β) ≠ ∅, and α ≠ β, then α is not a suffix of β and β is not a suffix of α.
3. If δ(q,a,α) ≠ ∅ and δ(q,ε,β) ≠ ∅, then α is not a suffix of β and β is not a suffix of α.

The language L accepted by a pushdown automaton M is defined in two distinct ways:

Accepting by final state: L(M) = {x : (q_0,x,Z_0) ⊢^* (q,ε,γ), x ∈ T^*, γ ∈ G^*, q ∈ F},
Accepting by empty pushdown store: L_ε(M) = {x : (q_0,x,Z_0) ⊢^* (q,ε,ε), x ∈ T^*, q ∈ Q}.
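Both acceptance modes can be simulated by searching the configuration space. A sketch of acceptance by empty pushdown store, with a hypothetical PDA for {aⁿbⁿ : n ≥ 1}; the choice to keep the stack top as the leftmost character of the stack string is an assumption of this sketch:

```python
from collections import deque

def accepts_by_empty_store(delta, q0, z0, word):
    # delta: dict (state, symbol or '' for eps, stack top) -> set of (state, pushed string).
    # Breadth-first search over configurations (state, remaining input, stack).
    start = (q0, word, z0)
    seen, queue = {start}, deque([start])
    while queue:
        q, w, stack = queue.popleft()
        if w == "" and stack == "":
            return True                       # input consumed, store empty
        moves = []
        if w and stack:                       # reading moves
            moves += [(w[1:], t) for t in delta.get((q, w[0], stack[0]), ())]
        if stack:                             # eps-moves
            moves += [(w, t) for t in delta.get((q, "", stack[0]), ())]
        for rest, (p, push) in moves:
            conf = (p, rest, push + stack[1:])
            if conf not in seen:
                seen.add(conf)
                queue.append(conf)
    return False

# Hypothetical PDA for {a^n b^n : n >= 1}, accepting by empty pushdown store.
DELTA = {
    ('q', 'a', 'Z'): {('q', 'AZ')},
    ('q', 'a', 'A'): {('q', 'AA')},
    ('q', 'b', 'A'): {('p', '')},
    ('p', 'b', 'A'): {('p', '')},
    ('p', '', 'Z'): {('p', '')},
}
```

Since every move either consumes an input symbol or shrinks the store, the search space here is finite; a general simulator would need a bound on ε-move chains.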

Chapter 2

LR grammars and languages

This chapter introduces a group of parsing algorithms which create the parsing tree of the input string from the bottom to the top. These algorithms are named LR parsers, since they read the input string from left to right and they produce the right parse of the input. The algorithm may use information about the nearest k symbols of the unread part of the input string. Grammars which allow such a parser to be constructed are named LR(k) grammars.

The basic principle of an LR parser can be stated as follows. Let G = (N,T,P,S) be an unambiguous context-free grammar and let w = a_1 a_2 ... a_n be an input string from the language L(G). Then there exists a rightmost derivation S = γ_1 ⇒ γ_2 ⇒ ... ⇒ γ_m = w. Since the mentioned derivation is a rightmost one, every sentential form γ_i (i = 1,2,...,m−1) is of the form γ_i = αAa_j a_{j+1}...a_n, where A ∈ N, α ∈ (N ∪ T)^*, and the string a_j a_{j+1}...a_n ∈ T^* is a suffix of the input string w. Suppose γ_{i−1} = αBz and a rule B → β is used in the derivation step γ_{i−1} ⇒ γ_i (that is, αBz ⇒ αβz). The main problem of deterministic bottom-up parsing is to find the correct string β in the sentential form γ_i = αβz. If the string is found, the sentential form γ_i can be reduced to the sentential form γ_{i−1}.

The model of a bottom-up parser is the pushdown automaton. Such an automaton is, in general, nondeterministic and thus cannot be directly used as a parser. Let us consider how a deterministic pushdown automaton can be constructed for a given grammar and what conditions must be satisfied. Given a context-free grammar, a pushdown automaton can be constructed whose transition mapping δ is defined as follows (remember, the top of the pushdown store is on the right-hand side):

1. δ(q,a,ε) = {(q,a)} for all a ∈ T,
2. δ(q,ε,α) = {(q,A) : A → α ∈ P},
3. δ(q,ε,#S) = {(r,ε)}.

The operations are called shift (1), reduce (2), and accept (3). The construction shown above leads to a pushdown automaton that is nondeterministic in all cases.
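The nondeterministic shift/reduce construction can be simulated by backtracking over the two kinds of moves. A sketch, assuming a hypothetical toy grammar S → aSb | ab with single-character symbols and no ε-rules (the grammar is an illustration, not from the text):

```python
def recognizes(rules, start, word):
    # rules: list of (lhs, rhs) pairs over single-character symbols, no
    # eps-rules; the stack top is on the right, '#' marks the stack bottom.
    seen = set()

    def step(stack, i):
        if (stack, i) in seen:                # already explored and failed
            return False
        seen.add((stack, i))
        if stack == '#' + start and i == len(word):
            return True                       # accept: pushdown holds #S
        for lhs, rhs in rules:                # try every reduce move
            if stack.endswith(rhs) and step(stack[:-len(rhs)] + lhs, i):
                return True
        # try the shift move
        return i < len(word) and step(stack + word[i], i + 1)

    return step('#', 0)
```

The depth-first search mirrors the nondeterminism: at each configuration, every applicable reduce and the shift are tried. A deterministic parser must instead commit to one of these moves, which is the subject of the rest of the chapter.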
The reason resides in the fact that a shift is defined by a transition δ(q,a,ε) = {(q,a)} and a reduce by a transition δ(q,ε,α) = {(q,A)}. For these transitions, the string ε is a prefix (and also a suffix) of the string α, which violates the determinism conditions. To obtain a truly deterministic pushdown automaton, the construction has to be changed. The main problem of the construction is the fact that shift operations are performed regardless of the contents of the pushdown store. Therefore, we will attempt to modify the automaton in such a way that it can decide which operation (shift or reduce) to perform based on the symbol on the top of the pushdown store. We will demonstrate the technique in the following example.

Example 2.1:
Let a context-free grammar be G = ({S,A,B},{a,b,c,d},P,S), where P contains the rules:

(1) S → Aa
(2) A → bB
(3) A → Ac
(4) B → d

A pushdown automaton for that grammar can be constructed as follows: R = ({q,r},{a,b,c,d},{S,A,B,a,b,c,d,#},δ,q,#,{r}), where the transition mapping δ is defined:

1. δ(q,a,ε) = {(q,a)}
   δ(q,b,ε) = {(q,b)}
   δ(q,c,ε) = {(q,c)}
   δ(q,d,ε) = {(q,d)}
2. δ(q,ε,Aa) = {(q,S)}
   δ(q,ε,bB) = {(q,A)}
   δ(q,ε,Ac) = {(q,A)}
   δ(q,ε,d) = {(q,B)}
3. δ(q,ε,#S) = {(r,ε)}

This pushdown automaton is nondeterministic because of how its shifts are defined. Taking the contents of the pushdown store into account, the shifts may instead be defined as follows:

δ(q,a,A) = {(q,Aa)}   — symbols a and c appear in a sentential form only after
δ(q,c,A) = {(q,Ac)}     the symbol A,
δ(q,b,#) = {(q,#b)}   — symbol b can appear only at the beginning of a sentential form,
δ(q,d,b) = {(q,bd)}   — symbol d can appear only just after symbol b.

This modification leads to a deterministic pushdown automaton for the given grammar. However, the technique is not universal; it can be used only for a limited class of grammars (strong LR(0) grammars). For other grammars, the modification will not work and the resulting pushdown automaton will not be deterministic.

The bottom-up parser is similar to the top-down parser in that both parsers can use the following additional information to choose the next operation while parsing:

1. information about the not-yet-read part of the input string,
2. information about the parsing done in the past.

There are grammars that can be deterministically parsed by a bottom-up parser with additional information about up to k closest symbols of the unread part of the input string. These grammars are strong LR(k) grammars. In the next sections we will study two classes of LR grammars. First, we will introduce strong LR(k) grammars. Then, weak LR(k) grammars will be studied. Deterministic parsing of weak LR(k) grammars must use information about the parsing history.
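The modified transitions of Example 2.1 can be encoded directly: shifts are permitted only for the listed (stack top, input symbol) pairs, and a reduction is applied whenever a right-hand side appears on the top of the pushdown store. A minimal sketch of this deterministic automaton:

```python
# Stack top is on the right. Shifts are allowed only for the stack-top /
# input-symbol pairs derived in Example 2.1; reductions pop a rule's
# right-hand side from the top of the pushdown store.
SHIFTS = {('A', 'a'), ('A', 'c'), ('#', 'b'), ('b', 'd')}
REDUCES = [('Aa', 'S'), ('bB', 'A'), ('Ac', 'A'), ('d', 'B')]

def parse_deterministic(word):
    stack, i = '#', 0
    while True:
        if stack == '#S' and i == len(word):
            return True                      # corresponds to accept
        for rhs, lhs in REDUCES:
            if stack.endswith(rhs):          # reduce
                stack = stack[:-len(rhs)] + lhs
                break
        else:
            if i < len(word) and (stack[-1], word[i]) in SHIFTS:
                stack += word[i]             # shift
                i += 1
            else:
                return False                 # error
```

At every configuration at most one move applies, so no backtracking is needed; this is exactly what determinism buys.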
Both classes of LR grammars use the same (up to slight modifications) parsing algorithm, which is based on the pushdown automaton. For both classes of grammars, a parsing table is used to decide whether a reduction is to be performed and which reduction is to be chosen. The parsing table contains all the necessary information. The table is constructed from the grammar; the construction algorithm differs between the two classes of LR grammars. We will describe the algorithms for the individual cases in the following sections.

2.1 Strong LR grammars

Strong LR grammars are context-free grammars for which there exists a deterministic bottom-up parser that:

1. uses information about up to k closest symbols of the not-yet-read part of the input string,
2. does not use information about the parsing history.

Before defining strong LR(k) grammars, we introduce the functions BEFORE and EFF_k.

Definition 2.2:
Let G = (N,T,P,S) be a context-free grammar, X ∈ N, and α ∈ (N ∪ T)^*. The functions BEFORE(X) and EFF_k(α) are defined as follows:

BEFORE(X) = {Y : S ⇒^* αYXβ, Y ∈ (N ∪ T)} ∪ {# : S ⇒^* Xβ},
EFF_k(α) = {w : w ∈ FIRST_k(α), and there exists a rightmost derivation α ⇒^* wx such that no sentential form of the derivation has the form Awx, A ∈ N}.

The set EFF_k(α) contains all strings from the set FIRST_k(α) that can be derived without a derivation step in which the first nonterminal, standing immediately in front of w, is substituted by the empty string. The name EFF stands for ε-free first.

Now we can define strong LR(k) grammars.

Definition 2.3:
A context-free grammar G = (N,T,P,S) is a strong LR(k) grammar if the augmented grammar G' = (N ∪ {S'},T,P ∪ {S' → S},S') meets the following criteria:

1. If P' contains a pair of rules of the form:
   (a) A → αX, B → βX,
   (b) A → αX, B → ε and X ∈ BEFORE(B), or
   (c) A → ε, B → ε and X ∈ BEFORE(B), X ∈ BEFORE(A),
   then FOLLOW_k(A) ∩ FOLLOW_k(B) = ∅.
2. If P' contains a pair of rules of the form:
   (a) A → αX, B → βXγ,
   (b) A → ε, B → βXγ and X ∈ BEFORE(A), or
   (c) A → ε, B → γ and X ∈ BEFORE(A), X ∈ BEFORE(B),
   then FOLLOW_k(A) ∩ EFF_k(γFOLLOW_k(B)) = ∅.

The first condition ensures that in the case of a reduction, it is possible to choose the correct rule for the reduction based on up to k lookahead symbols. The second condition guarantees that it is possible to decide whether a reduction or a shift operation is to be performed. Similarly to the top-down parser, the bottom-up parser uses a parsing table when choosing the next operation to perform. The table entries contain the appropriate operation, based on the topmost pushdown store symbol and on the lookahead string.
The parsing table for a strong LR(k) grammar can be constructed using the following algorithm.

Algorithm 2.4:
Construction of a parsing table for a strong LR(k) grammar.
Input: Strong LR(k) grammar G = (N,T,P,S).
Output: Parsing table p for G.
Method: The parsing table p is defined over (N ∪ T ∪ {#}) × T^{*k}, where T^{*k} denotes the set of strings over T of length at most k.

1. The input grammar G is augmented: G' = (N ∪ {S'},T,P ∪ {S' → S},S').
2. The parsing table p is constructed:
   (a) p(X,u) = reduce(i), if A → αX is the i-th rule in P and u ∈ FOLLOW_k(A).
   (b) p(X,u) = reduce(i), if A → ε is the i-th rule in P, X ∈ BEFORE(A), and u ∈ FOLLOW_k(A).
   (c) p(S,ε) = accept.
   (d) p(X,u) = shift, if B → βXγ ∈ P' and u ∈ EFF_k(γFOLLOW_k(B)).
   (e) p(X,u) = error in all other cases.

Example 2.5:
Given the grammar G = ({E,E',T,T',F},{a,+,*,(,)},P,E), where P contains the rules below, evaluate the parsing table for G.

(1) E → E'T
(2) E' → E+
(3) E' → ε
(4) T → T'F
(5) T' → T*
(6) T' → ε
(7) F → (E)
(8) F → a

We augment the grammar by the rule (0) S' → E. Grammar G is a strong LR(1) grammar, thus the parsing table may be constructed. The table is shown below. The operations in the table are denoted as follows: Sh means shift, R(i) means reduce(i), A means accept; error entries are left blank. When constructing the table, we used the facts that BEFORE(E') = {#,(} and BEFORE(T') = {E'}.

p  | a    | +    | *    | (    | )    | ε
---+------+------+------+------+------+-----
E  |      | Sh   |      |      | Sh   | A
E' | R(6) |      |      | R(6) |      |
T  |      | R(1) | Sh   |      | R(1) | R(1)
T' | Sh   |      |      | Sh   |      |
F  |      | R(4) | R(4) |      | R(4) | R(4)
a  |      | R(8) | R(8) |      | R(8) | R(8)
+  | R(2) |      |      | R(2) |      |
*  | R(5) |      |      | R(5) |      |
(  | R(3) |      |      | R(3) |      |
)  |      | R(7) | R(7) |      | R(7) | R(7)
#  | R(3) |      |      | R(3) |      |

Strong LR(k) parsing can be done using the following algorithm.

Algorithm 2.6:
Strong LR(k) parsing algorithm.
Input: Parsing table p for a grammar G = (N,T,P,S), the rules of G, and an input string w ∈ T^*.
Output: Right parse of w in case w ∈ L(G), error signaling otherwise.
Method: The algorithm reads symbols from the input string w, makes use of the pushdown store, and creates the string of the numbers of the rules which were used in the reductions. The initial pushdown store contents is #. The algorithm repeats the steps below until the input string is either accepted or rejected (error signaling). In the description, the symbol X denotes the symbol on the top of the pushdown store.

1. Evaluate the lookahead string (of length up to k); let it be u.
   (a) If p(X,u) = shift, one symbol is read from the input string and pushed onto the pushdown store.

   (b) If p(X,u) = reduce(i), the algorithm finds rule i; let it be A → α. The string α is removed (popped) from the pushdown store, the symbol A is pushed onto the pushdown store, and the rule number i is appended to the right parse. If the top of the pushdown store did not contain α (and thus it was not possible to remove it), an error is detected and parsing ends with error signaling.
   (c) If p(X,ε) = accept and the contents of the pushdown store is #X, the parsing was successful and the output string is the correct right parse of the input string. If the contents of the pushdown store is different, error signaling takes place instead.
   (d) If p(X,u) = error, the parsing ends with error signaling.

A configuration of the parsing algorithm is a triplet (α,x,π), where α is the contents of the pushdown store (the topmost symbol is on the right-hand side), x is the not-yet-read part of the input string, and π is the so-far created part of the output; (#,w,ε) is the initial configuration and (#S,ε,π) is the final configuration.

Example 2.7:
Let us demonstrate the parsing of the input string a+a*a. We will use the parsing table evaluated in Example 2.5.

(#, a+a*a, ε) ⊢ (#E', a+a*a, 3)
⊢ (#E'T', a+a*a, 36)
⊢ (#E'T'a, +a*a, 36)
⊢ (#E'T'F, +a*a, 368)
⊢ (#E'T, +a*a, 3684)
⊢ (#E, +a*a, 36841)
⊢ (#E+, a*a, 36841)
⊢ (#E', a*a, 368412)
⊢ (#E'T', a*a, 3684126)
⊢ (#E'T'a, *a, 3684126)
⊢ (#E'T'F, *a, 36841268)
⊢ (#E'T, *a, 368412684)
⊢ (#E'T*, a, 368412684)
⊢ (#E'T', a, 3684126845)
⊢ (#E'T'a, ε, 3684126845)
⊢ (#E'T'F, ε, 36841268458)
⊢ (#E'T, ε, 368412684584)
⊢ (#E, ε, 3684126845841)

A special case of strong LR(k) grammars are strong LR(0) grammars. For these grammars, the information about the topmost symbol of the pushdown store is sufficient for choosing the next parsing operation. Every strong LR(0) grammar has these properties:

1. the right-hand sides of all rules in the augmented grammar G' end with mutually different symbols,
2. a symbol occurring at the end of the right-hand side of a rule does not appear on the right-hand side of any other rule.
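The table of Example 2.5 together with Algorithm 2.6 can be transcribed as follows. The dictionary encoding of the table and rules is an implementation choice of this sketch; `None` stands for the empty lookahead ε:

```python
# Grammar rules from Example 2.5; primed symbols are written "E'", "T'".
RULES = {
    1: ('E', ("E'", 'T')),      2: ("E'", ('E', '+')),
    3: ("E'", ()),              4: ('T', ("T'", 'F')),
    5: ("T'", ('T', '*')),      6: ("T'", ()),
    7: ('F', ('(', 'E', ')')),  8: ('F', ('a',)),
}

# Parsing table p(X, u); None stands for the empty lookahead string.
TABLE = {('E', '+'): ('shift',), ('E', ')'): ('shift',),
         ('E', None): ('accept',), ('T', '*'): ('shift',)}
for u in ('+', ')', None):
    TABLE[('T', u)] = ('reduce', 1)
for X, r in (('F', 4), ('a', 8), (')', 7)):
    for u in ('+', '*', ')', None):
        TABLE[(X, u)] = ('reduce', r)
for X, r in (('+', 2), ('*', 5), ('(', 3), ('#', 3)):
    for u in ('a', '('):
        TABLE[(X, u)] = ('reduce', r)
for u in ('a', '('):
    TABLE[("E'", u)] = ('reduce', 6)
    TABLE[("T'", u)] = ('shift',)

def strong_lr1_parse(word):
    # Algorithm 2.6: returns the right parse as a list of rule numbers,
    # or None on error.
    stack, i, output = ['#'], 0, []
    while True:
        u = word[i] if i < len(word) else None
        entry = TABLE.get((stack[-1], u))
        if entry is None:
            return None                      # error entry
        if entry[0] == 'shift':
            stack.append(word[i]); i += 1
        elif entry[0] == 'reduce':
            lhs, rhs = RULES[entry[1]]
            if rhs:
                if tuple(stack[-len(rhs):]) != rhs:
                    return None              # pushdown top does not match
                del stack[-len(rhs):]
            stack.append(lhs)
            output.append(entry[1])
        else:                                # accept
            return output if stack == ['#', 'E'] else None
```

Running this on a+a*a reproduces the configuration sequence of Example 2.7 and yields the right parse 3 6 8 4 1 2 6 8 4 5 8 4 1.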

The above properties imply that grammar G does not contain any ε-rules and that the starting symbol S does not appear on the right-hand side of any rule in G. Moreover, the symbols occurring at the ends of the right-hand sides of rules positively identify when a reduction is to be performed and which rule to use for the reduction. If a symbol X, which occurs at the end of a rule A → αX, appears on the top of the pushdown store, then a reduction by the rule A → αX is to be performed.

Example 2.8:
Let the grammar G = ({S,A,B},{a,b,c,d},P,S) have P containing the rules:

(1) S → Aa
(2) A → bB
(3) A → Ac
(4) B → d

Grammar G is a strong LR(0) grammar. We augment the grammar with the rule (0) S' → S and construct the parsing table.

p | ε
--+-----
S | A
A | Sh
B | R(2)
a | R(1)
b | Sh
c | R(3)
d | R(4)
# | Sh

The parsing of the input string bdca is shown below.

(#, bdca, ε) ⊢ (#b, dca, ε)
⊢ (#bd, ca, ε)
⊢ (#bB, ca, 4)
⊢ (#A, ca, 42)
⊢ (#Ac, a, 42)
⊢ (#A, a, 423)
⊢ (#Aa, ε, 423)
⊢ (#S, ε, 4231)

2.2 Weak LR grammars

The strong LR parsing used only information about k symbols of the not-yet-read part of the input string and one symbol on the top of the pushdown store. In the case of weak LR grammars, such information is not enough. When choosing the next rule during parsing, the weak LR parser uses, in addition to the information used by the strong LR parser, information about the parsing history. Obviously, weak LR grammars include strong LR grammars as a proper subset. For that reason, we will omit the adjective "weak" in the following text.

During the parsing of an input string x generated by a grammar G, the bottom-up parser uses the pushdown store to keep a string that corresponds to a prefix of some rightmost sentential form occurring in the rightmost derivation of x in G. If a grammar G = (N,T,P,S) allows a derivation S ⇒^* αAw ⇒ αβw ⇒^* xw, then the rightmost sentential form αβw may be reduced using the rule A → β to the rightmost sentential form αAw. The substring β is a handle of the sentential form αβw, α,β ∈ (N ∪ T)^*, w ∈ T^*.

Definition 2.9:
Let G = (N,T,P,S) be a context-free grammar and let G' = (N ∪ {S'},T,P ∪
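For k = 0 the table is indexed by the stack-top symbol alone. A sketch of Algorithm 2.6 specialized to the table of Example 2.8, returning the right parse:

```python
# Strong LR(0) table from Example 2.8; with k = 0 the lookahead is always
# empty, so the table is indexed by the topmost pushdown symbol alone.
RULES = {1: ('S', 'Aa'), 2: ('A', 'bB'), 3: ('A', 'Ac'), 4: ('B', 'd')}
TABLE = {'S': ('accept',), 'A': ('shift',), 'B': ('reduce', 2),
         'a': ('reduce', 1), 'b': ('shift',), 'c': ('reduce', 3),
         'd': ('reduce', 4), '#': ('shift',)}

def lr0_parse(word):
    # Returns the right parse as a list of rule numbers, or None on error.
    stack, i, output = '#', 0, []
    while True:
        entry = TABLE.get(stack[-1])
        if entry is None:
            return None
        if entry[0] == 'accept':
            return output if (stack == '#S' and i == len(word)) else None
        if entry[0] == 'shift':
            if i == len(word):
                return None                  # input exhausted, cannot shift
            stack += word[i]; i += 1
        else:
            lhs, rhs = RULES[entry[1]]
            if not stack.endswith(rhs):
                return None                  # pushdown top does not match
            stack = stack[:-len(rhs)] + lhs
            output.append(entry[1])
```

On bdca this reproduces the configuration sequence above and yields the right parse 4 2 3 1.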

{S' → S},S') be the augmented grammar for G. We say that G is an LR(k) grammar for k ≥ 0 if (all derivations are rightmost):

1. S' ⇒^* αAw ⇒ αβw,
2. S' ⇒^* γBx ⇒ αβy,
3. FIRST_k(w) = FIRST_k(y)

together imply that αAy = γBx (i.e. α = γ, A = B, and x = y). In other words, it is possible to positively decide that a reduction by a rule A → β is to be performed based on the string αβ and on the lookahead string of length up to k symbols.

Definition 2.10:
Assume that S ⇒^* αAw ⇒ αβw is a rightmost derivation in a context-free grammar G = (N,T,P,S). A string γ is a viable prefix in G if it is a prefix of αβ. This means that the string γ is a prefix of some rightmost sentential form and it does not extend past the right end of the handle of that sentential form. If γ = αβ, then γ is a complete viable prefix.

In bottom-up parsing, a viable prefix appears in the pushdown store during the parsing. If the pushdown store contains a complete viable prefix, a reduction can be performed. In addition to LR(k) grammars, we will study subclasses of LR(k) grammars: LR(0) grammars, simple LR(k) grammars (SLR(k) grammars), and LALR(k) grammars. The reasons for defining such subclasses of LR(k) grammars are the following:

- the construction of a parser for the subclasses is simpler than the construction of a general LR(k) parser,
- the tables of the parser are smaller.

LR(0) grammars

LR(0) grammars are grammars that can be parsed by a deterministic bottom-up parser which uses only the parsing history to decide the next parsing operation. This means that the parser does not use any lookahead information.

Example 2.11:
Given the grammar G = ({S,A,B},{a,b},P,S), where P contains the rules:

(1) S → aAb
(2) S → aaBba
(3) A → Aa
(4) A → b
(5) B → ε

We demonstrate that deterministic parsing of strings generated by this grammar can be done using information about the parsing history only.
This is not obvious, because the symbol b, for instance, appears in G in three different places: it is the right-hand side of the rule A → b, and it is a part of the right-hand sides of the rules S → aAb and S → aaBba. The parsing of the strings abab and aaba is demonstrated in the table below. These two examples show that reductions are chosen based on a certain contents of the pushdown store. The next table

depicts the strings that will be contained in the pushdown store when a reduction is to be performed using a certain rule. Such strings are always complete viable prefixes. We can see that the example is simple, because the contents of the pushdown store is just one string (one viable prefix) for each rule.

Input | Pushdown store contents | Operation
------+-------------------------+---------------------
abab  | ε                       | shift a
bab   | a                       | shift b
ab    | ab                      | reduction A → b
ab    | aA                      | shift a
b     | aAa                     | reduction A → Aa
b     | aA                      | shift b
ε     | aAb                     | reduction S → aAb
ε     | S                       | accept
------+-------------------------+---------------------
aaba  | ε                       | shift a
aba   | a                       | shift a
ba    | aa                      | reduction B → ε
ba    | aaB                     | shift b
a     | aaBb                    | shift a
ε     | aaBba                   | reduction S → aaBba
ε     | S                       | accept

There may exist several complete viable prefixes for every rule. Moreover, there may exist an infinite number of complete viable prefixes for a rule. However, it is proven that the sets of complete viable prefixes are regular. This means that it is possible to construct a finite automaton that analyzes the viable prefixes. This automaton is called the characteristic automaton for an LR grammar, or LR automaton for short.

The usage of an LR automaton for the analysis of viable prefixes makes LR parsing simpler. When using an LR automaton, it is not necessary to traverse the pushdown store to decide which operation to use. Instead, it is sufficient to look at the state of the LR automaton. The LR automaton for G augmented by the rule S' → S is depicted in Figure 2.1.

Rule      | Pushdown store contents
----------+------------------------
S → aAb   | aAb
S → aaBba | aaBba
A → Aa    | aAa
A → b     | ab
B → ε     | aa

The automaton in Figure 2.1 was constructed such that for every complete viable prefix, there exists a sequence of transitions from the starting state to a final state. The final state corresponds to a certain reduction. Therefore, the final states are distinct and every final state is labeled by the rule that will be used for the reduction. We outline the LR(0) parsing algorithm (the exact algorithm will be stated later): Read the input string and traverse the LR automaton accordingly. Store the reached states in the pushdown store.
When a final state is reached, a reduction is performed. This means that the states that correspond to the right-hand side of the reduction rule are replaced by the nonterminal standing on the left-hand side of the reduction rule. In terms of the LR automaton, this can be thought of as a return to the state that corresponds to the situation before treating the first symbol of the right-hand side of the reduction rule. In

that state, an edge labeled by the nonterminal standing on the left-hand side of the reduction rule must exist. Using such an edge, a new automaton state is reached.

Figure 2.1: LR automaton (state diagram not reproduced in this transcription)

To simply evaluate the end of the parsing, we augment the grammar with a new starting symbol S' and a new rule S' → S. The complete viable prefix for that rule is always S. The reduction by this rule can therefore be considered the end of parsing and the acceptance of the input string. The LR automaton states can be mnemonically labeled as follows: edges labeled by a symbol X lead to a state Xi, where the subscript i is chosen so that the label Xi is unique. The starting state will be labeled #. There are three methods to construct the LR automaton:

1. By constructing the collection of sets of LR items. The goto function is the transition function of the LR automaton.
2. By constructing the LR automaton directly from the grammar.
3. By establishing a system of regular equations. The solutions of the equation system are regular expressions describing the sets of complete viable prefixes. The LR automaton can be obtained by constructing a finite automaton from these regular expressions.

The first method will be studied in detail. The second and the third ones are described in [2,8].

Definition 2.12: An LR(0) item is a rule from the grammar with a position mark on its right-hand side. We will use the symbol . (dot) to mark the position, for instance A → α.β. A set of LR(0) items contains LR(0) items which have an identical symbol before the dot mark. A set of LR(0) items describes the parsing state at the moment when a certain symbol was pushed onto the pushdown store. The symbols after the dot mark are the symbols which might be pushed onto the pushdown store when changing to a new state. There are two important kinds of LR(0) items in a set of LR(0) items:

1. LR(0) items where the dot mark is followed by a terminal symbol.
These items represent situations where a shift will be performed.
2. LR(0) items where the dot mark is placed at the end of the right-hand side. These items represent reduction states.

For every set M of LR(0) items, there exists a successor set of LR(0) items for every symbol located after the dot mark. We start from the initial set of LR(0) items when constructing the

collection of sets of LR(0) items. Afterwards, all successors of the initial set are constructed. This operation is performed repeatedly for every set of LR(0) items. A set of LR(0) items is constructed from its kernel and its closure. When constructing the closure, new LR(0) items are added to the set. These items have the dot mark before the first symbol of the right-hand side, and their left-hand side nonterminal appears just after the dot mark in some other LR(0) item belonging to the set. The construction of the collection of sets of LR(0) items is described in Algorithm 2.13.

Algorithm 2.13: Construction of the collection of sets of LR(0) items.
Input: A context-free grammar G = (N,T,P,S).
Output: A collection C of sets of LR(0) items for grammar G.
Method:
1. Prepare the augmented grammar G': G' = (N ∪ {S'}, T, P ∪ {S' → S}, S'), where S' ∉ N.
2. The initial set of LR(0) items # is constructed as follows:
   (a) # := {S' → .S}.
   (b) If A → .Bα ∈ #, B ∈ N and B → β ∈ P, then # := # ∪ {B → .β}.
   (c) Repeat step (b) until no new item can be added to the set #.
   (d) C := {#}, # is the initial set.
3. Having constructed a set of LR(0) items Mi, a new set of LR(0) items Xj will be constructed for every symbol X ∈ (N ∪ T) which appears just after the dot mark in some LR(0) item in Mi. The index j is chosen to be higher than the so-far highest index n used in a label Xn.
   (a) Xj := {A → αX.β : A → α.Xβ ∈ Mi}.
   (b) If A → α.Bβ ∈ Xj, B ∈ N and B → γ ∈ P, then Xj := Xj ∪ {B → .γ}.
   (c) Repeat step (b) until no new item can be added to Xj.
   (d) C := C ∪ {Xj}, goto(Mi,X) = Xj.
4. Repeat step 3 until no new set of LR(0) items can be added to the collection C.
Note: Steps 2(a) and 3(a) create the kernel of a set of LR(0) items. The closure is computed by repeating steps 2(b) and 3(b).

Example 2.14: Given the grammar G = ({S,A,B},{a,b,c},P,S), where P contains the rules below, evaluate the collection of sets of LR(0) items.

(1) S → B
(2) B → aBb
(3) B → A
(4) A → bA
(5) A → c

The collection of sets of LR(0) items is:

# = {S' → .S, S → .B, B → .aBb, B → .A, A → .bA, A → .c}
S = {S' → S.}
B1 = {S → B.}
A1 = {B → A.}
a = {B → a.Bb, B → .aBb, B → .A, A → .bA, A → .c}
b1 = {A → b.A, A → .bA, A → .c}
c = {A → c.}
B2 = {B → aB.b}
A2 = {A → bA.}
b2 = {B → aBb.}

Now, we are ready to introduce the goto function defined over the sets of LR(0) items for grammar G.

Definition 2.15: Function goto(Mi,X) = Xj, if items of the form A → α.Xβ are in the set Mi and the kernel of Xj was formed by the items of the form A → αX.β.

The goto function can be represented as an oriented graph with labeled nodes and edges. The transition goto(Mi,X) = Xj corresponds to an edge labeled X leading from node Mi to node Xj, as depicted in Figure 2.2. The goto function is the transition mapping of the LR automaton.

Figure 2.2: goto function as a graph (an edge labeled X from node Mi to node Xj)

Example 2.16: Let us construct the goto function for the collection of sets of LR(0) items from Example 2.14. The function is depicted in Figure 2.3.

Figure 2.3: LR automaton from Example 2.16 (state diagram not reproduced in this transcription)
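Algorithm 2.13 can be sketched in a few lines of code. The following fragment is a minimal illustrative sketch, not the textbook's own tooling: items are represented as triples (left-hand side, right-hand side, dot position), sets of LR(0) items as frozensets, and all names are assumptions of this sketch. It is run on the grammar of Example 2.14.

```python
# A sketch of Algorithm 2.13: closure, goto, and the collection of sets
# of LR(0) items, for the grammar of Example 2.14.

def closure(items, rules):
    """Steps 2(b)/3(b): add B -> .beta for every nonterminal B after a dot."""
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(result):
            if dot < len(rhs):
                for (l, r) in rules:
                    if l == rhs[dot] and (l, r, 0) not in result:
                        result.add((l, r, 0))
                        changed = True
    return frozenset(result)

def goto(items, x, rules):
    """Step 3(a): kernel of the successor set for symbol x, then its closure."""
    kernel = {(l, r, d + 1) for (l, r, d) in items if d < len(r) and r[d] == x}
    return closure(kernel, rules) if kernel else None

def collection(rules, start):
    """Steps 2 and 4: build the collection C and the goto transitions."""
    symbols = {s for (_, r) in rules for s in r} | {start}
    initial = closure({("S'", (start,), 0)}, rules)
    c, transitions, work = {initial}, {}, [initial]
    while work:
        m = work.pop()
        for x in symbols:
            succ = goto(m, x, rules)
            if succ is not None:
                transitions[(m, x)] = succ
                if succ not in c:
                    c.add(succ)
                    work.append(succ)
    return c, transitions, initial

# Grammar from Example 2.14: S -> B, B -> aBb, B -> A, A -> bA, A -> c.
rules = [("S", ("B",)), ("B", ("a", "B", "b")), ("B", ("A",)),
         ("A", ("b", "A")), ("A", ("c",))]
c, transitions, initial = collection(rules, "S")
print(len(c))  # -> 10, matching the sets #, S, B1, A1, a, b1, c, B2, A2, b2
```

Note that sets with identical items are merged here because equal frozensets compare equal; this matches the example above, where a single state b1 is reachable both from # and from a.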

Definition 2.17: A context-free grammar G = (N,T,P,S) is an LR(0) grammar if the following holds: if a set M from the collection of sets of LR(0) items for the grammar G contains an item of the form A → α., then the set M does not contain any other item of the form B → β. or B → β.γ, where γ starts with a terminal.

Note: The above definition is equivalent to Definition 2.9 for the case of k = 0.

We will now present how to construct a parsing table based on a collection of sets of LR(0) items.

Algorithm 2.18: Construction of a parsing table for LR(0) grammars.
Input: A collection C of sets of LR(0) items for the augmented grammar G' = (N,T,P,S').
Output: Parsing table p for grammar G.
Method: The parsing table p will have rows labeled by the symbols that correspond to the sets from C. For all Mi ∈ C do:
1. p(Mi) = accept, if S' → S. ∈ Mi,
2. p(Mi) = reduce(j), if A → β. ∈ Mi and A → β is the j-th rule from P, where A ≠ S' and β ≠ S,
3. p(Mi) = shift in all other cases.

Example 2.19: Given the grammar G' = ({S',S,A,B},{a,b,0,1},P,S'), where P contains the rules:

(0) S' → S
(1) S → A
(2) S → B
(3) A → aAb
(4) A → 0
(5) B → aBbb
(6) B → 1

This grammar is not an LL(k) grammar for any k. We will demonstrate that it is an LR(0) grammar. First, we will construct the collection of sets of LR(0) items for G' using Algorithm 2.13.

# = {S → .A, S → .B, A → .aAb, A → .0, B → .aBbb, B → .1, S' → .S}
a = {A → a.Ab, B → a.Bbb, A → .aAb, A → .0, B → .aBbb, B → .1}
A1 = {S → A.}
B1 = {S → B.}
0 = {A → 0.}
1 = {B → 1.}
A2 = {A → aA.b}
B2 = {B → aB.bb}
b1 = {A → aAb.}
b2 = {B → aBb.b}
b3 = {B → aBbb.}
S = {S' → S.}

From the structure of the sets of LR(0) items, it is clear that the grammar is LR(0). We will construct the parsing table and the LR automaton. The automaton is depicted in Figure 2.4.
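The case analysis of Algorithm 2.18 can be mirrored directly in code. The following fragment is an illustrative sketch under stated assumptions: the completed items (dot at the end of the right-hand side) of each set of Example 2.19 are transcribed by hand from the listing above, since only they decide the table entry; the names are this sketch's, not the textbook's.

```python
# A sketch of Algorithm 2.18 applied to the item sets of Example 2.19.
# Rule numbering follows the example; rule (0) S' -> S is excluded from
# the reductions, exactly as step 2 of the algorithm requires.

rules = {1: ("S", "A"), 2: ("S", "B"), 3: ("A", "aAb"),
         4: ("A", "0"), 5: ("B", "aBbb"), 6: ("B", "1")}

# For every set of LR(0) items, only its completed items are recorded here.
completed = {
    "#": [], "a": [], "A2": [], "B2": [], "b2": [],
    "A1": [("S", "A")], "B1": [("S", "B")], "0": [("A", "0")],
    "1": [("B", "1")], "b1": [("A", "aAb")], "b3": [("B", "aBbb")],
    "S": [("S'", "S")],
}

def table_entry(m):
    # Step 1: accept on the completed augmented rule S' -> S.
    if ("S'", "S") in completed[m]:
        return "accept"
    # Step 2: reduce(j) for a completed j-th rule.
    for j, rule in rules.items():
        if rule in completed[m]:
            return "reduce(%d)" % j
    # Step 3: shift in all other cases.
    return "shift"

for m in completed:
    print(m, table_entry(m))
```

The printed entries reproduce the parsing table shown below for Example 2.19, e.g. reduce(3) for b1 and accept for S.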

    p     action
    #     shift
    A1    reduce(1)
    B1    reduce(2)
    a     shift
    0     reduce(4)
    1     reduce(6)
    A2    shift
    B2    shift
    b1    reduce(3)
    b2    shift
    b3    reduce(5)
    S     accept

Figure 2.4: LR automaton from Example 2.19 (state diagram not reproduced in this transcription)

The parsing algorithm for LR(0) grammars is similar to the parsing algorithm for strong LR(0) grammars. The difference is that the states of the LR automaton will be stored in the pushdown store instead of the grammar symbols.

Algorithm 2.20: Parsing algorithm for LR(0) grammars.
Input: Parsing table p and LR automaton for grammar G, input string w ∈ T*, and the initial symbol stored in the pushdown store (the label of the initial set of LR(0) items; # is the conventional name used in this textbook).
Output: Right parse of w in case w ∈ L(G), an error signaling otherwise.
Method: The algorithm reads symbols from the input string w, uses the pushdown store, and creates a sequence of the numbers of the rules used for reductions. The initial pushdown store contents is #. The algorithm repeats steps 1 to 7 until it either accepts the string or detects an error. In the steps below, let X denote the topmost symbol in the pushdown store.

1. If p(X) = shift, read one symbol and continue with step (5).
2. If p(X) = reduce(i), find the i-th rule in P, let it be A → α. Pop |α| symbols from the pushdown store and append the rule number (i) to the output. Continue with step (5).
3. If p(X) = accept and the entire input string was read, the parsing terminates, the input string w is accepted, and the output string is the right parse of w. If the entire input string was not read, the parsing ends with an error signaling.
4. If p(X) = error, the parsing ends with an error signaling.
5. Let Y be the symbol that is to be pushed onto the pushdown store (Y is either the input symbol read in step (1) or the left-hand side of the rule used in step (2)) and let X be the symbol on the top of the pushdown store (note that step (2) could have removed a certain number of symbols from the pushdown store).
6. If goto(X,Y) = Z, then push Z onto the pushdown store and continue with step (1).
7. If goto(X,Y) is not defined, the parsing ends with an error signaling.

Example 2.21: We will demonstrate LR(0) parsing for the input string aa0bb using the parsing table and the LR automaton from Example 2.19.

(#, aa0bb, ε) ⊢ (#a, a0bb, ε) ⊢ (#aa, 0bb, ε) ⊢ (#aa0, bb, ε) ⊢ (#aaA2, bb, 4) ⊢ (#aaA2b1, b, 4) ⊢ (#aA2, b, 43) ⊢ (#aA2b1, ε, 43) ⊢ (#A1, ε, 433) ⊢ (#S, ε, 4331)

Simple LR(k) grammars

In the previous section, the construction of the LR(0) parser was studied. The condition for an LR(0) grammar is fulfilled only by a small subset of grammars. Very often, there is a set of LR(0) items that contains an item of the form A → α., which represents a reduction, as well as another item of the form B → β. representing some other reduction, or an item of the form C → γ.aδ representing a shift. Such a grammar is not an LR(0) one. We will demonstrate it in the next example.

Example 2.22: Given the grammar G' = ({E',E,T,F},{+,*,(,),a},P,E'), where the set of rules P contains:

(0) E' → E
(1) E → E + T
(2) E → T
(3) T → T * F
(4) T → F
(5) F → (E)
(6) F → a

The collection of sets of LR(0) items for G' contains the following sets:

# = {E' → .E, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .a}
E1 = {E' → E., E → E.+T}
T1 = {E → T., T → T.*F}
F1 = {T → F.}
a = {F → a.}
+ = {E → E+.T, T → .T*F, T → .F, F → .(E), F → .a}
* = {T → T*.F, F → .(E), F → .a}
( = {F → (.E), E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .a}
E2 = {F → (E.), E → E.+T}
T2 = {E → E+T., T → T.*F}
F2 = {T → T*F.}
) = {F → (E).}

The grammar G' is not an LR(0) one. For instance, in the set E1, there are two items E' → E. and E → E.+T. There does not exist a way to decide whether to accept the input string (reduce by E' → E) or to shift the symbol +. This situation is named a shift-reduce conflict. If there are two different items of the form A → α. and B → β. in one set, we name it a reduce-reduce conflict.

The conflicts in sets of LR(0) items can sometimes be removed. The decision about the operation to be performed in the conflicting situation can be made based on the symbols that appear at the beginning of the not-yet-read part of the input string. If this idea leads to the removal of the conflicts, we say that the grammar is a simple LR(k) grammar (also an SLR(k) grammar). The k ≥ 1 denotes the length of the prefix of the not-yet-read part of the input string that needs to be scanned to remove the conflicts in the sets of LR(0) items.

Definition 2.23: A context-free grammar G = (N,T,P,S) is a simple LR(k) (SLR(k)) grammar if the following holds: Let C be the collection of sets of LR(0) items for G and let A → α.β and B → γ.δ be two different LR(0) items in an arbitrary set of LR(0) items in C. Then any such pair of items must match at least one of the following conditions:

1. either β ∈ N(N ∪ T)* or δ ∈ N(N ∪ T)*,
2. neither β nor δ is the empty string,
3. β ≠ ε, δ = ε and FOLLOWk(B) ∩ FIRSTk(β FOLLOWk(A)) = ∅,
4. β = ε, δ ≠ ε and FOLLOWk(A) ∩ FIRSTk(δ FOLLOWk(B)) = ∅,
5. β = δ = ε and FOLLOWk(A) ∩ FOLLOWk(B) = ∅.

The notation β ∈ N(N ∪ T)* means that the string β starts with a nonterminal symbol. The parsing table p for SLR(k) grammars can be constructed using the following algorithm:

Algorithm 2.24: The construction of the parsing table p for an SLR(k) grammar.
Input: An SLR(k) grammar G = (N,T,P,S) and the collection C of sets of LR(0) items for grammar G.
Output: The parsing table p for grammar G.
Method: The parsing table p will have rows labeled by the names that correspond to the names of the sets from C. The columns will be labeled by strings from T^{*k} (strings of terminals of length at most k).

27 1. p(m i,u) = shift, if A β 1.β 2 M i, β 2 T(N T), and u FIRST k (β 2 FOLLOW k (A)). 2. p(m i,u) = reduce(j), if j 1, A β. M i, A β is j-th rule in P, and u FOLLOW k (A). 3. p(m i,ε) = accept, if S S. M i. 4. p(m i,u) = error in all other cases. We will now show how to use the above algorithm to construct parsing table for grammar G from Example Example 2.25: First, we will construct the goto function. Function goto(a, X) is defined for X N T. For the collection of sets of LR(0) items from Example 2.22, the goto function is depicted as an LR automaton in Figure 2.5. START # E F T a E 1 F 1 ( + F F T 1 a a + ( T + E E 2 * a T 2 ) * ) ( T a * F F 2 ( ( Figure 2.5: LR automaton for grammar from Example 2.22 Using Algorithm 2.24, we will construct parsing table F for grammar from Example The table contains the following abbreviations: Sh shift, R i reduce(i), and A accept. Error entries are left blank. 20

    p     a     +     *     (     )     ε
    #     Sh                Sh
    E1          Sh                      A
    T1          R2    Sh          R2    R2
    F1          R4    R4          R4    R4
    a           R6    R6          R6    R6
    (     Sh                Sh
    +     Sh                Sh
    *     Sh                Sh
    E2          Sh                Sh
    T2          R1    Sh          R1    R1
    F2          R3    R3          R3    R3
    )           R5    R5          R5    R5

Now, we have to present a modified parsing algorithm, since the parsing algorithm for LR(0) grammars cannot be used directly. The reason is the fact that the parsing table p is two-dimensional in the case of SLR(k) grammars.

Algorithm 2.26: Parsing algorithm for SLR(k) grammars. The parsing algorithm is suitable for LALR(k) and LR(k) grammars as well. (The latter two classes of grammars will be presented in the next sections.)
Input: Parsing table p and LR automaton for grammar G = (N,T,P,S), input string w ∈ T*, and the initial symbol stored in the pushdown store (the label of the initial set of LR(0) items; # is the conventional name in this textbook).
Output: Right parse of w in case w ∈ L(G), an error signaling otherwise.
Method: The algorithm reads symbols from the input string w, uses the pushdown store, and creates a sequence of the numbers of the rules used for reductions. The initial pushdown store contents is #. The algorithm repeats steps 1 to 8 until it either accepts the string or detects an error. In the steps below, let X denote the topmost symbol in the pushdown store.
1. Evaluate the first k symbols of the not-yet-read part of the input string. Let it be the string u.
2. If p(X,u) = shift, read one input symbol and proceed with step (6).
3. If p(X,u) = reduce(i), find the i-th rule in P, let it be A → α. Pop |α| symbols from the pushdown store and append the rule number (i) to the output. Continue with step (6).
4. If p(X,u) = accept (i.e. u = ε and therefore the entire input string was read), the parsing terminates, the input string w is accepted, and the output string is the right parse of w.
5. If p(X,u) = error, the parsing ends with an error signaling.
6.
Let Y be the symbol that is to be pushed onto the pushdown store (Y is either the input symbol read in step (2) or the left-hand side of the rule used in step (3)) and let X be the symbol on the top of the pushdown store (note that step (3) could have removed a certain number of symbols from the pushdown store).
7. If goto(X,Y) = Z, then push Z onto the pushdown store and continue with step (1).
8. If goto(X,Y) is not defined, the parsing ends with an error signaling.

Note: When reducing in step (3), the contents of the pushdown store can simply be removed without further examination. This is different from the case of strong LR parsers (see Algorithm 2.6).

Example 2.27: We will demonstrate the parsing for grammar G' from Example 2.22. The input string will be a+a*a. We will use the parsing table and the LR automaton from Example 2.25.
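Algorithm 2.26 with k = 1 can be sketched as a small table-driven driver. The following fragment is a minimal illustrative sketch under stated assumptions: the table p and the goto function are transcribed by hand from Example 2.25, terminals are single characters, and the empty string "" plays the role of the lookahead ε.

```python
# A sketch of Algorithm 2.26 (k = 1) for the SLR(1) grammar of Example 2.22.

rules = {1: ("E", "E+T"), 2: ("E", "T"), 3: ("T", "T*F"),
         4: ("T", "F"), 5: ("F", "(E)"), 6: ("F", "a")}

# Parsing table p(X, u): missing entries mean error; integers mean reduce(i).
Sh, A = "shift", "accept"
p = {"#":  {"a": Sh, "(": Sh},
     "E1": {"+": Sh, "": A},
     "T1": {"+": 2, "*": Sh, ")": 2, "": 2},
     "F1": {"+": 4, "*": 4, ")": 4, "": 4},
     "a":  {"+": 6, "*": 6, ")": 6, "": 6},
     "(":  {"a": Sh, "(": Sh},
     "+":  {"a": Sh, "(": Sh},
     "*":  {"a": Sh, "(": Sh},
     "E2": {"+": Sh, ")": Sh},
     "T2": {"+": 1, "*": Sh, ")": 1, "": 1},
     "F2": {"+": 3, "*": 3, ")": 3, "": 3},
     ")":  {"+": 5, "*": 5, ")": 5, "": 5}}

goto = {("#", "a"): "a", ("#", "("): "(", ("#", "E"): "E1",
        ("#", "T"): "T1", ("#", "F"): "F1",
        ("E1", "+"): "+", ("T1", "*"): "*", ("T2", "*"): "*",
        ("+", "a"): "a", ("+", "("): "(", ("+", "T"): "T2", ("+", "F"): "F1",
        ("*", "a"): "a", ("*", "("): "(", ("*", "F"): "F2",
        ("(", "a"): "a", ("(", "("): "(", ("(", "E"): "E2",
        ("(", "T"): "T1", ("(", "F"): "F1",
        ("E2", ")"): ")", ("E2", "+"): "+"}

def parse(w):
    stack, i, output = ["#"], 0, []
    while True:
        x = stack[-1]
        u = w[i] if i < len(w) else ""          # step 1: k = 1 lookahead
        action = p.get(x, {}).get(u)
        if action == Sh:                        # step 2: shift
            y = w[i]; i += 1
        elif action == A:                       # step 4: accept
            return "".join(output)
        elif action in rules:                   # step 3: reduce(i)
            lhs, rhs = rules[action]
            del stack[len(stack) - len(rhs):]   # pop |rhs| symbols
            output.append(str(action))
            y = lhs
        else:                                   # step 5: error
            raise SyntaxError("error at position %d" % i)
        x = stack[-1]                           # step 6
        if (x, y) not in goto:                  # step 8
            raise SyntaxError("goto undefined")
        stack.append(goto[(x, y)])              # step 7

print(parse("a+a*a"))  # -> 64264631, the right parse of a+a*a
```

Running the driver on a+a*a performs the reductions 6, 4, 2, 6, 4, 6, 3, 1, i.e. it produces the right parse 64264631.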


More information

CMSC 330: Organization of Programming Languages. Theory of Regular Expressions Finite Automata

CMSC 330: Organization of Programming Languages. Theory of Regular Expressions Finite Automata : Organization of Programming Languages Theory of Regular Expressions Finite Automata Previous Course Review {s s defined} means the set of string s such that s is chosen or defined as given s A means

More information

Pushdown Automata. We have seen examples of context-free languages that are not regular, and hence can not be recognized by finite automata.

Pushdown Automata. We have seen examples of context-free languages that are not regular, and hence can not be recognized by finite automata. Pushdown Automata We have seen examples of context-free languages that are not regular, and hence can not be recognized by finite automata. Next we consider a more powerful computation model, called a

More information

Uses of finite automata

Uses of finite automata Chapter 2 :Finite Automata 2.1 Finite Automata Automata are computational devices to solve language recognition problems. Language recognition problem is to determine whether a word belongs to a language.

More information

Compiler Design 1. LR Parsing. Goutam Biswas. Lect 7

Compiler Design 1. LR Parsing. Goutam Biswas. Lect 7 Compiler Design 1 LR Parsing Compiler Design 2 LR(0) Parsing An LR(0) parser can take shift-reduce decisions entirely on the basis of the states of LR(0) automaton a of the grammar. Consider the following

More information

Finite Automata and Regular Languages

Finite Automata and Regular Languages Finite Automata and Regular Languages Topics to be covered in Chapters 1-4 include: deterministic vs. nondeterministic FA, regular expressions, one-way vs. two-way FA, minimization, pumping lemma for regular

More information

String Suffix Automata and Subtree Pushdown Automata

String Suffix Automata and Subtree Pushdown Automata String Suffix Automata and Subtree Pushdown Automata Jan Janoušek Department of Computer Science Faculty of Information Technologies Czech Technical University in Prague Zikova 1905/4, 166 36 Prague 6,

More information

5 Context-Free Languages

5 Context-Free Languages CA320: COMPUTABILITY AND COMPLEXITY 1 5 Context-Free Languages 5.1 Context-Free Grammars Context-Free Grammars Context-free languages are specified with a context-free grammar (CFG). Formally, a CFG G

More information

Bottom-Up Syntax Analysis

Bottom-Up Syntax Analysis Bottom-Up Syntax Analysis Wilhelm/Seidl/Hack: Compiler Design Syntactic and Semantic Analysis, Chapter 3 Reinhard Wilhelm Universität des Saarlandes wilhelm@cs.uni-saarland.de and Mooly Sagiv Tel Aviv

More information

GEETANJALI INSTITUTE OF TECHNICAL STUDIES, UDAIPUR I

GEETANJALI INSTITUTE OF TECHNICAL STUDIES, UDAIPUR I GEETANJALI INSTITUTE OF TECHNICAL STUDIES, UDAIPUR I Internal Examination 2017-18 B.Tech III Year VI Semester Sub: Theory of Computation (6CS3A) Time: 1 Hour 30 min. Max Marks: 40 Note: Attempt all three

More information

Theory of Computation (Classroom Practice Booklet Solutions)

Theory of Computation (Classroom Practice Booklet Solutions) Theory of Computation (Classroom Practice Booklet Solutions) 1. Finite Automata & Regular Sets 01. Ans: (a) & (c) Sol: (a) The reversal of a regular set is regular as the reversal of a regular expression

More information

Bottom-Up Syntax Analysis

Bottom-Up Syntax Analysis Bottom-Up Syntax Analysis Wilhelm/Seidl/Hack: Compiler Design Syntactic and Semantic Analysis Reinhard Wilhelm Universität des Saarlandes wilhelm@cs.uni-saarland.de and Mooly Sagiv Tel Aviv University

More information

Theory of Computation - Module 3

Theory of Computation - Module 3 Theory of Computation - Module 3 Syllabus Context Free Grammar Simplification of CFG- Normal forms-chomsky Normal form and Greibach Normal formpumping lemma for Context free languages- Applications of

More information

Context Free Languages (CFL) Language Recognizer A device that accepts valid strings. The FA are formalized types of language recognizer.

Context Free Languages (CFL) Language Recognizer A device that accepts valid strings. The FA are formalized types of language recognizer. Context Free Languages (CFL) Language Recognizer A device that accepts valid strings. The FA are formalized types of language recognizer. Language Generator: Context free grammars are language generators,

More information

n Top-down parsing vs. bottom-up parsing n Top-down parsing n Introduction n A top-down depth-first parser (with backtracking)

n Top-down parsing vs. bottom-up parsing n Top-down parsing n Introduction n A top-down depth-first parser (with backtracking) Announcements n Quiz 1 n Hold on to paper, bring over at the end n HW1 due today n HW2 will be posted tonight n Due Tue, Sep 18 at 2pm in Submitty! n Team assignment. Form teams in Submitty! n Top-down

More information

NODIA AND COMPANY. GATE SOLVED PAPER Computer Science Engineering Theory of Computation. Copyright By NODIA & COMPANY

NODIA AND COMPANY. GATE SOLVED PAPER Computer Science Engineering Theory of Computation. Copyright By NODIA & COMPANY No part of this publication may be reproduced or distributed in any form or any means, electronic, mechanical, photocopying, or otherwise without the prior permission of the author. GATE SOLVED PAPER Computer

More information

Chapter Five: Nondeterministic Finite Automata

Chapter Five: Nondeterministic Finite Automata Chapter Five: Nondeterministic Finite Automata From DFA to NFA A DFA has exactly one transition from every state on every symbol in the alphabet. By relaxing this requirement we get a related but more

More information

Computational Models - Lecture 5 1

Computational Models - Lecture 5 1 Computational Models - Lecture 5 1 Handout Mode Iftach Haitner and Yishay Mansour. Tel Aviv University. April 10/22, 2013 1 Based on frames by Benny Chor, Tel Aviv University, modifying frames by Maurice

More information

Parsing Algorithms. CS 4447/CS Stephen Watt University of Western Ontario

Parsing Algorithms. CS 4447/CS Stephen Watt University of Western Ontario Parsing Algorithms CS 4447/CS 9545 -- Stephen Watt University of Western Ontario The Big Picture Develop parsers based on grammars Figure out properties of the grammars Make tables that drive parsing engines

More information

Languages, regular languages, finite automata

Languages, regular languages, finite automata Notes on Computer Theory Last updated: January, 2018 Languages, regular languages, finite automata Content largely taken from Richards [1] and Sipser [2] 1 Languages An alphabet is a finite set of characters,

More information

(NB. Pages are intended for those who need repeated study in formal languages) Length of a string. Formal languages. Substrings: Prefix, suffix.

(NB. Pages are intended for those who need repeated study in formal languages) Length of a string. Formal languages. Substrings: Prefix, suffix. (NB. Pages 22-40 are intended for those who need repeated study in formal languages) Length of a string Number of symbols in the string. Formal languages Basic concepts for symbols, strings and languages:

More information

Theory of Computation

Theory of Computation Thomas Zeugmann Hokkaido University Laboratory for Algorithmics http://www-alg.ist.hokudai.ac.jp/ thomas/toc/ Lecture 3: Finite State Automata Motivation In the previous lecture we learned how to formalize

More information

CS20a: summary (Oct 24, 2002)

CS20a: summary (Oct 24, 2002) CS20a: summary (Oct 24, 2002) Context-free languages Grammars G = (V, T, P, S) Pushdown automata N-PDA = CFG D-PDA < CFG Today What languages are context-free? Pumping lemma (similar to pumping lemma for

More information

Theory of computation: initial remarks (Chapter 11)

Theory of computation: initial remarks (Chapter 11) Theory of computation: initial remarks (Chapter 11) For many purposes, computation is elegantly modeled with simple mathematical objects: Turing machines, finite automata, pushdown automata, and such.

More information

Syntax Analysis (Part 2)

Syntax Analysis (Part 2) Syntax Analysis (Part 2) Martin Sulzmann Martin Sulzmann Syntax Analysis (Part 2) 1 / 42 Bottom-Up Parsing Idea Build right-most derivation. Scan input and seek for matching right hand sides. Terminology

More information

Shift-Reduce parser E + (E + (E) E [a-z] In each stage, we shift a symbol from the input to the stack, or reduce according to one of the rules.

Shift-Reduce parser E + (E + (E) E [a-z] In each stage, we shift a symbol from the input to the stack, or reduce according to one of the rules. Bottom-up Parsing Bottom-up Parsing Until now we started with the starting nonterminal S and tried to derive the input from it. In a way, this isn t the natural thing to do. It s much more logical to start

More information

Administrivia. Test I during class on 10 March. Bottom-Up Parsing. Lecture An Introductory Example

Administrivia. Test I during class on 10 March. Bottom-Up Parsing. Lecture An Introductory Example Administrivia Test I during class on 10 March. Bottom-Up Parsing Lecture 11-12 From slides by G. Necula & R. Bodik) 2/20/08 Prof. Hilfinger CS14 Lecture 11 1 2/20/08 Prof. Hilfinger CS14 Lecture 11 2 Bottom-Up

More information

Context-free Grammars and Languages

Context-free Grammars and Languages Context-free Grammars and Languages COMP 455 002, Spring 2019 Jim Anderson (modified by Nathan Otterness) 1 Context-free Grammars Context-free grammars provide another way to specify languages. Example:

More information

Theory of Computation (IV) Yijia Chen Fudan University

Theory of Computation (IV) Yijia Chen Fudan University Theory of Computation (IV) Yijia Chen Fudan University Review language regular context-free machine DFA/ NFA PDA syntax regular expression context-free grammar Pushdown automata Definition A pushdown automaton

More information

3515ICT: Theory of Computation. Regular languages

3515ICT: Theory of Computation. Regular languages 3515ICT: Theory of Computation Regular languages Notation and concepts concerning alphabets, strings and languages, and identification of languages with problems (H, 1.5). Regular expressions (H, 3.1,

More information

CSE 105 Homework 1 Due: Monday October 9, Instructions. should be on each page of the submission.

CSE 105 Homework 1 Due: Monday October 9, Instructions. should be on each page of the submission. CSE 5 Homework Due: Monday October 9, 7 Instructions Upload a single file to Gradescope for each group. should be on each page of the submission. All group members names and PIDs Your assignments in this

More information

CS Pushdown Automata

CS Pushdown Automata Chap. 6 Pushdown Automata 6.1 Definition of Pushdown Automata Example 6.2 L ww R = {ww R w (0+1) * } Palindromes over {0, 1}. A cfg P 0 1 0P0 1P1. Consider a FA with a stack(= a Pushdown automaton; PDA).

More information

Context free languages

Context free languages Context free languages Syntatic parsers and parse trees E! E! *! E! (! E! )! E + E! id! id! id! 2 Context Free Grammars The CF grammar production rules have the following structure X α being X N and α

More information

CS Rewriting System - grammars, fa, and PDA

CS Rewriting System - grammars, fa, and PDA Restricted version of PDA If (p, γ) δ(q, a, X), a Σ {ε}, p. q Q, X Γ. restrict γ Γ in three ways: i) if γ = YX, (q, ay, Xβ) (p, y, YXβ) push Y Γ, ii) if γ = X, (q, ay, Xβ) (p, y, Xβ) no change on stack,

More information

UNIT-I. Strings, Alphabets, Language and Operations

UNIT-I. Strings, Alphabets, Language and Operations UNIT-I Strings, Alphabets, Language and Operations Strings of characters are fundamental building blocks in computer science. Alphabet is defined as a non empty finite set or nonempty set of symbols. The

More information

Nondeterministic Finite Automata

Nondeterministic Finite Automata Nondeterministic Finite Automata Not A DFA Does not have exactly one transition from every state on every symbol: Two transitions from q 0 on a No transition from q 1 (on either a or b) Though not a DFA,

More information

1. Draw a parse tree for the following derivation: S C A C C A b b b b A b b b b B b b b b a A a a b b b b a b a a b b 2. Show on your parse tree u,

1. Draw a parse tree for the following derivation: S C A C C A b b b b A b b b b B b b b b a A a a b b b b a b a a b b 2. Show on your parse tree u, 1. Draw a parse tree for the following derivation: S C A C C A b b b b A b b b b B b b b b a A a a b b b b a b a a b b 2. Show on your parse tree u, v, x, y, z as per the pumping theorem. 3. Prove that

More information

Chapter 3. Regular grammars

Chapter 3. Regular grammars Chapter 3 Regular grammars 59 3.1 Introduction Other view of the concept of language: not the formalization of the notion of effective procedure, but set of words satisfying a given set of rules Origin

More information

Chapter 0 Introduction. Fourth Academic Year/ Elective Course Electrical Engineering Department College of Engineering University of Salahaddin

Chapter 0 Introduction. Fourth Academic Year/ Elective Course Electrical Engineering Department College of Engineering University of Salahaddin Chapter 0 Introduction Fourth Academic Year/ Elective Course Electrical Engineering Department College of Engineering University of Salahaddin October 2014 Automata Theory 2 of 22 Automata theory deals

More information

Closure under the Regular Operations

Closure under the Regular Operations Closure under the Regular Operations Application of NFA Now we use the NFA to show that collection of regular languages is closed under regular operations union, concatenation, and star Earlier we have

More information

cse303 ELEMENTS OF THE THEORY OF COMPUTATION Professor Anita Wasilewska

cse303 ELEMENTS OF THE THEORY OF COMPUTATION Professor Anita Wasilewska cse303 ELEMENTS OF THE THEORY OF COMPUTATION Professor Anita Wasilewska LECTURE 14 SMALL REVIEW FOR FINAL SOME Y/N QUESTIONS Q1 Given Σ =, there is L over Σ Yes: = {e} and L = {e} Σ Q2 There are uncountably

More information

EXAM. Please read all instructions, including these, carefully NAME : Problem Max points Points 1 10 TOTAL 100

EXAM. Please read all instructions, including these, carefully NAME : Problem Max points Points 1 10 TOTAL 100 EXAM Please read all instructions, including these, carefully There are 7 questions on the exam, with multiple parts. You have 3 hours to work on the exam. The exam is open book, open notes. Please write

More information

CPS 220 Theory of Computation Pushdown Automata (PDA)

CPS 220 Theory of Computation Pushdown Automata (PDA) CPS 220 Theory of Computation Pushdown Automata (PDA) Nondeterministic Finite Automaton with some extra memory Memory is called the stack, accessed in a very restricted way: in a First-In First-Out fashion

More information

Theory of Computer Science

Theory of Computer Science Theory of Computer Science C1. Formal Languages and Grammars Malte Helmert University of Basel March 14, 2016 Introduction Example: Propositional Formulas from the logic part: Definition (Syntax of Propositional

More information

컴파일러입문 제 3 장 정규언어

컴파일러입문 제 3 장 정규언어 컴파일러입문 제 3 장 정규언어 목차 3.1 정규문법과정규언어 3.2 정규표현 3.3 유한오토마타 3.4 정규언어의속성 Regular Language Page 2 정규문법과정규언어 A study of the theory of regular languages is often justified by the fact that they model the lexical

More information

Theory of computation: initial remarks (Chapter 11)

Theory of computation: initial remarks (Chapter 11) Theory of computation: initial remarks (Chapter 11) For many purposes, computation is elegantly modeled with simple mathematical objects: Turing machines, finite automata, pushdown automata, and such.

More information

Lecture Notes On THEORY OF COMPUTATION MODULE -1 UNIT - 2

Lecture Notes On THEORY OF COMPUTATION MODULE -1 UNIT - 2 BIJU PATNAIK UNIVERSITY OF TECHNOLOGY, ODISHA Lecture Notes On THEORY OF COMPUTATION MODULE -1 UNIT - 2 Prepared by, Dr. Subhendu Kumar Rath, BPUT, Odisha. UNIT 2 Structure NON-DETERMINISTIC FINITE AUTOMATA

More information

cse303 ELEMENTS OF THE THEORY OF COMPUTATION Professor Anita Wasilewska

cse303 ELEMENTS OF THE THEORY OF COMPUTATION Professor Anita Wasilewska cse303 ELEMENTS OF THE THEORY OF COMPUTATION Professor Anita Wasilewska LECTURE 11 CHAPTER 3 CONTEXT-FREE LANGUAGES 1. Context Free Grammars 2. Pushdown Automata 3. Pushdown automata and context -free

More information

Compiler Design Spring 2017

Compiler Design Spring 2017 Compiler Design Spring 2017 3.4 Bottom-up parsing Dr. Zoltán Majó Compiler Group Java HotSpot Virtual Machine Oracle Corporation 1 Bottom up parsing Goal: Obtain rightmost derivation in reverse w S Reduce

More information