Syntax Analysis (Part 2)
Martin Sulzmann

Bottom-Up Parsing

Idea
  Build a right-most derivation in reverse.
  Scan the input and look for matching right-hand sides.

Terminology
  If S ⇒* β then β is a sentential form of G.
  Right sentential form = sentential form of a right-most derivation, i.e. one in which the right-most non-terminal symbol is replaced in each step.
  Handle = right-hand side α of a production A → α where α is part of a right sentential form β with S ⇒* β, and reducing α to A yields the previous right sentential form.

Example

Consider
  S → a A B e     (1)
  A → A b c       (2)
    | b           (3)
  B → d           (4)

The sentence abbcde can be reduced to S:

  Rule   Sentential form   Handle
  3      a b b c d e       b (the first b)
  2      a A b c d e       A b c
  4      a A d e           d
  1      a A B e           a A B e
         S

The Handle column lists the handle of each sentential form.

  S ⇒ aABe ⇒ aAde ⇒ aAbcde ⇒ abbcde     (right-most derivation)

Another Example

Consider
  E → E + E | E * E | (E) | id

Right-most derivation:
  E ⇒ E + E ⇒ E + E * E ⇒ E + E * id3 ⇒ E + id2 * id3 ⇒ id1 + id2 * id3

Consider how the input string is reduced:

  Right-sentential form   Handle   Reducing production
  id1 + id2 * id3         id1      E → id
  E + id2 * id3           id2      E → id
  E + E * id3             id3      E → id
  E + E * E               E * E    E → E * E
  E + E                   E + E    E → E + E
  E

Handle Pruning (= Shift/Reduce Parsing)

Goal
  Scan the input, recognize handles and replace each handle found by the corresponding left-hand side until we reach the start symbol.

Shift-Reduce Approach
  Use a stack to keep track of partial handles.
  Shift: no complete handle yet. Push the next input symbol onto the stack.
  Reduce: a handle has been detected on top of the stack. Replace the right-hand side by the left-hand side.
    For A → β: pop |β| symbols, push A onto the stack.

Example

Consider
  E → E + E | E * E | (E) | id

  Stack        Input         Action
  $            (id * id)$    shift
  $(           id * id)$     shift
  $(id         * id)$        reduce E → id
  $(E          * id)$        shift
  $(E *        id)$          shift
  $(E * id     )$            reduce E → id
  $(E * E      )$            reduce E → E * E
  $(E          )$            shift
  $(E)         $             reduce E → (E)
  $E           $             accept

  E ⇒ (E) ⇒ (E * E) ⇒ (E * id) ⇒ (id * id)

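The trace above can be reproduced with a small sketch. It assumes a naive strategy that is not part of the slides: reduce whenever some right-hand side sits on top of the stack, otherwise shift. This happens to work for this grammar, but it ignores operator precedence and is not a general handle-finding method. The names `PRODUCTIONS`, `find_handle` and `shift_reduce` are made up for illustration.

```python
# Grammar from the example: E -> E + E | E * E | ( E ) | id
# Right-hand sides are tuples so they can be compared with stack slices.
PRODUCTIONS = [
    ("E", ("E", "+", "E")),
    ("E", ("E", "*", "E")),
    ("E", ("(", "E", ")")),
    ("E", ("id",)),
]


def find_handle(stack):
    """Return the first production whose right-hand side is on top of the stack."""
    for lhs, rhs in PRODUCTIONS:
        if len(stack) >= len(rhs) and tuple(stack[-len(rhs):]) == rhs:
            return lhs, rhs
    return None


def shift_reduce(tokens):
    """Return the list of actions taken, ending in 'accept' or 'error'."""
    stack, rest, actions = [], list(tokens), []
    while True:
        handle = find_handle(stack)
        if handle is not None:
            # Reduce: pop |rhs| symbols and push the left-hand side.
            lhs, rhs = handle
            del stack[-len(rhs):]
            stack.append(lhs)
            actions.append(f"reduce {lhs} -> {' '.join(rhs)}")
        elif rest:
            # Shift: move the next input symbol onto the stack.
            stack.append(rest.pop(0))
            actions.append("shift")
        else:
            actions.append("accept" if stack == ["E"] else "error")
            return actions


ACTIONS = shift_reduce(["(", "id", "*", "id", ")"])
```

For the input ( id * id ) the recorded actions match the Action column of the table. An LR parser replaces `find_handle` by a finite-state machine.
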
LR Parsing

Approach
  Instance of a shift-reduce parser where we employ an FSA to recognize handles.

Method
  Build a right-most derivation S ⇒* w in reverse, i.e. reduce w to S.
  Maintain a stack of viable prefixes, i.e. forms α such that S ⇒* αw for some w ∈ V_t*.
  Shift action: move a token from the input w to the stack.
  Reduce action: apply a grammar rule (in reverse) to the top section of the stack.

Skeleton LR Parser

Algorithm
  push s0
  token := next_token()
  repeat
    s := top of stack
    if action[s,token] = shift si then
      push si
      token := next_token()
    else if action[s,token] = reduce A->beta then
      pop |beta| states
      s := top of stack
      push goto[s,A]
    else if action[s,token] = accept then
      done
    else
      error

Example

  S → E         (1)
  E → T + E     (2)
    | T         (3)
  T → F * T     (4)
    | F         (5)
  F → id        (6)

  state |  action                |  goto
        |  id    +    *    $     |  E   T   F
  0     |  s4                    |  1   2   3
  1     |                  acc   |
  2     |        s5        r3    |
  3     |        r5   s6   r5    |
  4     |        r6   r6   r6    |
  5     |  s4                    |  7   2   3
  6     |  s4                    |      8   3
  7     |                  r2    |
  8     |        r4        r4    |

  Stack        Input             Action
  $ 0          id * id + id$     s4
  $ 0 4        * id + id$        r6
  $ 0 3        * id + id$        s6
  $ 0 3 6      id + id$          s4
  $ 0 3 6 4    + id$             r6
  $ 0 3 6 3    + id$             r5
  $ 0 3 6 8    + id$             r4
  $ 0 2        + id$             s5
  $ 0 2 5      id$               s4
  $ 0 2 5 4    $                 r6
  $ 0 2 5 3    $                 r5
  $ 0 2 5 2    $                 r3
  $ 0 2 5 7    $                 r2
  $ 0 1        $                 acc

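As a minimal sketch, the skeleton algorithm can be run against the table of this example. The encoding is an assumption made for illustration: actions are strings such as "s4" or "r6", and `RULES` maps each production number to its left-hand side and the length of its right-hand side. The names `RULES`, `ACTION`, `GOTO` and `lr_parse` are hypothetical.

```python
# Production number -> (left-hand side, length of right-hand side).
RULES = {1: ("S", 1), 2: ("E", 3), 3: ("E", 1),
         4: ("T", 3), 5: ("T", 1), 6: ("F", 1)}

# action[state, token]; missing entries mean "error".
ACTION = {
    (0, "id"): "s4",
    (1, "$"): "acc",
    (2, "+"): "s5", (2, "$"): "r3",
    (3, "+"): "r5", (3, "*"): "s6", (3, "$"): "r5",
    (4, "+"): "r6", (4, "*"): "r6", (4, "$"): "r6",
    (5, "id"): "s4",
    (6, "id"): "s4",
    (7, "$"): "r2",
    (8, "+"): "r4", (8, "$"): "r4",
}

# goto[state, non-terminal]
GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3,
        (5, "E"): 7, (5, "T"): 2, (5, "F"): 3,
        (6, "T"): 8, (6, "F"): 3}


def lr_parse(tokens):
    """Run the skeleton LR algorithm; return the list of actions taken."""
    stack = [0]                    # stack of states, starting with s0 = 0
    rest = list(tokens) + ["$"]    # input followed by the end marker
    pos, trace = 0, []
    while True:
        act = ACTION.get((stack[-1], rest[pos]), "error")
        trace.append(act)
        if act in ("acc", "error"):
            return trace
        number = int(act[1:])
        if act[0] == "s":          # shift: push the state, advance the input
            stack.append(number)
            pos += 1
        else:                      # reduce A -> beta: pop |beta| states, push goto[s, A]
            lhs, length = RULES[number]
            del stack[-length:]
            stack.append(GOTO[(stack[-1], lhs)])


TRACE = lr_parse(["id", "*", "id", "+", "id"])
```

For the input id * id + id the returned actions follow the Action column of the trace above.
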
LR(k) Grammars

Idea
  A grammar G is LR(k) if, given a right-most derivation

    S = γ_0 ⇒_R γ_1 ⇒_R ... ⇒_R γ_n = w

  we can, for each right-sentential form in the derivation:
  1. isolate the handle of each right-sentential form, and
  2. determine the production by which to reduce
  by scanning γ_i from left to right, going at most k symbols beyond the right end of the handle of γ_i.

LR(k) Items

The table construction algorithm uses sets of LR(k) items or configurations to represent the possible states in a parse.

An LR(k) item is a pair [α, β] where
  α is a production from G with a • at some position in the rhs, marking how much of the rhs of a production has already been seen
  β is a lookahead string containing k symbols (terminals or $)

As we will see, the two cases of interest are k=0 and k=1.

Example

The • indicates how much of an item we have seen at a given state in the parse:
  [A → • XYZ] indicates that the parser is looking for a string that can be derived from XYZ.
  [A → XY • Z] indicates that the parser has seen a string derived from XY and is looking for one derivable from Z.

LR(0) items (no lookahead): A → XYZ generates 4 LR(0) items:
  1. [A → • XYZ]
  2. [A → X • YZ]
  3. [A → XY • Z]
  4. [A → XYZ •]

The CFSM

The Characteristic Finite State Machine (CFSM) for a grammar is a DFA which recognizes viable prefixes of right-sentential forms (Recall: a viable prefix is any prefix that does not extend beyond the handle).
It accepts when a handle has been discovered and needs to be reduced.

To construct the CFSM we need two functions:
  closure0(I) to build its states
  goto0(I,X) to determine its transitions

closure0

Given an item [A → α • Bβ], its closure contains the item itself and any other item that can generate legal substrings to follow α.
Thus, if the parser has viable prefix α on the stack, the input should reduce to Bβ (or γ for another item [B → • γ] in the closure).

closure0(I):
  repeat
    if [A → α • Bβ] ∈ I and B → γ is a rule
    then add [B → • γ] to I
  until no more items can be added to I
  return I

goto0

goto0(I,X) is the closure of the set of all items [A → αX • β] such that [A → α • Xβ] ∈ I.
If I is the set of valid items for some viable prefix γ, then goto0(I,X) is the set of valid items for the viable prefix γX.
goto0(I,X) represents the state after recognizing X in state I.

goto0(I,X):
  let J be the set of items [A → αX • β] such that [A → α • Xβ] ∈ I
  return closure0(J)

Example

Consider
  P → C P | ε
  C → a g B e | a e
  B → a c b | a

Assume I = {[C → a g • B e]}. Then (I omit [ and ] for simplicity)
  closure0(I) = { C → a g • B e,
                  B → • a c b,
                  B → • a }

Assume I = {[P → •], [P → • C P]}. Then
  goto0(I,C) = { P → C • P,
                 P → • C P,
                 P → •,
                 C → • a g B e,
                 C → • a e }

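A minimal sketch of closure0 and goto0, checked against the two cases of this example. The representation is an assumption made for illustration: an item is a triple (lhs, rhs, dot) where rhs is a tuple of symbols and dot is the position of the •, and `GRAMMAR` maps each non-terminal to its right-hand sides (ε is the empty tuple).

```python
GRAMMAR = {
    "P": [("C", "P"), ()],                      # P -> C P | epsilon
    "C": [("a", "g", "B", "e"), ("a", "e")],    # C -> a g B e | a e
    "B": [("a", "c", "b"), ("a",)],             # B -> a c b | a
}


def closure0(items):
    """Add [B -> . gamma] for every item with the dot in front of a non-terminal B."""
    result = set(items)
    work = list(items)
    while work:
        _, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for gamma in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], gamma, 0)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return frozenset(result)


def goto0(items, x):
    """Move the dot over x in every item that allows it, then take the closure."""
    moved = {(lhs, rhs, dot + 1)
             for lhs, rhs, dot in items
             if dot < len(rhs) and rhs[dot] == x}
    return closure0(moved)


# First case of the example: I = {[C -> a g . B e]}
CLOSURE = closure0({("C", ("a", "g", "B", "e"), 2)})

# Second case: I = {[P -> .], [P -> . C P]}
GOTO_C = goto0({("P", (), 0), ("P", ("C", "P"), 0)}, "C")
```

`CLOSURE` contains three items and `GOTO_C` five, matching the sets listed above.
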
Constructing the LR(0) Item Sets

We introduce a new start symbol S' where S' → S (some presentations assume S' → S$).

  States = {closure0({S' → • S})}
  while we can keep making changes
    for each I ∈ States and symbol X
      I' = goto0(I,X)
      if I' is non-empty and I' ∉ States then States = States ∪ {I'}

LR(0) Example

Grammar:
  S → E         (1)
  E → E + T     (2)
    | T         (3)
  T → id        (4)
    | (E)       (5)

CFSM (transitions):
  0 --E--> 1    0 --T--> 8    0 --id--> 4    0 --(--> 5
  1 --+--> 2
  2 --T--> 3    2 --id--> 4   2 --(--> 5
  5 --E--> 6    5 --T--> 8    5 --id--> 4    5 --(--> 5
  6 --)--> 7    6 --+--> 2

LR(0) items:
  I0: S → • E
      E → • E + T
      E → • T
      T → • id
      T → • (E)
  I1: S → E •
      E → E • + T
  I2: E → E + • T
      T → • id
      T → • (E)
  I3: E → E + T •
  I4: T → id •
  I5: T → ( • E)
      E → • E + T
      E → • T
      T → • id
      T → • (E)
  I6: T → (E • )
      E → E • + T
  I7: T → (E) •
  I8: E → T •

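The construction of the item sets can be sketched as follows for the grammar of this example. Items are assumed to be triples (lhs, rhs, dot), and `GRAMMAR`, `SYMBOLS` and `lr0_states` are names chosen for illustration. States are numbered in the order they are discovered, so the numbers differ from I0 to I8 above, while the sets themselves are the same.

```python
GRAMMAR = {
    "S": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("id",), ("(", "E", ")")],
}
SYMBOLS = ["E", "T", "id", "(", ")", "+"]


def closure0(items):
    """Add [B -> . gamma] for every item with the dot in front of a non-terminal B."""
    result, work = set(items), list(items)
    while work:
        _, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for gamma in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], gamma, 0)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return frozenset(result)


def goto0(items, x):
    """Move the dot over x, then take the closure."""
    return closure0({(lhs, rhs, dot + 1) for lhs, rhs, dot in items
                     if dot < len(rhs) and rhs[dot] == x})


def lr0_states(start_item):
    """Return all reachable item sets and the CFSM transitions between them."""
    states = [closure0({start_item})]
    transitions = {}
    for state in states:              # the list grows while we iterate over it
        for x in SYMBOLS:
            target = goto0(state, x)
            if not target:            # empty set: no transition on x
                continue
            if target not in states:
                states.append(target)
            transitions[(states.index(state), x)] = states.index(target)
    return states, transitions


STATES, TRANSITIONS = lr0_states(("S", ("E",), 0))
```

The sketch finds nine item sets and fourteen transitions, in line with the CFSM of the example.
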
Constructing the Parsing Table

We have constructed the LR(0) items and the CFSM (state i of the CFSM is constructed from I_i).

  If [A → α • aβ] ∈ I_i and goto0(I_i,a) = I_j then action(i,a) = shift j.
  If [A → α •] ∈ I_i then action(i,a) = reduce(A → α) for all a.
  If [S' → S •] ∈ I_i then action(i,$) = accept.
  If goto0(I_i,A) = I_j (where A ∈ V_n) then goto(i,A) = j.
  Set undefined entries in action and goto to error.
  Set the initial state of the parser to closure0([S' → • S]).

LR(0) Example

CFSM: as in the LR(0) example above.

  state |  action                      |  goto
        |  id    (    )    +    $      |  S   E   T
  0     |  s4    s5                    |      1   8
  1     |                  s2   acc    |
  2     |  s4    s5                    |          3
  3     |  r2    r2   r2   r2   r2     |
  4     |  r4    r4   r4   r4   r4     |
  5     |  s4    s5                    |      6   8
  6     |             s7   s2          |
  7     |  r5    r5   r5   r5   r5     |
  8     |  r3    r3   r3   r3   r3     |

Another Example

Consider
  S → A | B c | D
  A → a | b
  B → a
  D → b c

Is this grammar LR(0)?

Another Example (cont'd)

Grammar:
  S → A | B c | D
  A → a | b
  B → a
  D → b c

LR(0) items:
  I0: S' → • S
      S  → • A
      S  → • B c
      S  → • D
      A  → • a
      A  → • b
      B  → • a
      D  → • b c
  I1: S' → S •
  I2: S  → A •
  I3: S  → B • c
  I4: S  → B c •
  I5: S  → D •
  I6: A  → a •       reduce
      B  → a •       reduce      conflict!
  I7: A  → b •       reduce
      D  → b • c     shift       conflict!
  ...

The above grammar is not LR(0).

LR Parsing Short Summary

If the LR(0) parsing table contains multiply-defined action entries then G is not LR(0).
There are two possible types of conflicts:
  shift-reduce
  reduce-reduce
Conflicts can be resolved by looking ahead.

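The two kinds of conflict can be detected directly on an LR(0) item set. The following sketch assumes items are triples (lhs, rhs, dot) and uses the hypothetical helper `lr0_conflicts`. The item sets are I3, I6 and I7 of the preceding example, written out by hand.

```python
NONTERMINALS = {"S'", "S", "A", "B", "D"}


def lr0_conflicts(items):
    """Classify the LR(0) conflicts of a single item set."""
    # Complete items [A -> alpha .] call for a reduction.
    reduces = [(lhs, rhs) for lhs, rhs, dot in items if dot == len(rhs)]
    # Items with the dot in front of a terminal call for a shift.
    shifts = [(lhs, rhs) for lhs, rhs, dot in items
              if dot < len(rhs) and rhs[dot] not in NONTERMINALS]
    found = []
    if len(reduces) > 1:
        found.append("reduce-reduce")
    if reduces and shifts:
        found.append("shift-reduce")
    return found


I3 = {("S", ("B", "c"), 1)}                     # S -> B . c
I6 = {("A", ("a",), 1), ("B", ("a",), 1)}       # A -> a .   B -> a .
I7 = {("A", ("b",), 1), ("D", ("b", "c"), 1)}   # A -> b .   D -> b . c
```

Here `lr0_conflicts(I6)` reports a reduce-reduce conflict and `lr0_conflicts(I7)` a shift-reduce conflict, while I3 is conflict-free.
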
SLR(1): Simple Lookahead LR

SLR(1)
  Add lookaheads after building the LR(0) item sets.

We construct the SLR(1) parsing table as follows:
  If [A → α • aβ] ∈ I_i (a ∈ V_t) and goto0(I_i,a) = I_j then action(i,a) = shift j.
  If [A → α •] ∈ I_i then action(i,a) = reduce(A → α) for all a ∈ Follow(A).
  If [S' → S •] ∈ I_i then action(i,$) = accept.
  If goto0(I_i,A) = I_j (where A ∈ V_n) then goto(i,A) = j.
  Set undefined entries in action and goto to error.
  Set the initial state of the parser to closure0([S' → • S]).

SLR(1) but not LR(0)

Example
  S → A      (1)
    | B c    (2)
    | b c    (3)      Follow(S) = {$}
  A → a      (4)
    | b      (5)      Follow(A) = {$}
  B → a      (6)      Follow(B) = {c}

  I0: S' → • S
      S  → • A
      S  → • B c
      S  → • b c
      A  → • a
      A  → • b
      B  → • a
  I1: S' → S •
  I2: S  → A •
  I3: S  → B • c
  I4: S  → B c •
  I5: A  → a •
      B  → a •
  I6: A  → b •
      S  → b • c
  I7: S  → b c •

CFSM (transitions):
  0 --S--> 1    0 --A--> 2    0 --B--> 3    0 --a--> 5    0 --b--> 6
  3 --c--> 4
  6 --c--> 7

SLR(1) Parsing Table

  S → A      (1)
    | B c    (2)
    | b c    (3)      Follow(S) = {$}
  A → a      (4)
    | b      (5)      Follow(A) = {$}
  B → a      (6)      Follow(B) = {c}

  state |  action                |  goto
        |  a     b     c    $    |  S   A   B
  0     |  s5    s6              |  1   2   3
  1     |                   acc  |
  2     |                   r1   |
  3     |              s4        |
  4     |                   r2   |
  5     |              r6   r4   |
  6     |              s7   r5   |
  7     |                   r3   |

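The Follow sets used for the reduce entries can be computed with the usual fixed-point iteration. This is a sketch under assumptions: the grammar is given as a list of (lhs, rhs) pairs including S' → S, and `first_sets` and `follow_sets` are illustrative names, not taken from the slides.

```python
GRAMMAR = [
    ("S'", ("S",)),
    ("S", ("A",)), ("S", ("B", "c")), ("S", ("b", "c")),
    ("A", ("a",)), ("A", ("b",)),
    ("B", ("a",)),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}


def first_sets():
    """Compute First(A) for every non-terminal A plus the set of nullable ones."""
    first = {n: set() for n in NONTERMINALS}
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            before = (len(first[lhs]), lhs in nullable)
            for sym in rhs:
                if sym not in NONTERMINALS:
                    first[lhs].add(sym)
                    break
                first[lhs] |= first[sym]
                if sym not in nullable:
                    break
            else:                      # every symbol of rhs is nullable
                nullable.add(lhs)
            if before != (len(first[lhs]), lhs in nullable):
                changed = True
    return first, nullable


def follow_sets(start="S'"):
    """Compute Follow(A) for every non-terminal A."""
    first, nullable = first_sets()
    follow = {n: set() for n in NONTERMINALS}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            trailer = set(follow[lhs])     # what may follow the symbol inspected
            for sym in reversed(rhs):
                if sym in NONTERMINALS:
                    if not trailer <= follow[sym]:
                        follow[sym] |= trailer
                        changed = True
                    if sym in nullable:
                        trailer = trailer | first[sym]
                    else:
                        trailer = set(first[sym])
                else:
                    trailer = {sym}
    return follow


FOLLOW = follow_sets()
```

For this grammar the sketch yields Follow(S) = {$}, Follow(A) = {$} and Follow(B) = {c}, which is why state 5 reduces by rule 6 on c and by rule 4 on $ without a conflict.
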
Limitations of SLR(1)

Consider
  S' → S
  S  → L = R | R
  L  → * R | id
  R  → L

LR(0) items:
  I0: S' → • S
      S  → • L = R
      S  → • R
      L  → • * R
      L  → • id
      R  → • L
  I1: S' → S •
  I2: S  → L • = R      shift!
      R  → L •          reduce! (= ∈ Follow(R))

Limitations of SLR(1) (cont'd)

The example shows that it can be insufficient to consider Follow sets.
Assume we are parsing id = id.
In state I2, the shift action is the only correct choice (reducing L to R gets us nowhere).
Note that the item R → L • came via S' ⇒ S ⇒ R ⇒ L (each symbol can only be followed by $).
Hence, we need to make an effort to remember the actual right context.

LR(1) Items Revisited

Idea
  We can incorporate the extra right-context information into items.
  An LR(1) item is of the form [A → α • β, t] where t ∈ V_t ∪ {$}.
  An item [A → α •, a] calls for the reduction A → α only if the next input symbol is a.

LR(1) Items Revisited (cont'd)

closure1 with Context
  closure1(I):
    while new items can be added
      if [A → α • Bγ, c] ∈ I and B → δ is a rule
      then for each b ∈ First(γc) add [B → • δ, b] to I

goto1 with Context
  goto1(I,X):
    let J be the set of items [A → αX • β, a] such that [A → α • Xβ, a] ∈ I
    return closure1(J)

Initial State
  closure1({[S' → • S, $]})

LR(1) Parsing Table

Algorithm
  For each set I_i of LR(1) items:
    If [A → α • aβ, b] ∈ I_i and goto1(I_i,a) = I_j then action(i,a) = shift j.
    If [A → α •, b] ∈ I_i (A ≠ S') then action(i,b) = reduce(A → α).
    If [S' → S •, $] ∈ I_i then action(i,$) = accept.
    If goto1(I_i,A) = I_j (where A ∈ V_n) then goto(i,A) = j.
  All remaining entries are set to error.
  The initial state contains [S' → • S, $].

Recall Example

Consider
  S' → S
  S  → L = R | R
  L  → * R | id
  R  → L

LR(1) items:
  I0: [S' → • S, $]
      [S  → • L = R, $]
      [S  → • R, $]
      [L  → • * R, =/$]     = due to First(= R), $ due to [R → • L, $]
      [L  → • id, =/$]
      [R  → • L, $]
  I1: [S' → S •, $]
  I2: [S  → L • = R, $]
      [R  → L •, $]         no =

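A minimal sketch of closure1 and goto1 for this grammar. It relies on a simplifying assumption that is not part of the slides: the grammar has no ε-productions, so First(γc) is First of the first symbol of γ, or {c} when γ is empty. Items are assumed to be tuples (lhs, rhs, dot, lookahead), and all function names are illustrative.

```python
GRAMMAR = {
    "S'": [("S",)],
    "S": [("L", "=", "R"), ("R",)],
    "L": [("*", "R"), ("id",)],
    "R": [("L",)],
}


def first_of_symbol(sym):
    """First(sym), assuming no epsilon-productions: terminals reachable as first symbol."""
    seen, work, result = set(), [sym], set()
    while work:
        x = work.pop()
        if x in seen:
            continue
        seen.add(x)
        if x in GRAMMAR:
            work.extend(rhs[0] for rhs in GRAMMAR[x])
        else:
            result.add(x)
    return result


def closure1(items):
    """For [A -> alpha . B gamma, c] add [B -> . delta, b] for each b in First(gamma c)."""
    result, work = set(items), list(items)
    while work:
        _, rhs, dot, look = work.pop()
        if dot == len(rhs) or rhs[dot] not in GRAMMAR:
            continue
        gamma = rhs[dot + 1:]
        lookaheads = first_of_symbol(gamma[0]) if gamma else {look}
        for delta in GRAMMAR[rhs[dot]]:
            for b in lookaheads:
                item = (rhs[dot], delta, 0, b)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return frozenset(result)


def goto1(items, x):
    """Move the dot over x, keeping each lookahead, then take the closure."""
    return closure1({(lhs, rhs, dot + 1, a) for lhs, rhs, dot, a in items
                     if dot < len(rhs) and rhs[dot] == x})


I0 = closure1({("S'", ("S",), 0, "$")})
I2 = goto1(I0, "L")
```

In `I2` the complete item R → L • carries only the lookahead $, so the parser shifts on = instead of reducing, which removes the SLR(1) conflict.
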
Another Example

Grammar:
  S' → S      (0)
  S  → C C    (1)
  C  → c C    (2)
     | d      (3)

  state |  action           |  goto
        |  c     d     $    |  S   C
  0     |  s3    s4         |  1   2
  1     |              acc  |
  2     |  s6    s7         |      5
  3     |  s3    s4         |      8
  4     |  r3    r3         |
  5     |              r1   |
  6     |  s6    s7         |      9
  7     |              r3   |
  8     |  r2    r2         |
  9     |              r2   |

LR(1) items:
  I0: S' → • S, $
      S  → • C C, $
      C  → • c C, c/d
      C  → • d, c/d
  I1: S' → S •, $
  I2: S  → C • C, $
      C  → • c C, $
      C  → • d, $
  I3: C  → c • C, c/d
      C  → • c C, c/d
      C  → • d, c/d
  I4: C  → d •, c/d
  I5: S  → C C •, $
  I6: C  → c • C, $
      C  → • c C, $
      C  → • d, $
  I7: C  → d •, $
  I8: C  → c C •, c/d
  I9: C  → c C •, $

Another Example (2)

In the LR(1) item sets of the example above, I3 and I6 have the same core!

LALR: Lookahead LR

Idea
  Define the core of a set of LR(1) items to be the set of LR(0) items derived by ignoring the lookahead symbols.
  E.g. the two sets
    {[C → c • C, c/d], [C → • c C, c/d], [C → • d, c/d]}
    {[C → c • C, $], [C → • c C, $], [C → • d, $]}
  have the same core.
  Key idea: if two sets of LR(1) items, I_i and I_j, have the same core, we can merge the states that represent them in the action and goto tables.

LALR(1) algorithm

Algorithm
  Construct the LR(1) items.
  For each core among the sets of LR(1) items, find all sets with that core, replace these sets by their union and update the goto function (incrementally).
  For each I_i:
    If [A → α • aβ, b] ∈ I_i and goto1(I_i,a) = I_j then action(i,a) = shift j.
    If [A → α •, b] ∈ I_i (A ≠ S') then action(i,b) = reduce(A → α).
    If [S' → S •, $] ∈ I_i then action(i,$) = accept.
    If goto1(I_i,A) = I_j (where A ∈ V_n) then goto(i,A) = j.
  All remaining entries are set to error.
  Initial state = closure1({[S' → • S, $]}).

A more efficient LALR construction is possible, see textbook.

Recall Example (2)

Grammar:
  S' → S      (0)
  S  → C C    (1)
  C  → c C    (2)
     | d      (3)

LR(1) parsing table and item sets I0 to I9: as in the example above.

Merged states:
  I36: C → c • C, c/d/$
       C → • c C, c/d/$
       C → • d, c/d/$
  I47: C → d •, c/d/$
  I89: C → c C •, c/d/$

Recall Example (3)

Grammar:
  S' → S      (0)
  S  → C C    (1)
  C  → c C    (2)
     | d      (3)

Updated table (after merging I3/I6, I4/I7 and I8/I9 into I36, I47 and I89):

  state |  action            |  goto
        |  c      d      $   |  S   C
  0     |  s36    s47        |  1   2
  1     |                acc |
  2     |  s36    s47        |      5
  36    |  s36    s47        |      89
  47    |  r3     r3     r3  |
  5     |                r1  |
  89    |  r2     r2     r2  |

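Merging by core can be sketched as follows. The representation is an assumption made for illustration: an LR(1) item is a tuple (lhs, rhs, dot, lookahead), the item sets I3 to I9 of the example are written out by hand, and `lr1`, `core` and `merge_by_core` are hypothetical helper names. Updating the action and goto tables is left out.

```python
from collections import defaultdict


def lr1(lhs, rhs, dot, lookaheads):
    """Build one LR(1) item per lookahead character, e.g. 'cd' -> c and d."""
    return {(lhs, tuple(rhs.split()), dot, a) for a in lookaheads}


STATES = {
    "3": lr1("C", "c C", 1, "cd") | lr1("C", "c C", 0, "cd") | lr1("C", "d", 0, "cd"),
    "4": lr1("C", "d", 1, "cd"),
    "5": lr1("S", "C C", 2, "$"),
    "6": lr1("C", "c C", 1, "$") | lr1("C", "c C", 0, "$") | lr1("C", "d", 0, "$"),
    "7": lr1("C", "d", 1, "$"),
    "8": lr1("C", "c C", 2, "cd"),
    "9": lr1("C", "c C", 2, "$"),
}


def core(state):
    """The LR(0) items obtained by dropping the lookahead symbols."""
    return frozenset((lhs, rhs, dot) for lhs, rhs, dot, _ in state)


def merge_by_core(states):
    """Replace all item sets sharing a core by their union."""
    groups = defaultdict(list)
    for name, items in states.items():
        groups[core(items)].append(name)
    merged = {}
    for names in groups.values():
        # The merged state is named after its parts, e.g. "3" and "6" -> "36".
        merged["".join(names)] = frozenset().union(*(states[n] for n in names))
    return merged


MERGED = merge_by_core(STATES)
```

The seven hand-written sets collapse into four: I36, I47, I5 and I89. The reduce item in I47 then carries the lookaheads c, d and $, which corresponds to the three r3 entries in the updated table.
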
Summary

LR is more powerful than LL.
ANTLR is a pretty cool Java-based LL parser generator.
LALR is good enough in practice (see yacc).
Precedence can also be used to resolve shift/reduce conflicts.
  E.g. higher precedence ⇒ shift, left associative ⇒ reduce.
Error recovery is important (but no time!), please consult the textbook.