Everything You Always Wanted to Know About Parsing Part V : LR Parsing University of Padua, Italy ESSLLI, August 2013
Introduction Parsing strategies are classified by the time at which the associated PDA commits to a production in the analysis: Top-Down, Bottom-Up, Left-Corner. [Figure: a production A → B β, annotated with the commitment points of the three strategies]
Introduction LR parsing was discovered in 1965 in an article by Donald Knuth, California Institute of Technology, Pasadena, California. L stands for left-to-right scanning of the input and R stands for (reversed) rightmost derivation. Works in linear time for deterministic context-free languages. The algorithm consists of a shift-reduce, deterministic PDA called the LR PDA, realizing a bottom-up parsing strategy. Widely used for parsing of programming languages, where LR PDAs are mechanically generated from a formal grammar
Introduction Generalized LR parsing (GLR for short) extends the LR method to handle nondeterministic and ambiguous CFGs. First presented in 1985 by Masaru Tomita in his PhD dissertation, Carnegie Mellon University. Very popular in computational linguistics in the late eighties
Introduction The GLR algorithm consists of: the LR PDA, as in Knuth's original proposal; the definition of a technique for simulating the LR PDA, based on a data structure called the graph-structured stack. We disregard the graph-structured stack, and apply our tabulation algorithm to simulate the LR PDA. The outcome is an optimized version of the original GLR parsing algorithm
Stack Symbols Let P be a set of CFG productions; a stack symbol of the LR PDA is a non-empty subset of I_P = {A → α₁ • α₂ | (A → α₁α₂) ∈ P} satisfying some additional property, to be introduced later. Informally, a stack symbol represents a disjunction of dotted rules; in this way several alternative parsing analyses can be carried out simultaneously
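As a concrete illustration (a minimal sketch, not part of the original slides), the set I_P can be computed directly from the productions, encoding a production as a pair (lhs, rhs) and a dotted rule A → α₁ • α₂ as a triple (lhs, rhs, dot position):

```python
# Sketch: the set I_P of dotted rules. All names are illustrative,
# not taken from any course implementation.

def dotted_items(productions):
    """All dotted rules A -> alpha1 . alpha2 over the given productions."""
    return {(lhs, rhs, dot)
            for (lhs, rhs) in productions
            for dot in range(len(rhs) + 1)}

P = [("S", ("S", "+", "S")), ("S", ("a",))]
items = dotted_items(P)
# A production A -> alpha contributes |alpha| + 1 dotted rules,
# so here |I_P| = 4 + 2 = 6.
```

Any non-empty subset of `items` is then a candidate stack symbol, subject to the additional property introduced later.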
Stack Symbols Example: stack symbol q = {A → B C • γ, A → C • γ} represents two alternative analyses; observe how the strings preceding the dots are suffixes of a common string. [Figure: two partial parse trees over a₁ ⋯ aᵢ, one using A → BC γ with B spanning a₁ ⋯ aⱼ, the other using A → C γ]
Stack Symbols We need a function closure to precompute the closure of the prediction step from the Earley algorithm. Let q ⊆ I_P; closure(q) is the smallest set such that: q ⊆ closure(q); and if (A → α • Bα′) ∈ closure(q) and (B → β) ∈ P then (B → • β) ∈ closure(q)
Stack Symbols Example: if P consists of the productions S → S + S, S → a, we have closure({S → S + • S}) = {S → S + • S, S → • S + S, S → • a}
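The closure function above can be sketched in Python (an illustrative rendering, with dotted rules encoded as (lhs, rhs, dot) triples as before; names are assumptions, not course code):

```python
# Sketch of closure(): repeatedly predict B -> . beta whenever some
# item in the set has its dot in front of B.

def closure(items, productions):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(result):
            if dot < len(rhs):              # dot in front of some symbol B
                b = rhs[dot]
                for (l, r) in productions:
                    if l == b and (l, r, 0) not in result:
                        result.add((l, r, 0))   # predict B -> . beta
                        changed = True
    return result

P = [("S", ("S", "+", "S")), ("S", ("a",))]
# closure({S -> S + . S}) adds S -> . S + S and S -> . a
q = closure({("S", ("S", "+", "S"), 2)}, P)
```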
Stack Symbols We need a function goto to precompute the scanner and the completer steps from the Earley algorithm, followed by closure(). Let q ⊆ I_P and X ∈ N ∪ Σ; we have goto(q, X) = closure({A → αX • α′ | (A → α • Xα′) ∈ q})
Stack Symbols Example (cont'd): if P consists of the productions S → S + S, S → a, we have goto({S → S + • S, S → • S + S, S → • a}, a) = {S → a •} and goto({S → S + • S, S → • S + S, S → • a}, S) = {S → S + S •, S → S • + S}
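The goto function can be sketched the same way (closure is repeated here so the snippet is self-contained; the representation and all names are illustrative assumptions):

```python
# Sketch of goto(): advance the dot over X, then apply closure.

def closure(items, productions):
    result, agenda = set(items), list(items)
    while agenda:
        (lhs, rhs, dot) = agenda.pop()
        if dot < len(rhs):                   # dot in front of some symbol B
            for (l, r) in productions:
                if l == rhs[dot] and (l, r, 0) not in result:
                    result.add((l, r, 0))    # predict B -> . beta
                    agenda.append((l, r, 0))
    return result

def goto(q, x, productions):
    """Move the dot over X in every item of q that allows it, then close."""
    moved = {(lhs, rhs, dot + 1)
             for (lhs, rhs, dot) in q
             if dot < len(rhs) and rhs[dot] == x}
    return closure(moved, productions)

P = [("S", ("S", "+", "S")), ("S", ("a",))]
q3 = closure({("S", ("S", "+", "S"), 2)}, P)
# goto(q3, "a") = {S -> a .}
# goto(q3, "S") = {S -> S + S ., S -> S . + S}
```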
Stack Symbols The initial stack symbol of the LR PDA is q_in = closure({S → • α | (S → α) ∈ P}). The final stack symbol of the LR PDA is a special symbol q_fin
Stack Symbols The stack alphabet Q_LR of the LR PDA is the smallest set satisfying: q_in ∈ Q_LR; q_fin ∈ Q_LR; and if q ∈ Q_LR and goto(q, X) ≠ ∅ for some X ∈ N ∪ Σ, then goto(q, X) ∈ Q_LR
Stack Symbols Example (cont'd): P consists of the productions S → S + S and S → a. The goto function, also called the LR table, excluding q_fin:
q_in = {S → • S + S, S → • a}: goto(q_in, S) = q₂, goto(q_in, a) = q₁
q₁ = {S → a •}
q₂ = {S → S • + S}: goto(q₂, +) = q₃
q₃ = {S → S + • S, S → • S + S, S → • a}: goto(q₃, S) = q₄, goto(q₃, a) = q₁
q₄ = {S → S + S •, S → S • + S}: goto(q₄, +) = q₃
By construction, in each state the strings preceding the dots are suffixes of a common string
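The whole table can be reproduced by a fixpoint construction over closure and goto (an illustrative sketch; the helper functions from the previous snippets are repeated so the block is self-contained, and `build_lr_states` is a hypothetical name):

```python
# Sketch: compute Q_LR (without q_fin) and the goto table by fixpoint,
# starting from q_in and applying goto until no new states appear.

def closure(items, productions):
    result, agenda = set(items), list(items)
    while agenda:
        (lhs, rhs, dot) = agenda.pop()
        if dot < len(rhs):
            for (l, r) in productions:
                if l == rhs[dot] and (l, r, 0) not in result:
                    result.add((l, r, 0))
                    agenda.append((l, r, 0))
    return result

def goto(q, x, productions):
    moved = {(l, r, d + 1) for (l, r, d) in q if d < len(r) and r[d] == x}
    return closure(moved, productions)

def build_lr_states(productions, start="S"):
    symbols = {s for (_, rhs) in productions for s in rhs}
    symbols |= {lhs for (lhs, _) in productions}
    q_in = frozenset(closure({(l, r, 0) for (l, r) in productions if l == start},
                             productions))
    states, table, agenda = {q_in}, {}, [q_in]
    while agenda:
        q = agenda.pop()
        for x in symbols:
            q2 = frozenset(goto(q, x, productions))
            if q2:
                table[(q, x)] = q2
                if q2 not in states:
                    states.add(q2)
                    agenda.append(q2)
    return q_in, states, table

P = [("S", ("S", "+", "S")), ("S", ("a",))]
q_in, states, table = build_lr_states(P)
# For this grammar the construction yields 5 states: q_in and q1..q4 above.
```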
LR PDA The LR PDA M LR is constructed from G as follows The input alphabet of M LR is the terminal alphabet of G The stack alphabet of M LR is Q LR, as previously defined
LR PDA The set Δ_LR contains transitions having one of the following forms:
shift: q₁, a ↦ q₁ q₂, for each q₁, q₂ and a with q₂ = goto(q₁, a)
reduce: q₀ q₁ ⋯ q_m, ε ↦ q₀ q, for each q₀, q₁, …, q_m, q and A → α with (A → α •) ∈ q_m, |α| = m and q = goto(q₀, A)
reduce (special case): q_in q₁ ⋯ q_m, ε ↦ q_fin, for each q₁, …, q_m and S → α with (S → α •) ∈ q_m and |α| = m
LR PDA Each stack symbol of M LR records the set of rules that are compatible with the sequence of nonterminals recognized so far Thus alternative analyses within a stack symbol match suffixes of the sequence of nonterminals recognized so far; we call this the suffix property The implemented strategy is bottom-up : the commitment to a production occurs as late as possible, after all of the production s right-hand side symbols have been recognized
Computational Complexity A production A → α has exactly |Aα| dotted items; therefore |I_P| = |G|. It is not difficult to construct a worst-case set P such that |Q_LR| = O(exp(|G|)) and |M_LR| = O(exp(|G|))
M_LR Computation Example (cont'd): input string w = a + a + a. [Figure: the two parse trees of w, a left-branching one and a right-branching one, arising from the ambiguous production S → S + S]
M_LR Computation Example (cont'd): input string w = a + a + a, first computation. [Figure: shift a (stack q_in q₁ at position 1), reduce S → a (stack q_in q₂), shift + (stack q_in q₂ q₃ at position 2)]
M_LR Computation Example (cont'd): input string w = a + a + a, first computation. [Figure: shift a (stack q_in q₂ q₃ q₁ at position 3), reduce S → a (stack q_in q₂ q₃ q₄), reduce S → S + S (stack q_in q₂), shift + (stack q_in q₂ q₃ at position 4)]
M_LR Computation Example (cont'd): input string w = a + a + a, first computation. [Figure: shift a (stack q_in q₂ q₃ q₁ at position 5), reduce S → a (stack q_in q₂ q₃ q₄), reduce (special case) for S → S + S, leaving q_fin]
M_LR Computation Example (cont'd): input string w = a + a + a, second computation. [Figure: identical to the first computation up to position 2: shift a, reduce S → a, shift +]
M_LR Computation [Figure: second computation, positions 3 to 5: shift a (stack q_in q₂ q₃ q₁), reduce S → a (stack q_in q₂ q₃ q₄), shift + (stack q_in q₂ q₃ q₄ q₃ at position 4), shift a (stack q_in q₂ q₃ q₄ q₃ q₁ at position 5)]
M_LR Computation [Figure: second computation, position 5: reduce S → a (stack q_in q₂ q₃ q₄ q₃ q₄), reduce S → S + S (stack q_in q₂ q₃ q₄), reduce (special case) for S → S + S, leaving q_fin]
M_E and M_LR Comparison Example: Earley PDA on sample rule A₀ → A₁ A₂ A₃ (simplified notation). [Figure: stack items [• A₁A₂A₃], [A₁ • A₂A₃], [A₁A₂ • A₃], [A₁A₂A₃ •] at input positions i₁, i₂, i₃, i₄]
M_E and M_LR Example (cont'd): LR PDA on sample rule A₀ → A₁ A₂ A₃ (simplified notation). [Figure: stack contents at positions i₁ to i₄; the stack grows from {• A₁A₂A₃} at i₁ to {• A₁A₂A₃} {A₁ • A₂A₃} {A₁A₂ • A₃} {A₁A₂A₃ •} at i₄, so all dotted items remain on the stack until the reduce]
M_E and M_LR Example (cont'd): M_E tabulation on sample rule A₀ → A₁ A₂ A₃. [Figure: items A₁ • A₂A₃, A₁A₂ • A₃ and A₁A₂A₃ • spanning from i₁ to i₂, i₃ and i₄ respectively, with A₁, A₂, A₃ recognized over the three segments] No need to reduce
M_E and M_LR Example (cont'd): M_LR tabulation on sample rule A₀ → A₁ A₂ A₃. [Figure: as before, items A₁ • A₂A₃, A₁A₂ • A₃ and A₁A₂A₃ • over segments from i₁, plus a reduce step combining all of them in a single deduction] Single-step reduction should be avoided; this issue will be addressed later
Remarks M_LR is deterministic only for so-called deterministic context-free languages. For general CFGs, M_LR is strictly nondeterministic, with an exponential proliferation of computations in the length of the input string. But this need not concern us: the only purpose of M_LR is to specify a parsing strategy; nondeterminism can be simulated using (an extension of) our deterministic tabulation algorithm
GLR Algorithm We derive the GLR algorithm by tabulation of the LR PDA. Items in the GLR algorithm have the form (q, j, q′, i). In this case we cannot drop the first item component as we did for the CKY and Earley algorithms
Deduction Rules We overview the GLR tabular algorithm using inference rules. M_LR starts with q_in in the stack. This corresponds to the init step of the tabular algorithm: ⊢ (q_in, 0, q_in, 0)
Deduction Rules The shift transition q₁, a ↦ q₁ q₂ can be implemented as: (q, j, q₁, i−1) ⊢ (q₁, i−1, q₂, i), provided a = aᵢ and q₂ = goto(q₁, a)
Deduction Rules The reduce transition q₀ q₁ ⋯ q_{m−1} q_m, ε ↦ q₀ q is in a form more general than the reduce transition available for our tabulation algorithm. This reduction can be implemented as: (q₀, j₀, q₁, j₁), (q₁, j₁, q₂, j₂), …, (q_{m−1}, j_{m−1}, q_m, j_m) ⊢ (q₀, j₀, q, j_m)
Deduction Rules The deduction rule for the reduce transition q₀ q₁ ⋯ q_{m−1} q_m, ε ↦ q₀ q can be graphically represented as: [Figure: antecedent items (q₀, q₁), (q₁, q₂), …, (q_{m−1}, q_m) spanning adjacent segments from j₀ to j_m, combined into the consequent item (q₀, q) spanning j₀ to j_m]
GLR Pseudocode Algorithm 5 GLR Algorithm (from Tabulation of LR PDA)
1: T ← {(q_in, 0, q_in, 0)}
2: for i ← 1, …, n do  ▷ edges with right end at i
3:   D ← ∅
4:   for each (q, j, q₁, i−1) ∈ T do
5:     for each (q₁, aᵢ ↦ q₁ q₂) ∈ Δ_LR do  ▷ shift step
6:       T ← T ∪ {(q₁, i−1, q₂, i)}
7:       D ← D ∪ {(q₁, i−1, q₂, i)}
8:     end for
9:   end for
Pseudocode
10: while D ≠ ∅ do
11:   pop (q_{m−1}, j_{m−1}, q_m, j_m) from D
12:   for each (q₀ q₁ ⋯ q_m, ε ↦ q₀ q) ∈ Δ_LR do  ▷ reduce step
13:     for each (q₀, j₀, q₁, j₁), (q₁, j₁, q₂, j₂), …, (q_{m−2}, j_{m−2}, q_{m−1}, j_{m−1}) ∈ T do
14:       if (q₀, j₀, q, j_m) ∉ T then
15:         T ← T ∪ {(q₀, j₀, q, j_m)}
16:         D ← D ∪ {(q₀, j₀, q, j_m)}
17:       end if
18:     end for
19:   end for
Pseudocode
20: for each (q_in q₁ ⋯ q_m, ε ↦ q_fin) ∈ Δ_LR do  ▷ reduce step
21:   for each (q_in, 0, q_in, 0), (q_in, 0, q₁, j₁), …, (q_{m−2}, j_{m−2}, q_{m−1}, j_{m−1}) ∈ T do
22:     if (q_in, 0, q_fin, j_m) ∉ T then
23:       T ← T ∪ {(q_in, 0, q_fin, j_m)}
24:       D ← D ∪ {(q_in, 0, q_fin, j_m)}
25:     end if
26:   end for
27: end for
28: end while
29: end for
30: accept iff (q_in, 0, q_fin, n) ∈ T
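Putting the pieces together, the tabular algorithm can be rendered as a recognizer in Python (a self-contained illustrative sketch, not the course's reference implementation; it repeats closure/goto from earlier snippets, represents q_fin by a sentinel string, and all names are assumptions):

```python
# Sketch of the GLR recognizer: items (q, j, q', i), shift steps at each
# input position, then reduce steps driven by an agenda D.

FIN = "q_fin"  # sentinel for the final stack symbol

def closure(items, prods):
    result, agenda = set(items), list(items)
    while agenda:
        (lhs, rhs, dot) = agenda.pop()
        if dot < len(rhs):
            for (l, r) in prods:
                if l == rhs[dot] and (l, r, 0) not in result:
                    result.add((l, r, 0))
                    agenda.append((l, r, 0))
    return frozenset(result)

def goto(q, x, prods):
    moved = {(l, r, d + 1) for (l, r, d) in q if d < len(r) and r[d] == x}
    return closure(moved, prods) if moved else None

def build(prods, start):
    symbols = {s for (_, r) in prods for s in r} | {l for (l, _) in prods}
    q_in = closure({(l, r, 0) for (l, r) in prods if l == start}, prods)
    states, table, agenda = {q_in}, {}, [q_in]
    while agenda:
        q = agenda.pop()
        for x in symbols:
            q2 = goto(q, x, prods)
            if q2:
                table[(q, x)] = q2
                if q2 not in states:
                    states.add(q2)
                    agenda.append(q2)
    return q_in, table

def chain_starts(T, item, m):
    """Left endpoints (q0, j0) of chains of m adjacent items ending in item."""
    if m == 1:
        return {(item[0], item[1])}
    return {p for it in T if it[2:] == item[:2]
              for p in chain_starts(T, it, m - 1)}

def glr_parse(w, prods, start="S"):
    q_in, table = build(prods, start)
    T = {(q_in, 0, q_in, 0)}
    for i in range(1, len(w) + 1):
        D = []
        for (q, j, q1, i1) in list(T):            # shift step
            q2 = table.get((q1, w[i - 1]))
            if i1 == i - 1 and q2 and (q1, i - 1, q2, i) not in T:
                T.add((q1, i - 1, q2, i))
                D.append((q1, i - 1, q2, i))
        while D:                                  # reduce steps
            top = D.pop()
            qm, jm = top[2], top[3]
            if qm == FIN:
                continue
            for (lhs, rhs, dot) in qm:
                if dot != len(rhs):
                    continue                      # only completed items A -> alpha .
                for (q0, j0) in chain_starts(T, top, len(rhs)):
                    new = []
                    q_next = table.get((q0, lhs))
                    if q_next:                    # ordinary reduce
                        new.append((q0, j0, q_next, jm))
                    if lhs == start and (q0, j0) == (q_in, 0):
                        new.append((q_in, 0, FIN, jm))  # special-case reduce
                    for it in new:
                        if it not in T:
                            T.add(it)
                            D.append(it)
    return (q_in, 0, FIN, len(w)) in T

P = [("S", ("S", "+", "S")), ("S", ("a",))]
```

On the running grammar, `glr_parse("a+a+a", P)` accepts even though the string is ambiguous, since both analyses contribute to the same item set T.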
Computational Complexity There are two different sources of complexity in the GLR algorithm The first is related to the worst case exponential size of the set of stack symbols Q LR ; see our previous analysis This effect can be alleviated by using optimization techniques that simplify item representation for certain rules
Computational Complexity The second source of complexity is related to the length of the input string; let m be the length of the longest rule right-hand side in the grammar. The reduce step involves m + 1 indices and can be instantiated O(|w|^{m+1}) times, resulting in exponential time complexity in the grammar size. Such exponential behavior can be avoided by factorizing each reduce transition into a sequence of pop transitions with only two stack symbols in the left-hand side
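One way to see the effect of this factorization (an illustrative sketch, not the course's exact construction): instead of matching all m antecedent items of a reduce in one (m+1)-ary step, the chain of stack links is followed one step at a time, so each individual step relates only two stack symbols:

```python
# Sketch: pred maps a stack link (q, j) to the set of links (q', j')
# lying immediately below it, i.e. pred[(q, j)] collects the (q', j')
# of all items (q', j', q, j) in the table T. A reduce popping m
# symbols then becomes m binary lookups.

def chain_starts(pred, q, j, m):
    frontier = {(q, j)}
    for _ in range(m):                       # m binary steps
        frontier = {below for node in frontier
                          for below in pred.get(node, ())}
    return frontier

# Hypothetical link structure mimicking the first computation above:
pred = {("q4", 3): {("q3", 2)},
        ("q3", 2): {("q2", 1)},
        ("q2", 1): {("q_in", 0)}}
# Following three links down from ("q4", 3) reaches ("q_in", 0).
```

Since each binary step mentions only three input positions, the instantiations per step drop from O(|w|^{m+1}) to O(|w|³), independently of m.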