Everything You Always Wanted to Know About Parsing


Everything You Always Wanted to Know About Parsing
Part V: LR Parsing
University of Padua, Italy
ESSLLI, August 2013

Introduction
Parsing strategies are classified by the time at which the associated PDA commits to a production A → β in the analysis: Top-Down, Bottom-Up, and Left-Corner.

Introduction
LR parsing was discovered in 1965 in an article by Donald Knuth, California Institute of Technology, Pasadena, California. "L" stands for left-to-right parsing and "R" stands for reversed rightmost derivation. It works in linear time for deterministic context-free languages. The algorithm consists of a shift-reduce, deterministic PDA called the LR PDA, realizing a bottom-up parsing strategy. It is widely used for parsing programming languages, where LR PDAs are mechanically generated from a formal grammar.

Introduction
Generalized LR parsing (GLR for short) extends the LR method to handle nondeterministic and ambiguous CFGs. It was first described in 1985 by Masaru Tomita in his PhD dissertation at Carnegie Mellon University, and was very popular in computational linguistics in the late eighties.

Introduction
The GLR algorithm consists of the LR PDA, as in Knuth's original proposal, together with a technique for simulating the LR PDA based on a data structure called the graph-structured stack. We disregard the graph-structured stack and instead apply our tabulation algorithm to simulate the LR PDA. The outcome is an optimized version of the original GLR parsing algorithm.

Stack Symbols
Let P be a set of CFG productions; a stack symbol of the LR PDA is a non-empty subset of
I_P = {A → α_1 • α_2 | (A → α_1 α_2) ∈ P}
satisfying some additional property, to be introduced later. Informally, a stack symbol represents a disjunction of dotted rules; in this way several alternative parsing analyses can be carried out simultaneously.

Stack Symbols
Example: the stack symbol q = {A → BC • γ, A′ → C • γ′} represents two alternative analyses; observe how the strings preceding the dots are suffixes of a common string.
[figure: two partial parse trees over the input a_1 … a_i, one analysis with B spanning a_1 … a_j and C spanning a_{j+1} … a_i, the other with C spanning a_{j′+1} … a_i]

Stack Symbols
We need a function closure() to precompute the closure of the prediction step from the Earley algorithm. Let q ⊆ I_P; closure(q) is the smallest set such that:
q ⊆ closure(q); and
if (A → α • Bα′) ∈ closure(q) and (B → β) ∈ P, then (B → • β) ∈ closure(q)

Stack Symbols
Example: if P consists of the productions S → S + S and S → a, we have
closure({S → S + • S}) = {S → S + • S, S → • S + S, S → • a}
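As an illustration, here is a minimal Python sketch of closure(); the item encoding (lhs, rhs, dot) and all function and variable names are choices made here, not part of the original slides.

```python
# Sketch of closure(): a dotted rule A -> alpha . beta is encoded as
# (lhs, rhs, dot), with rhs a tuple of symbols and dot the bullet position.
def closure(q, productions):
    result = set(q)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(result):
            if dot < len(rhs):                     # some symbol B follows the dot
                b = rhs[dot]
                for (l, r) in productions:
                    if l == b and (l, r, 0) not in result:
                        result.add((l, r, 0))      # predict: add B -> . beta
                        changed = True
    return frozenset(result)

# Grammar from the example: S -> S + S and S -> a
P = [("S", ("S", "+", "S")), ("S", ("a",))]
q = closure({("S", ("S", "+", "S"), 2)}, P)        # closure({S -> S + . S})
```

Here q contains exactly the three dotted rules listed in the example above.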

Stack Symbols
We need a function goto() to precompute the scanner and the completer steps from the Earley algorithm, followed by closure(). Let q ⊆ I_P and X ∈ N ∪ Σ; we have
goto(q, X) = closure({A → αX • α′ | (A → α • Xα′) ∈ q})

Stack Symbols
Example (cont'd): if P consists of the productions S → S + S and S → a, we have
goto({S → S + • S, S → • S + S, S → • a}, a) = {S → a •}
goto({S → S + • S, S → • S + S, S → • a}, S) = {S → S + S •, S → S • + S}
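The goto() computation above can be sketched in the same encoding; again this is an illustrative rendering, with all names chosen here.

```python
# Sketch of goto(): advance the dot over X, then take the closure.
# Items are encoded as (lhs, rhs, dot), as in the closure() sketch.
def closure(q, productions):
    result = set(q)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(result):
            if dot < len(rhs):
                for (l, r) in productions:
                    if l == rhs[dot] and (l, r, 0) not in result:
                        result.add((l, r, 0))
                        changed = True
    return frozenset(result)

def goto(q, x, productions):
    moved = {(lhs, rhs, dot + 1)                   # scanner/completer: move dot over x
             for (lhs, rhs, dot) in q
             if dot < len(rhs) and rhs[dot] == x}
    return closure(moved, productions)

P = [("S", ("S", "+", "S")), ("S", ("a",))]
q = closure({("S", ("S", "+", "S"), 2)}, P)        # {S -> S + . S, S -> . S + S, S -> . a}
g_a = goto(q, "a", P)                              # {S -> a .}
g_S = goto(q, "S", P)                              # {S -> S + S ., S -> S . + S}
```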

Stack Symbols
The initial stack symbol of the LR PDA is q_in = closure({S → • α | (S → α) ∈ P}). The final stack symbol of the LR PDA is a special symbol q_fin.

Stack Symbols
The stack alphabet Q_LR of the LR PDA is the smallest set satisfying:
q_in ∈ Q_LR; q_fin ∈ Q_LR; and
if q ∈ Q_LR and goto(q, X) ≠ ∅ for some X ∈ N ∪ Σ, then goto(q, X) ∈ Q_LR

Stack Symbols
Example (cont'd): P consists of the productions S → S + S and S → a. The goto function, also called the LR table, excluding q_fin:
q_in = {S → • S + S, S → • a}
q_1 = goto(q_in, a) = goto(q_3, a) = {S → a •}
q_2 = goto(q_in, S) = {S → S • + S}
q_3 = goto(q_2, +) = goto(q_4, +) = {S → S + • S, S → • S + S, S → • a}
q_4 = goto(q_3, S) = {S → S + S •, S → S • + S}
By construction, in each state the strings preceding the dots are suffixes of a common string.
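The construction of Q_LR (excluding q_fin) can be sketched as a worklist fixpoint over goto(); this is an illustrative Python rendering under the same (lhs, rhs, dot) item encoding chosen earlier.

```python
# Sketch: build the LR states (Q_LR without q_fin) by a worklist fixpoint,
# starting from q_in = closure({S -> . alpha}) and closing under goto().
def closure(q, productions):
    result = set(q)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(result):
            if dot < len(rhs):
                for (l, r) in productions:
                    if l == rhs[dot] and (l, r, 0) not in result:
                        result.add((l, r, 0))
                        changed = True
    return frozenset(result)

def goto(q, x, productions):
    moved = {(lhs, rhs, dot + 1) for (lhs, rhs, dot) in q
             if dot < len(rhs) and rhs[dot] == x}
    return closure(moved, productions)

def lr_states(productions, start="S"):
    symbols = ({s for (_, rhs) in productions for s in rhs}
               | {l for (l, _) in productions})
    q_in = closure({(l, r, 0) for (l, r) in productions if l == start}, productions)
    states, todo = {q_in}, [q_in]
    while todo:
        q = todo.pop()
        for x in symbols:
            q2 = goto(q, x, productions)
            if q2 and q2 not in states:            # nonempty goto(q, X): a new state
                states.add(q2)
                todo.append(q2)
    return states

P = [("S", ("S", "+", "S")), ("S", ("a",))]
states = lr_states(P)                              # q_in and q_1 ... q_4: five states
```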

LR PDA
The LR PDA M_LR is constructed from G as follows. The input alphabet of M_LR is the terminal alphabet of G. The stack alphabet of M_LR is Q_LR, as previously defined.

LR PDA
The transition set of M_LR contains transitions having one of the following forms:
shift: q_1 a → q_1 q_2, for each q_1, q_2 and a with q_2 = goto(q_1, a)
reduce: q_0 q_1 … q_m ε → q_0 q, for each q_0, q_1, …, q_m, q and A → α with (A → α •) ∈ q_m, |α| = m, and q = goto(q_0, A)
reduce (special case): q_in q_1 … q_m ε → q_fin, for each q_1, …, q_m and S → α with (S → α •) ∈ q_m and |α| = m

LR PDA
Each stack symbol of M_LR records the set of rules that are compatible with the sequence of nonterminals recognized so far. Thus alternative analyses within a stack symbol match suffixes of the sequence of nonterminals recognized so far; we call this the suffix property. The implemented strategy is bottom-up: the commitment to a production occurs as late as possible, after all of the production's right-hand side symbols have been recognized.

Computational Complexity
A production A → α has exactly |Aα| dotted items; therefore |I_P| = |G|. It is not difficult to construct a worst-case set P such that
|Q_LR| = O(exp(|G|))
|M_LR| = O(exp(|G|))

M_LR Computation
Example (cont'd): input string w = a + a + a
[figure: the two parse trees of w, corresponding to the analyses (a + a) + a and a + (a + a)]

M_LR Computation
Example (cont'd): input string w = a + a + a, first computation
[figure: stack snapshots for input positions 0-2; after shifting a, the transition reduce S → a applies]

M_LR Computation
Example (cont'd): input string w = a + a + a, first computation
[figure: stack snapshots for input positions 3-4; reduce S → a is followed by reduce S → S + S]

M_LR Computation
Example (cont'd): input string w = a + a + a, first computation
[figure: stack snapshots for input position 5; reduce S → a and reduce S → S + S lead to the final stack symbol q_fin]

M_LR Computation
Example (cont'd): input string w = a + a + a, second computation
[figure: stack snapshots for input positions 0-2; after shifting a, the transition reduce S → a applies]

M_LR Computation
Example (cont'd): input string w = a + a + a, second computation
[figure: stack snapshots for input positions 3-5; the stack grows further before reduce S → a applies]

M_LR Computation
Example (cont'd): input string w = a + a + a, second computation
[figure: stack snapshots; reduce S → a and two applications of reduce S → S + S lead to the final stack symbol q_fin]

M_E and M_LR Comparison
Example: Earley PDA on sample rule A_0 → A_1 A_2 A_3 (simplified notation)
[figure: stack tops at positions i_1 … i_4: [• A_1A_2A_3], [A_1 • A_2A_3], [A_1A_2 • A_3], [A_1A_2A_3 •]]

M_E and M_LR
Example (cont'd): LR PDA on sample rule A_0 → A_1 A_2 A_3 (simplified notation)
[figure: stacks at positions i_1 … i_4; the stack symbols {• A_1A_2A_3}, {A_1 • A_2A_3}, {A_1A_2 • A_3}, {A_1A_2A_3 •} accumulate, so the stack grows with each recognized symbol]

M_E and M_LR
Example (cont'd): M_E tabulation on sample rule A_0 → A_1 A_2 A_3
[figure: items A_1 • A_2A_3, A_1A_2 • A_3, A_1A_2A_3 • spanning the positions i_1 A_1 i_2 A_2 i_3 A_3 i_4]
No need to reduce.

M_E and M_LR
Example (cont'd): M_LR tabulation on sample rule A_0 → A_1 A_2 A_3
[figure: items • A_1A_2A_3, A_1 • A_2A_3, A_1A_2 • A_3, A_1A_2A_3 • spanning the positions i_1 A_1 i_2 A_2 i_3 A_3 i_4, followed by a reduce step]
Single-step reduction should be avoided; this issue will be addressed later.

Remarks
M_LR is deterministic only for so-called deterministic context-free languages. For general CFGs, M_LR is strictly nondeterministic, with an exponential proliferation of computations in the length of the input string. But this need not concern us: the only purpose of M_LR is to specify a parsing strategy; nondeterminism can be simulated using (an extension of) our deterministic tabulation algorithm.

GLR Algorithm
We derive the GLR algorithm by tabulation of the LR PDA. Items in the GLR algorithm have the form (q, j, q′, i). In this case we cannot drop the first item component, as we did for the CKY and Earley algorithms.

Deduction Rules
We overview the GLR tabular algorithm using inference rules. M_LR starts with q_in in the stack. This corresponds to the init step of the tabular algorithm:
⊢ (q_in, 0, q_in, 0)

Deduction Rules
The shift transition q_1 a → q_1 q_2 can be implemented as:
(q, j, q_1, i−1) ⊢ (q_1, i−1, q_2, i), provided a = a_i

Deduction Rules
The reduce transition q_0 q_1 … q_{m−1} q_m ε → q_0 q′ is in a form more general than the reduce transition available for our tabulation algorithm. This reduction can be implemented as:
(q_0, j_0, q_1, j_1), (q_1, j_1, q_2, j_2), …, (q_{m−1}, j_{m−1}, q_m, j_m) ⊢ (q_0, j_0, q′, j_m)

Deduction Rules
The deduction rule for the reduce transition q_0 q_1 … q_{m−1} q_m ε → q_0 q′ can be graphically represented as:
[figure: the premise items (q_0, q_1), (q_1, q_2), …, (q_{m−1}, q_m) form a chain of edges from position j_0 through j_1, j_2, … to j_m, and the conclusion item (q_0, q′) spans the whole chain]

GLR Pseudocode
Algorithm 5 GLR Algorithm (from Tabulation of LR PDA)
1: T ← {(q_in, 0, q_in, 0)}
2: for i ← 1, …, n do   (edges with right end at i)
3:   D ← ∅
4:   for each (q, j, q_1, i−1) ∈ T do
5:     for each shift transition (q_1 a_i → q_1 q_2) do   (shift step)
6:       T ← T ∪ {(q_1, i−1, q_2, i)}
7:       D ← D ∪ {(q_1, i−1, q_2, i)}
8:     end for
9:   end for

Pseudocode
10:   while D ≠ ∅ do
11:     pop (q_{m−1}, j_{m−1}, q_m, j_m) from D
12:     for each reduce transition (q_0 q_1 … q_m ε → q_0 q′) do   (reduce step)
13:       for each (q_0, j_0, q_1, j_1), (q_1, j_1, q_2, j_2), …, (q_{m−2}, j_{m−2}, q_{m−1}, j_{m−1}) ∈ T do
14:         if (q_0, j_0, q′, j_m) ∉ T then
15:           T ← T ∪ {(q_0, j_0, q′, j_m)}
16:           D ← D ∪ {(q_0, j_0, q′, j_m)}
17:         end if
18:       end for
19:     end for

Pseudocode
20:     for each reduce transition (q_in q_1 … q_m ε → q_fin) do   (reduce step, special case)
21:       for each (q_in, 0, q_in, 0), (q_in, 0, q_1, j_1), …, (q_{m−2}, j_{m−2}, q_{m−1}, j_{m−1}) ∈ T do
22:         if (q_in, 0, q_fin, j_m) ∉ T then
23:           T ← T ∪ {(q_in, 0, q_fin, j_m)}
24:           D ← D ∪ {(q_in, 0, q_fin, j_m)}
25:         end if
26:       end for
27:     end for
28:   end while
29: end for
30: accept iff (q_in, 0, q_fin, n) ∈ T
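The tabulation above can be rendered as a compact executable sketch. This is an illustrative implementation under assumptions made here: the grammar has no ε-productions, items are encoded as (lhs, rhs, dot), edges as 4-tuples, and the plain string "q_fin" marks the final stack symbol; the chain of reduce premises is found by walking backwards over edges in T.

```python
# Sketch of the GLR tabular recognizer (Algorithm 5 above), for grammars
# without epsilon-productions. Edges of the tabulation are (q, j, q', i).
def closure(q, productions):
    result = set(q)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(result):
            if dot < len(rhs):
                for (l, r) in productions:
                    if l == rhs[dot] and (l, r, 0) not in result:
                        result.add((l, r, 0))
                        changed = True
    return frozenset(result)

def goto(q, x, productions):
    moved = {(lhs, rhs, dot + 1) for (lhs, rhs, dot) in q
             if dot < len(rhs) and rhs[dot] == x}
    return closure(moved, productions)

def glr_recognize(w, productions, start="S"):
    FIN = "q_fin"                                  # marker for the final stack symbol
    q_in = closure({(l, r, 0) for (l, r) in productions if l == start}, productions)
    T = {(q_in, 0, q_in, 0)}                       # init step
    for i in range(1, len(w) + 1):
        D = []
        for (q, j, q1, i1) in list(T):             # shift step
            if i1 == i - 1 and q1 != FIN:
                q2 = goto(q1, w[i - 1], productions)
                if q2:
                    e = (q1, i - 1, q2, i)
                    T.add(e)
                    D.append(e)
        while D:                                   # reduce steps
            e = D.pop()
            qm, jm = e[2], e[3]
            if qm == FIN:
                continue
            for (lhs, rhs, dot) in qm:
                if dot != len(rhs):
                    continue                       # only complete items A -> alpha .
                # walk back over |alpha| edges to collect chain starts (q0, j0)
                starts = {(e[0], e[1])}
                for _ in range(len(rhs) - 1):
                    starts = {(q0, j0) for (q0, j0, qa, ja) in T
                              if (qa, ja) in starts}
                for (q0, j0) in starts:
                    targets = []
                    q_new = goto(q0, lhs, productions)
                    if q_new:
                        targets.append((q0, j0, q_new, jm))
                    if lhs == start and (q0, j0) == (q_in, 0):
                        targets.append((q_in, 0, FIN, jm))   # special-case reduce
                    for e2 in targets:
                        if e2 not in T:
                            T.add(e2)
                            D.append(e2)
    return (q_in, 0, FIN, len(w)) in T

P = [("S", ("S", "+", "S")), ("S", ("a",))]
```

On the running example, glr_recognize(list("a+a+a"), P) accepts even though the grammar is ambiguous.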

Computational Complexity
There are two different sources of complexity in the GLR algorithm. The first is related to the worst-case exponential size of the set of stack symbols Q_LR; see our previous analysis. This effect can be alleviated by using optimization techniques that simplify the item representation for certain rules.

Computational Complexity
The second source of complexity is related to the length of the input string; let m be the length of the longest rule in the grammar. The reduce step involves m + 1 indices and can be instantiated O(|w|^{m+1}) times, resulting in exponential time complexity in the grammar size. Such exponential behavior can be avoided by factorizing each reduce transition into a sequence of pop transitions with only two stack symbols in the left-hand side.
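As an illustrative sketch of this factorization (the intermediate symbols p_1, …, p_{m−1} are fresh names introduced here, not defined in the slides), a reduce transition q_0 q_1 … q_m ε → q_0 q′ can be decomposed into binary pop transitions:

q_{m−1} q_m ε → p_{m−1}
q_{m−2} p_{m−1} ε → p_{m−2}
…
q_1 p_2 ε → p_1
q_0 p_1 ε → q_0 q′

Each binary step combines two adjacent tabulation items and thus involves only three input positions, so the number of rule instantiations drops from O(|w|^{m+1}) to O(|w|^3).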