An Efficient Context-Free Parsing Algorithm. Speakers: Morad Ankri Yaniv Elia

Outline Yaniv: Introduction, Terminology, Informal Explanation, The Recognizer. Morad: Example, Time and Space Bounds, Empirical Results, Practical Use.

Introduction: The Author Jay Earley, who published "An Efficient Context-Free Parsing Algorithm" in Communications of the ACM, 1970.

Introduction cont. A grammar: the rules governing the use of a language. Types of grammar: regular expressions, context-free, context-sensitive, recursively enumerable.

Introduction cont. The Chomsky hierarchy: Recursively Enumerable (any rules) ⊃ Context-Sensitive (rules like AB -> CD, can express a^n b^n c^n) ⊃ Context-Free (rules like A -> abc, can express a^n b^n) ⊃ Regular (rules like S -> aB, can express a*b*).

Introduction cont. Representing sentence structure: not just FSTs! Issue: recursion is potentially infinite: a + a + a + ... We capture constituent structure: basic units => terminals; subcategorization => non-terminals; hierarchy => parse tree.

Introduction cont. Context-free grammars (BNF grammars) allow a simple and precise description of sentences built up from smaller blocks. Why "context-free"? Non-terminals can be rewritten without regard to the context in which they occur. Parsing algorithms for these grammars play a large role in the implementation of compilers and interpreters (e.g. Yacc, Bison, JavaCC).

Introduction cont. Types of parsing algorithms: general algorithms handle all context-free grammars; restricted algorithms handle sub-classes of grammars and tend to be more efficient.

Introduction cont. Earley's algorithm compares favorably with other general parsing algorithms: it can parse all context-free languages, and it executes in cubic time, O(n^3), in the general case, O(n^2) for unambiguous grammars, and linear time for almost all LR(k) grammars. It performs particularly well when the rules are written left-recursively.

Terminology Language: a set of strings over a finite set of terminal symbols. Terminal symbols are represented by lowercase letters: a, b, c. Non-terminal symbols (syntactic classes) are represented by capital letters: A, B, C.

Terminology - cont. Strings of either terminals or non-terminals are represented by Greek letters: α, β, γ. The empty string is λ. α^k = αα...α (k times). |α| is the number of symbols in α.
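The string notation can be mirrored directly in code; a small illustrative sketch (Python, with a tuple of symbol names standing in for a string of grammar symbols):

```python
# α as a string of grammar symbols, here a Python tuple of symbol names
alpha = ("a", "B")          # the string aB (terminal a, non-terminal B)

# α^k = α repeated k times
print(alpha * 3)            # ('a', 'B', 'a', 'B', 'a', 'B')

# |α| = the number of symbols in α
print(len(alpha))           # 2

# λ, the empty string, is the empty sequence
lam = ()
print(len(lam))             # 0
```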

Terminology - cont. Productions (rewriting rules): a finite set of rules, represented as A -> α. The root of the grammar: a non-terminal which stands for "sentence". Alternatives: the productions with a particular non-terminal D on their left sides.

Terminology - cont. Example: T -> P, T -> T * P, P -> a. Root: T. Terminals: *, a. Non-terminals: T, P. Each line, e.g. T -> T * P, is a production rule; T -> P and T -> T * P are the alternatives for T.

Terminology - cont. Given a context-free grammar G: α => β iff there are γ, δ, η, A s.t. α = γAδ, β = γηδ, and A -> η is a production. α =*=> β (β is derived from α) iff there are strings α_0, α_1, ..., α_m s.t. α = α_0 => α_1 => ... => α_m = β. The sequence α_0, α_1, ..., α_m is called a derivation.
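A single derivation step α => β is just the rewriting of one non-terminal occurrence; a minimal sketch (Python; the function name and representation are ours, not the paper's):

```python
def derive_step(alpha, pos, production):
    """Apply production A -> eta at position pos of the sentential form alpha:
    alpha = gamma A delta  becomes  beta = gamma eta delta."""
    lhs, rhs = production
    assert alpha[pos] == lhs, "rewritten symbol must match the production's left side"
    return alpha[:pos] + rhs + alpha[pos + 1:]

# E => E + T => T + T, using productions E -> E + T and E -> T
form = ("E",)
form = derive_step(form, 0, ("E", ("E", "+", "T")))
print(form)   # ('E', '+', 'T')
form = derive_step(form, 0, ("E", ("T",)))
print(form)   # ('T', '+', 'T')
```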

Terminology - cont. Sentential form: a string α derived from the root of the grammar (R =*=> α). Sentence: a sentential form consisting entirely of terminals. Derivation tree (a.k.a. parse tree): a representation of a sentential form reflecting the steps made in deriving it.

Terminology - cont. Example, deriving a * a + a: E => E + T (by E -> E + T) => T + T (E -> T) => T + P (T -> P) => T * P + P (T -> T * P) => P * P + P (T -> P) => a * P + P (P -> a) => a * a + P (P -> a) => a * a + a (P -> a). [Parse tree for a * a + a.]

Terminology - cont. Note: the derivation is not unique for a given derivation tree! The same tree also arises from: E => E + T (E -> E + T) => E + P (T -> P) => T + P (E -> T) => T * P + P (T -> T * P) => P * P + P (T -> P) => a * P + P (P -> a) => a * a + P (P -> a) => a * a + a (P -> a). [Same parse tree as before.]

Terminology - cont. Note: the derivation is not unique for a given derivation tree! A parse tree represents the steps deriving it, but not their order. Both of the following derivations yield the same tree: (1) E => E + T => T + T => T + P => T * P + P => P * P + P => a * P + P => a * a + P => a * a + a; (2) E => E + T => E + P => T + P => T * P + P => P * P + P => a * P + P => a * a + P => a * a + a.

Terminology - cont. Degree of ambiguity: the number of distinct derivation trees of a sentence. Unambiguous sentence: a sentence whose degree of ambiguity is 1. Unambiguous grammar: a grammar containing only unambiguous sentences. Bounded ambiguity: a grammar with a bound b on the degree of ambiguity of its sentences.
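On small inputs the degree of ambiguity can be computed by brute force, counting distinct derivation trees over spans of the input. A sketch (Python; the deliberately ambiguous grammar E -> E + E | a and the helper names are ours for illustration):

```python
from functools import lru_cache

GRAMMAR = {"E": [("E", "+", "E"), ("a",)]}   # ambiguous: E -> E + E | a

def count_trees(sym, s):
    """Number of distinct derivation trees by which sym derives the string s."""
    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        if symbol not in GRAMMAR:                       # terminal symbol
            return 1 if (j == i + 1 and s[i] == symbol) else 0
        return sum(count_seq(rhs, i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def count_seq(rhs, i, j):
        if not rhs:
            return 1 if i == j else 0
        first, rest = rhs[0], rhs[1:]
        # the grammar has no ε-productions, so every remaining symbol
        # must consume at least one input symbol (also rules out the
        # infinite left recursion E -> E + E on an unshrinking span)
        return sum(count(first, i, k) * count_seq(rest, k, j)
                   for k in range(i + 1, j - len(rest) + 1))

    return count(sym, 0, len(s))

print(count_trees("E", "a+a"))      # 1: unambiguous sentence
print(count_trees("E", "a+a+a"))    # 2: (a+a)+a and a+(a+a)
```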

Terminology - cont. The recognizer: an algorithm which takes a string as input and accepts or rejects it depending on whether or not the string is a sentence of the grammar. The parser: a recognizer which also outputs the set of all legal derivation trees for the string.

Informal Explanation How does the recognizer work? It scans an input string X_1, X_2, ..., X_n from left to right, looking ahead some fixed number k of symbols. As each symbol X_i is scanned, a set of states S_i is constructed representing the condition of the recognition process at that point in the scan.

Informal Explanation Each state in the set represents: a production s.t. we are currently scanning a portion of the input string which is derived from its right side; a point in that production which shows how much of the production's right side we have recognized so far; a k-symbol string which is a syntactically allowed successor to that instance of the production; and a pointer back to the position in the input string at which we began to look for that instance of the production.

Informal Explanation Example: in grammar AE, with k = 1, S_0 starts as the single state Φ -> ·E & , & , 0 (Φ is a new non-terminal and & a new terminal; Φ -> E & is the production rule, the dot marks the point reached in it, "&" is the k-symbol look-ahead string (k = 1), and 0 is the pointer back to the input string position).

Informal Explanation The algorithm uses dynamic programming to do a parallel top-down search in (worst case) O(n^3) time. A single left-to-right pass fills out n + 1 state sets. Think of the state sets as sitting between words in the input string, keeping track of the states of the parse at these positions. For each word position, a set of states represents all partial parse trees generated to date; e.g. the state set S_0 contains all partial parse trees generated at the beginning of the sentence.

Informal Explanation How to recognize a sentence? When we go over a state in S_i, there are 3 cases: the dot is not at the end of the state and stands before a non-terminal symbol => Predictor; the dot is not at the end and stands before a terminal symbol => Scanner; the dot is at the end of the state => Completer.

Informal Explanation The predictor operation: if the dot is before a non-terminal symbol, it adds new states to the current state set, one for each alternative of that non-terminal in the grammar. Formally: S_j : A -> α·Bβ , l_1 , i gives S_j : B -> ·γ , l_2 , j for each alternative B -> γ (l_2 = the first k symbols of βl_1). Why? If an instance of B starts here, one of its alternatives must match the input next, followed by the first k symbols of whatever may follow B.

S_0: Φ -> ·E & , & , 0 | E -> ·E + T , & , 0 | E -> ·T , & , 0 | E -> ·E + T , + , 0 | E -> ·T , + , 0 | T -> ·a , & , 0 | T -> ·a , + , 0. Grammar: Φ -> E &, E -> E + T, E -> T, T -> a. Input string: a + a.

Informal Explanation The scanner operation: if the dot is before a terminal symbol, compare that symbol with X_{i+1}; if they match, add the state to the next state set, with the dot moved over one symbol. Formally: S_i : A -> α·aβ , l_1 , f gives S_{i+1} : A -> αa·β , l_1 , f when a = X_{i+1}. Why? The terminal matched the input, so this instance of the production has recognized one more symbol of its right side.

S_0: Φ -> ·E & , & , 0 | E -> ·E + T , & , 0 | E -> ·T , & , 0 | E -> ·E + T , + , 0 | E -> ·T , + , 0 | T -> ·a , & , 0 | T -> ·a , + , 0. S_1: T -> a· , & , 0 | T -> a· , + , 0. Grammar: Φ -> E &, E -> E + T, E -> T, T -> a. Input string: a + a.

Informal Explanation The completer: if the dot of a state is at the end of its production, it compares the look-ahead string with X_{i+1} ... X_{i+k}. If they match, it goes back to the state set S_f indicated by the pointer and adds all states from S_f which have the completed non-terminal to the right of the dot; for each of these states the dot is placed after this non-terminal.

S_0: as before. S_1: T -> a· , & , 0 | T -> a· , + , 0 | E -> T· , & , 0 | E -> T· , + , 0 | Φ -> E·& , & , 0 | E -> E·+ T , & , 0 | E -> E·+ T , + , 0.

Informal Explanation After going over all states in S_i, we move on to S_{i+1}. If the algorithm ever produces an S_{i+1} consisting of the single state Φ -> E &· , & , 0 then we have correctly scanned E and the & symbol and we are finished with the string, which means the input string is a sentence of the grammar!

The Recognizer A precise description of the recognizer. Given: input string X_1 ... X_n, grammar G. We arbitrarily number the productions 1 ... d-1, where each production p is of the form D_p -> C_{p1} ... C_{pm} (m = # of symbols in the alternative). We add a 0-th production: D_0 -> R & (R is the root of the grammar).

The Recognizer Definition: a state is a quadruple <p, j, f, α>: p is the number of the production rule (0 <= p <= d-1); j is the position of the dot in that production's right side (0 <= j <= m); f is the number of the state set that created this state (0 <= f <= n+1); α is the look-ahead string. A state set is an ordered set of states. A final state is one in which j = m. We add a state to a state set by putting it last in the ordered set (unless it is already a member).

The Recognizer Definition: H_k(γ) is the set of all k-symbol terminal strings which begin some string derived from γ: H_k(γ) = { α : α is terminal, |α| = k, and ∃β s.t. γ =*=> αβ }. It is used in forming the look-ahead strings for the states.
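H_k is essentially the familiar FIRST_k function; a fixpoint sketch for k = 1 over the ε-free grammar of the running example (Python; the function name h1 and the representation are ours):

```python
GRAMMAR = {
    "E": [("E", "+", "T"), ("T",)],
    "T": [("a",)],
}

def h1(gamma):
    """H_1(gamma): the 1-symbol terminal strings that begin some string
    derived from gamma. Since this grammar has no ε-productions, only
    the first symbol of gamma matters for k = 1."""
    first = {sym: set() for sym in GRAMMAR}
    changed = True
    while changed:                       # fixpoint iteration over the rules
        changed = False
        for lhs, alternatives in GRAMMAR.items():
            for rhs in alternatives:
                head = rhs[0]
                new = first[head] if head in GRAMMAR else {head}
                if not new <= first[lhs]:
                    first[lhs] |= new
                    changed = True
    head = gamma[0]
    return first[head] if head in GRAMMAR else {head}

print(h1(("E", "&")))   # {'a'}: every string derived from E begins with a
print(h1(("+", "T")))   # {'+'}
```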

The Recognizer The recognizer is a function of 3 arguments, REC(G, X_1 ... X_n, k), computed as follows: // initialization: Let X_{n+i} = & (for each 1 <= i <= k+1). Let S_i be empty (for each 0 <= i <= n+1). Add <0, 0, 0, &^k> to S_0.

The Recognizer For i <- 0 step 1 until n do begin: process the states of S_i in order, performing one of the following three operations on each state s = <p, j, f, α>:

The Recognizer (1) Predictor: If s is nonfinal and C_{p(j+1)} is a non-terminal, then for each q s.t. C_{p(j+1)} = D_q, and for each β ∈ H_k(C_{p(j+2)} ... C_{pm} α), add <q, 0, i, β> to S_i.

The Recognizer (2) Completer: If s is final and α = X_{i+1} ... X_{i+k}, then for each <q, l, g, β> ∈ S_f (after all states have been added to S_f) s.t. C_{q(l+1)} = D_p, add <q, l+1, g, β> to S_i.

The Recognizer (3) Scanner: If s is nonfinal and C_{p(j+1)} is terminal, then if C_{p(j+1)} = X_{i+1}, add <p, j+1, f, α> to S_{i+1}.

The Recognizer // rejection condition: If S_{i+1} is empty, return rejection. // acceptance condition: If i = n and S_{i+1} = {<0, 2, 0, &^k>}, return acceptance. End.
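The three operations above can be sketched as a compact recognizer. This version drops the look-ahead string (k = 0) for brevity, which the complexity discussion later notes does not change the worst-case bound; the names and state representation are ours, not the paper's:

```python
def earley_recognize(grammar, root, tokens):
    """Earley recognizer without look-ahead. A state is
    (lhs, rhs, dot, origin): production lhs -> rhs, dot position,
    and the index of the state set in which the instance began."""
    n = len(tokens)
    S = [dict() for _ in range(n + 1)]   # dicts as ordered sets S_0 .. S_n

    def add(i, state):
        S[i].setdefault(state, None)

    for rhs in grammar[root]:
        add(0, (root, rhs, 0, 0))

    for i in range(n + 1):
        worklist = list(S[i])
        pos = 0
        while pos < len(worklist):       # process states in order, even new ones
            lhs, rhs, dot, origin = worklist[pos]
            pos += 1
            if dot < len(rhs):
                nxt = rhs[dot]
                if nxt in grammar:                     # Predictor
                    for alt in grammar[nxt]:
                        st = (nxt, alt, 0, i)
                        if st not in S[i]:
                            add(i, st)
                            worklist.append(st)
                elif i < n and tokens[i] == nxt:       # Scanner
                    add(i + 1, (lhs, rhs, dot + 1, origin))
            else:                                      # Completer
                for plhs, prhs, pdot, porig in list(S[origin]):
                    if pdot < len(prhs) and prhs[pdot] == lhs:
                        st = (plhs, prhs, pdot + 1, porig)
                        if st not in S[i]:
                            add(i, st)
                            worklist.append(st)
    return any(lhs == root and dot == len(rhs) and origin == 0
               for (lhs, rhs, dot, origin) in S[n])

GRAMMAR = {"E": [("E", "+", "T"), ("T",)], "T": [("a",)]}
print(earley_recognize(GRAMMAR, "E", ["a", "+", "a"]))   # True
print(earley_recognize(GRAMMAR, "E", ["a", "+"]))        # False
```

Note that the left-recursive rule E -> E + T poses no problem: prediction at position i adds each production instance to S_i only once.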

The Recognizer Notes: The ordering imposed on state sets is not important to their meaning; it is simply a device which allows their members to be processed correctly by the algorithm. i cannot become greater than n without either rejection or acceptance occurring. The & symbol appears only in production zero.

Outline revisited Yaniv: Introduction, Terminology, Informal Explanation, The Recognizer. Morad: Example, Time and Space Bounds, Empirical Results, Practical Use.

Grammar: Terminals: {a, +} Non-terminals: {E, T} Productions: E -> E + T, E -> T, T -> a Root: E Look-ahead: k = 1 Input string: a + a

Building S_0 (input a + a, padded with &):
Put the initial state in S_0: Φ -> ·E & , & , 0
Predictor (dot before E): add E -> ·E + T , & , 0
Predictor: add E -> ·T , & , 0
Predictor: add E -> ·E + T , + , 0
Predictor: add E -> ·T , + , 0
Predictor (dot before T): add T -> ·a , & , 0
Predictor: E -> ·E + T and E -> ·T with look-ahead + would be re-predicted, but the states already exist
Predictor: add T -> ·a , + , 0

Scanning X_1 = a, building S_1:
Scanner: add T -> a· , & , 0
Scanner: add T -> a· , + , 0
Completer on T -> a· , & , 0: look-ahead (&) is not equal to X_2 (+), do nothing
Completer on T -> a· , + , 0: look-ahead is equal; add all states from S_0 with the dot before T, dot moved: E -> T· , & , 0 and E -> T· , + , 0
Completer on E -> T· , & , 0: look-ahead is not equal, do nothing
Completer on E -> T· , + , 0: look-ahead is equal; add all states from S_0 with the dot before E: Φ -> E·& , & , 0, E -> E·+ T , & , 0, E -> E·+ T , + , 0
Predictor: nothing to do
Scanner on Φ -> E·& : symbol & is not equal to X_2, do nothing

Scanning X_2 = +, building S_2:
Scanner: add E -> E +·T , & , 0
Scanner: add E -> E +·T , + , 0
Predictor (dot before T): add T -> ·a , & , 2
Predictor: add T -> ·a , + , 2

Scanning X_3 = a, building S_3:
Scanner: add T -> a· , & , 2
Scanner: add T -> a· , + , 2
Completer on T -> a· , & , 2: look-ahead (&) is equal to X_4 (&); add all states from S_2 with the dot before T: E -> E + T· , & , 0 and E -> E + T· , + , 0
Completer on T -> a· , + , 2: look-ahead is not equal, do nothing
Completer on E -> E + T· , & , 0: look-ahead is equal; add all states from S_0 with the dot before E: Φ -> E·& , & , 0, E -> E·+ T , & , 0, E -> E·+ T , + , 0
Completer on E -> E + T· , + , 0: look-ahead is not equal, do nothing
Scanner on E -> E·+ T: symbol + is not equal to X_4, do nothing
Scanner on Φ -> E·& : symbol & is equal to X_4; building S_4: add Φ -> E &· , & , 0

S_4 consists of the single state Φ -> E &· , & , 0: we have reached the final state, so the string belongs to the grammar.

Time and Space Bounds In general the running time of the algorithm is O(n^3). A state is <p, j, f, α>: p is the number of the production rule; j is the position in the production rule; f is the number of the state set that created this state; α is the look-ahead. The number of states in any state set S_i is O(i): p, j and α are bounded by properties of the grammar, and f is bounded by i.

Time and Space Bounds cont. The scanner and predictor operations each execute a bounded number of steps per state in any state set, so the total time for processing the states of S_i with the scanner and predictor is O(i). The completer executes O(i) steps for each state it processes in the worst case, because it may have to add O(f) states from S_f, the state set pointed back to. So it takes O(i^2) steps in S_i. Summing over all of the state sets gives us O(n^3) steps. This bound holds even if the look-ahead is not used.
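Summing the per-state-set work gives the bounds just stated:

```latex
\underbrace{\sum_{i=0}^{n} O(i^2)}_{\text{general case}} \;=\; O(n^3),
\qquad
\underbrace{\sum_{i=0}^{n} O(i)}_{\text{unambiguous case}} \;=\; O(n^2).
```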

Time and Space Bounds cont. Only the completer is O(i^2). In what cases does the completer need only O(i) steps? After the completer has been applied to state set S_i there are O(i) states in it. So unless some of the states were added in more than one way, it took the completer only O(i) steps to complete its operation.

Time and Space Bounds cont. In case the grammar is unambiguous and reduced, we can show that each such state gets added in only one way. Assume that a state D_q -> C_{q1} ... C_{q(j+1)}· ... C_{qm} , α , f is added to S_i in two different ways by the completer. Then we have two final states in S_i: D_{p1} -> A_{p1,1} ... A_{p1,m1}· , X_{i+1} ... X_{i+k} , f_1 and D_{p2} -> A_{p2,1} ... A_{p2,m2}· , X_{i+1} ... X_{i+k} , f_2 with C_{q(j+1)} = D_{p1} = D_{p2} and (p_1 ≠ p_2 or f_1 ≠ f_2).

Time and Space Bounds cont. That means the state D_q -> C_{q1} ... ·C_{q(j+1)} ... C_{qm} , α , f appears in both state sets S_{f1} and S_{f2}, so we have two derivations: R =*=> X_1 ... X_f D_q β => X_1 ... X_f C_{q1} ... C_{q(j+1)} ... C_{qm} β =*=> X_1 ... X_{f1} A_{p1,1} ... A_{p1,m1} β_1 =*=> X_1 ... X_i β_1 and R =*=> X_1 ... X_f D_q β => X_1 ... X_f C_{q1} ... C_{q(j+1)} ... C_{qm} β =*=> X_1 ... X_{f2} A_{p2,1} ... A_{p2,m2} β_2 =*=> X_1 ... X_i β_2

Time and Space Bounds cont. Since p_1 ≠ p_2 or f_1 ≠ f_2, the derivations of X_1 ... X_i are represented by different derivation trees, and therefore there is an ambiguous sentence X_1 ... X_i α for some α. So if the grammar is unambiguous, the completer executes O(i) steps per state set and the time is bounded by O(n^2). This running time also holds for grammars with bounded ambiguity.

Time and Space Bounds cont. For LR(k) grammars the running time is O(n). Space: the algorithm uses O(n) state sets, each containing O(n) states, therefore the space bound is O(n^2) in general.

Empirical Results The algorithm was tested against other context-free parsing algorithms, and its running time was similar to or better than theirs. It was also as good as specialized algorithms that run fast but only on specific types of grammars (like Knuth's algorithm, which works only on LR(k) grammars, in O(n)).

Practical Use Changing the recognizer into a parser: each time the completer adds a state E -> αD·β , g, construct a pointer from that instance of D to the state D -> γ· , f which caused the completer to do the operation. Following these pointers afterwards yields the derivation trees.

Running the example a + a again with pointer construction, S_4 again consists of the single state Φ -> E &· , & , 0, and the pointers spell out the parse tree: Φ -> E &; E -> E + T; the left E -> T -> a; the right T -> a. We've reached the final state: the string belongs to the grammar.

Practical Use cont. The algorithm can also handle context-free grammars that make use of the Kleene star notation: A -> (B C)* D. Any state of the form A -> α·(β)*γ , f or A -> α(β·)*γ , f is replaced by the two states A -> α(·β)*γ , f and A -> α(β)*·γ , f (the dot may re-enter the starred group or skip past it).

Thank you