Top-Down Parsing
Compiler Principles, PS4

Parsing problem definition:
The general parsing problem is: given a set of grammar rules and an input stream (in our case, a stream of Scheme tokens), how do we find the parse tree of the input stream? In other words, parsing consists of finding a derivation of a particular sentence using a given grammar.

In top-down parsing we start with the sentence symbol (start symbol) and "generate" the sentence. A top-down parser tries to find the leftmost derivation of a given sentence. It can be viewed as an attempt to construct a parse tree starting from the start symbol.
Leftmost derivation = at every step, apply a grammar rule to the leftmost non-terminal.

Recursive Descent Parser:
For each non-terminal, we define a function that checks whether the input contains a right-hand side of that non-terminal. For each terminal, we have a function that checks that the input contains this terminal.
- Example: α → aβ  =>  bool func_α() { return ( token(a) && func_β() ); }
- Example: α → β | γ  =>  bool func_α() { return ( func_β() || func_γ() ); }

The problems:
1. Long run-time with OR: we don't know which alternative to choose. So we check the first option (which might be a long check), find out it is false, and only then go back to check the second option.
2. We are unable to predict which rule to choose, and unable to undo a choice we have made. Examples:
   - α → a | β, β → ab. We will never be able to choose the second rule of α, because if the input is ab then the first rule is checked first and succeeds.
   - S → Aab, A → a | ε. We can't recognize ab.
   - S → Aab, A → ε | a. We can't recognize aab.
3. We are unable to handle left recursion: left-recursive grammars send the parser into an endless loop. Example: S → Sab | ε.

Predictive Parser
A grammar is said to be left-recursive if there is a derivation of the form A → Ax (in one or more steps), i.e. A derives a string that starts with A itself. The parser might get into an infinite loop while trying to parse some input!
A grammar is ambiguous if there exists a string that can be derived from the start symbol using 2 different parse trees.
Lookahead is the number of input symbols required in order to identify the correct derivation. This number is determined by the given grammar. We'll assume our grammars have lookahead = 1.
Grammar ambiguity: given the first token in the input stream, there might be several matching rules. If the choice isn't lucky, the parsing won't be successful, although a parse tree for the sentence exists!
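To make problem 2 of the recursive descent parser concrete, here is a minimal Python sketch for the grammar S → Aab, A → a | ε from the examples. The class name NaiveParser and the helper accepts are hypothetical names chosen for this illustration; only the func_ naming follows the notes. It is a sketch of the naive scheme, not a recommended parser.

```python
class NaiveParser:
    """Recursive descent without lookahead for S -> A a b, A -> a | ε."""

    def __init__(self, text):
        self.text = text
        self.pos = 0  # index of the next unread input symbol

    def token(self, ch):
        # Terminal function: consume ch if it is the next input symbol.
        if self.text.startswith(ch, self.pos):
            self.pos += 1
            return True
        return False

    def func_A(self):
        # A -> a | ε: try the first alternative, otherwise fall back to ε.
        # Once this returns True, the choice is never revisited.
        return self.token("a") or True

    def func_S(self):
        # S -> A a b
        return self.func_A() and self.token("a") and self.token("b")


def accepts(text):
    """True if the naive parser consumes the whole input successfully."""
    parser = NaiveParser(text)
    return parser.func_S() and parser.pos == len(text)
```

With this sketch, accepts("aab") is True, but accepts("ab") is False even though ab belongs to the language (via A → ε): func_A greedily consumes the a, and the parser has no way to undo that choice.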
In order to perform top-down parsing we must make sure that the grammar is neither left-recursive nor ambiguous.

LL(1)
LL(1) is the set of all grammars for which the input is read from Left to right, a Leftmost derivation is constructed, and 1 input symbol is enough to determine which derivation rule to use (i.e. lookahead = 1). If a lookahead of 1 is not enough, the grammar is not LL(1). For top-down parsing we need an unambiguous LL(1) grammar.

Definitions:
- Lookahead symbol: the current input symbol, or the end-of-input marker ($, for example).
- Starter symbol (aka First) of a given non-terminal: any symbol which may appear at the start of a string generated by this non-terminal.
- Follower symbol (aka Follow) of a given non-terminal: any symbol that can appear immediately after the non-terminal.

Computing First sets:
A First set is computed for every non-terminal and for every right-hand side of a grammar rule. Every First set is initialized to the empty set. The rules below are applied repeatedly until no set changes.
1. For each rule N → α: add First(α) to First(N).
2. For each right-hand side α = AB: add First(A) to First(α).
3. For each right-hand side α = AB where A is nullable: add First(B) to First(α).
4. First(aβ) includes a, when a is a terminal.

Computing Follow sets:
A Follow set is computed for every non-terminal. Every Follow set is initialized to the empty set, except for the start symbol: Follow(S) = {$}. Again, the rules are applied repeatedly until no set changes.
1. For each rule M → αNβ: add First(β) to Follow(N).
2. For each rule M → αNβ where β is nullable: add Follow(M) to Follow(N).
Example:
Input grammar:
  S → AB | PQx
  A → xy | m
  B → bC
  C → bC | ε
  P → pP | ε
  Q → qQ | ε

Let's compute the First set for each rule:

  Rule       First
  S → AB     {x, m}
  S → PQx    {p, q, x}
  A → xy     {x}
  A → m      {m}
  B → bC     {b}
  C → bC     {b}
  C → ε      {}
  P → pP     {p}
  P → ε      {}
  Q → qQ     {q}
  Q → ε      {}

Let's compute the Follow set for each non-terminal (the middle column is the union of the First sets of all its rules):

  Non-terminal   U(First)        Follow
  A              {x, m}          {b}
  B              {b}             {$}
  C              {b}             {$}
  P              {p}             {q, x}
  Q              {q}             {x}
  S              {x, m, p, q}    {$}
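The First and Follow computations for this example can be reproduced with a short fixed-point iteration. The following Python sketch is one possible way to do it, not the course's reference implementation. All names (GRAMMAR, first_of_seq, compute_first, compute_follow) are hypothetical, and it assumes a common convention that differs slightly from the tables: nullability is recorded by putting ε inside the First set, so ε has to be removed before comparing with the tables.

```python
EPS, END = "ε", "$"

# Example grammar: keys are non-terminals, [] is an ε right-hand side.
GRAMMAR = {
    "S": [["A", "B"], ["P", "Q", "x"]],
    "A": [["x", "y"], ["m"]],
    "B": [["b", "C"]],
    "C": [["b", "C"], []],
    "P": [["p", "P"], []],
    "Q": [["q", "Q"], []],
}

def first_of_seq(seq, first):
    """First set of a symbol sequence; contains EPS if the sequence is nullable."""
    out = set()
    for sym in seq:
        sym_first = first.get(sym, {sym})  # a terminal starts only with itself
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)  # every symbol was nullable (or the sequence is empty)
    return out

def compute_first(grammar):
    first = {n: set() for n in grammar}
    changed = True
    while changed:  # repeat until no First set grows
        changed = False
        for n, alternatives in grammar.items():
            for alt in alternatives:
                new = first_of_seq(alt, first)
                if not new <= first[n]:
                    first[n] |= new
                    changed = True
    return first

def compute_follow(grammar, first, start="S"):
    follow = {n: set() for n in grammar}
    follow[start].add(END)
    changed = True
    while changed:  # repeat until no Follow set grows
        changed = False
        for m, alternatives in grammar.items():
            for alt in alternatives:
                for i, sym in enumerate(alt):
                    if sym not in grammar:
                        continue  # Follow is only defined for non-terminals
                    rest = first_of_seq(alt[i + 1:], first)
                    new = rest - {EPS}          # rule 1: First(β)
                    if EPS in rest:
                        new |= follow[m]        # rule 2: β is nullable
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

FIRST = compute_first(GRAMMAR)
FOLLOW = compute_follow(GRAMMAR, FIRST)
```

After running it, FOLLOW["P"] is {"q", "x"} and FIRST["S"] is {"x", "m", "p", "q"}, in agreement with the tables.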
Now we can calculate the Director Symbol Set (DSS) for each rule N → α:
  DSS(N → α) = First(α) ∪ ( Follow(N) if α is nullable )

  DSS(S → AB)  = {x, m}                  (AB is not nullable)
  DSS(S → PQx) = {p, q, x}               (PQx is not nullable)
  DSS(A → xy)  = {x}                     (xy is not nullable)
  DSS(A → m)   = {m}                     (m is not nullable)
  DSS(B → bC)  = {b}                     (bC is not nullable)
  DSS(C → bC)  = {b}                     (bC is not nullable)
  DSS(C → ε)   = {} ∪ Follow(C) = {$}    (ε is nullable)
  DSS(P → pP)  = {p}                     (pP is not nullable)
  DSS(P → ε)   = {} ∪ Follow(P) = {q, x} (ε is nullable)
  DSS(Q → qQ)  = {q}                     (qQ is not nullable)
  DSS(Q → ε)   = {} ∪ Follow(Q) = {x}    (ε is nullable)

Corresponding Predictive Parsing Table (PPT):
For each rule N → α, we put this rule in each cell (N, x) such that x ∈ DSS(N → α). Rows are lookahead symbols, columns are non-terminals, and - marks an empty cell.

       S                 A        B        C        P        Q
  b    -                 -        B → bC   C → bC   -        -
  p    S → PQx           -        -        -        P → pP   -
  q    S → PQx           -        -        -        P → ε    Q → qQ
  x    S → AB, S → PQx   A → xy   -        -        P → ε    Q → ε
  y    -                 -        -        -        -        -
  m    S → AB            A → m    -        -        -        -
  $    -                 -        -        C → ε    -        -

Parsing "mbbb" is easy: S → AB, A → m, B → bC, C → bC, C → bC, C → ε. But parsing "x" is a problem: cell (S, x) contains two rules, so one lookahead symbol does not tell the parser which rule to use!

The grammar is not LL(1) if some cell of the PPT contains 2 entries, which happens when there is a conflict:
- First/First conflict: for each non-terminal N, the First sets of all its alternatives must be disjoint. (Example of a conflict: S → α | β, where First(α) ∩ First(β) ≠ Ø.)
- First/Follow conflict: for each non-terminal N that has a nullable alternative, Follow(N) must be disjoint from the First set of each alternative of N. Example: S → Aa, A → a | ε. A has a nullable alternative (ε), and Follow(A) ∩ First(A) = {a} ≠ Ø.
- Multiple nullable alternatives: for each non-terminal N, there can be at most one alternative which is nullable.
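Filling the table and checking for cells with two entries is mechanical. The Python sketch below assumes the DSS values are already known and simply copies them from the example; the names DSS, build_ppt and conflicts are hypothetical, and a rule is written as a pair (non-terminal, right-hand side) with the empty string standing for ε.

```python
# DSS values copied from the example; "" stands for the ε right-hand side.
DSS = {
    ("S", "AB"): {"x", "m"},
    ("S", "PQx"): {"p", "q", "x"},
    ("A", "xy"): {"x"},
    ("A", "m"): {"m"},
    ("B", "bC"): {"b"},
    ("C", "bC"): {"b"},
    ("C", ""): {"$"},
    ("P", "pP"): {"p"},
    ("P", ""): {"q", "x"},
    ("Q", "qQ"): {"q"},
    ("Q", ""): {"x"},
}

def build_ppt(dss):
    """Map each cell (non-terminal, lookahead) to the right-hand sides placed in it."""
    table = {}
    for (lhs, rhs), symbols in dss.items():
        for sym in symbols:
            table.setdefault((lhs, sym), []).append(rhs)
    return table

def conflicts(table):
    """Cells holding more than one rule; a non-empty result means not LL(1)."""
    return {cell: sorted(rules) for cell, rules in table.items() if len(rules) > 1}

PPT = build_ppt(DSS)
print(conflicts(PPT))  # {('S', 'x'): ['AB', 'PQx']}
```

The only conflicting cell is (S, x), which is exactly the double entry in the table.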
Now we build a Push-Down Automaton from the PPT:
Init: push S onto the stack.
At each step, we pop the top of the stack and make one of two moves:
- Prediction move: when pop() = non-terminal N, we look at PPT[N, t], where t is the lookahead token (the next token in the input). If PPT[N, t] is empty, we report a syntax error. If PPT[N, t] = some rule N → α, we push α. Notice that if α = α1 α2 then we push α2 first and then α1, so that α1 ends up on top of the stack.
- Match move: when pop() = terminal x, we check that the next token in the input is x. If it is, we remove x from the input; otherwise we report a syntax error.
Termination is achieved when the stack is empty and the input stream is empty. Otherwise there is a syntax error.

This only recognizes sentences of the grammar. To construct a parse tree: a predict move creates new nodes for everything it pushes onto the stack (their parent is the node that was popped), and a match move inserts the token into the node it popped.
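The automaton can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a definitive implementation: the names NONTERMINALS, PPT and parse are hypothetical, every token is a single character, and the conflicting cell (S, x) of the example table is simply left out, so inputs starting with x are rejected. Instead of building a tree, the sketch records the rules used by the prediction moves.

```python
NONTERMINALS = set("SABCPQ")

# Predictive parsing table of the example, keyed by (non-terminal, lookahead);
# "" is the ε right-hand side. Assumption of this sketch: the conflicting
# cell (S, x) is omitted, so input starting with x is a syntax error.
PPT = {
    ("S", "m"): "AB", ("S", "p"): "PQx", ("S", "q"): "PQx",
    ("A", "x"): "xy", ("A", "m"): "m",
    ("B", "b"): "bC",
    ("C", "b"): "bC", ("C", "$"): "",
    ("P", "p"): "pP", ("P", "q"): "", ("P", "x"): "",
    ("Q", "q"): "qQ", ("Q", "x"): "",
}

def parse(text, start="S"):
    """Return the rules used, in leftmost-derivation order, or raise SyntaxError."""
    tokens = list(text) + ["$"]  # "$" marks the end of the input
    pos = 0
    stack = [start]
    derivation = []
    while stack:
        top = stack.pop()
        lookahead = tokens[pos]
        if top in NONTERMINALS:
            # Prediction move: consult the table and push the right-hand side.
            rhs = PPT.get((top, lookahead))
            if rhs is None:
                raise SyntaxError(f"no rule for ({top}, {lookahead})")
            derivation.append((top, rhs))
            stack.extend(reversed(rhs))  # push right to left, first symbol on top
        elif top == lookahead:
            pos += 1  # match move: consume the token
        else:
            raise SyntaxError(f"expected {top}, found {lookahead}")
    if tokens[pos] != "$":
        raise SyntaxError("input left over after the stack emptied")
    return derivation
```

For example, parse("mbbb") returns the rule sequence S → AB, A → m, B → bC, C → bC, C → bC, C → ε as a list of (non-terminal, right-hand side) pairs, while parse("mbp") raises SyntaxError because the cell (C, p) is empty.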