1 Syntax Analysis Part I Chapter 4 COP5621 Compiler Construction Copyright Robert van ngelen, Flora State University, 2007 Position of a Parser in the Compiler Model 2 Source Program Lexical Analyzer Lexical error oken, tokenval Get next token Parser and rest of front-end Syntax error Semantic error Intermediate representation Symbol able 3 he Parser A parser implements a C-F grammar he role of the parser is twofold: 1. o check syntax (= string recognizer) And to report syntax errors accurately 2. o invoke semantic actions For static semantics checking, e.g. type checking of expressions, functions, etc. For syntax-directed translation of the source code to an intermediate representation 1
4 Syntax-Directed ranslation One of the major roles of the parser is to produce an intermediate representation (IR) of the source program using syntax-directed translation methods Possible IR output: Abstract syntax trees (ASs) Control-flow graphs (CFGs) with triples, three-address code, or register transfer list notation WHIRL (SGI Pro64 compiler) has 5 IR levels! 5 rror Handling A good compiler should assist in entifying and locating errors Lexical errors: important, compiler can easily recover and continue Syntax errors: most important for compiler, can almost always recover Static semantic errors: important, can sometimes recover Dynamic semantic errors: hard or impossible to detect at compile time, runtime checks are required Logical errors: hard or impossible to detect 6 Prefix Viable-Prefix Property he viable-prefix property of LL/LR parsers allows early detection of syntax errors Goal: detection of an error as soon as possible without further consuming unnecessary input How: detect an error as soon as the prefix of the input does not match a prefix of any string in the language rror is detected here for (;) Prefix rror is detected here DO 10 I = 1;0 2
7 rror Recovery Strategies Panic mode Discard input until a token in a set of designated ronizing tokens is found Phrase-level recovery Perform local correction on the input to repair the error rror productions Augment grammar with productions for erroneous constructs Global correction Choose a minimal sequence of changes to obtain a global least-cost correction 8 Grammars (Recap) Context-free grammar is a 4-tuple G = (N,, P, S) where is a finite set of tokens (terminal symbols) N is a finite set of nonterminals P is a finite set of productions of the form α β where α (N )* N (N )* and β (N )* S N is a designated start symbol 9 Notational Conventions Used erminals a,b,c, specific terminals: 0, 1,, Nonterminals A,B,C, N specific nonterminals: expr, term, stmt Grammar symbols X,Y,Z (N ) Strings of terminals u,v,w,x,y,z * Strings of grammar symbols α,β,γ (N )* 3
10 Derivations (Recap) he one-step derivation is defined by α A β α γ β where A γ is a production in the grammar In addition, we define is leftmost lm if α does not contain a nonterminal is rightmost rm if β does not contain a nonterminal ransitive closure * (zero or more steps) Positive closure (one or more steps) he language generated by G is defined by L(G) = {w * S w} 11 Derivation (xample) Grammar G = ({}, {,*,(,),-,}, P, ) with productions P = * ( ) - xample derivations: - - rm rm rm * * * Chomsky Hierarchy: Language Classification A grammar G is sa to be Regular if it is right linear where each production is of the form A w B or A w or left linear where each production is of the form A B w or A w Context free if each production is of the form A α where A N and α (N )* Context sensitive if each production is of the form α A β α γ β where A N, α,γ,β (N )*, γ > 0 Unrestricted 12 4
13 Chomsky Hierarchy L(regular) L(context free) L(context sensitive) L(unrestricted) Where L() = { L(G) G is of type } hat is: the set of all languages generated by grammars G of type xamples: very finite language is regular! (construct a FSA for strings in L(G)) L 1 = { a n b n n 1 } is context free L 2 = { a n b n c n n 1 } is context sensitive 14 Parsing Universal (any C-F grammar) Cocke-Younger-Kasimi arley op-down (C-F grammar with restrictions) Recursive descent (predictive parsing) LL (Left-to-right, Leftmost derivation) methods Bottom-up (C-F grammar with restrictions) Operator precedence parsing LR (Left-to-right, Rightmost derivation) methods SLR, canonical LR, LALR 15 op-down Parsing LL methods (Left-to-right, Leftmost derivation) and recursive-descent parsing Grammar: ( ) - Leftmost derivation: lm lm lm 5
16 Left Recursion (Recap) Productions of the form A A α β γ are left recursive When one of the productions in a grammar is left recursive then a predictive parser loops forever on certain inputs General Left Recursion limination Method 17 Arrange the nonterminals in some order A 1, A 2,, A n for i = 1,, n do for j = 1,, i-1 do replace each A i A j γ with A i δ 1 γ δ 2 γ δ k γ where A j δ 1 δ 2 δ k enddo eliminate the immediate left recursion in A i enddo Immediate Left-Recursion limination Method 18 Rewrite every left-recursive production A A α β γ A δ into a right-recursive production: A β A R γ A R A R α A R δ A R ε 6
xample Left Recursion lim. A B C a B C A A b C A B C C a Choose arrangement: A, B, C 19 i = 1: i = 2, j = 1: i = 3, j = 1: i = 3, j = 2: nothing to do B C A A b B C A B C b a b (imm) B C A B R a b B R B R C b B R ε C A B C C a C B C B a B C C a C B C B a B C C a C C A B R C B a b B R C B a B C C a (imm) C a b B R C B C R a B C R a C R C R A B R C B C R C C R ε 20 Left Factoring When a nonterminal has two or more productions whose right-hand ses start with the same grammar symbols, the grammar is not LL(1) and cannot be used for predictive parsing Replace productions A α β 1 α β 2 α β n γ with A α A R γ A R β 1 β 2 β n 21 Predictive Parsing liminate left recursion from grammar Left factor the grammar Compute FIRS and FOLLOW wo variants: Recursive (recursive calls) Non-recursive (table-driven) 7
22 FIRS (Revisited) FIRS(α) = { the set of terminals that begin all strings derived from α } FIRS(a) = {a} if a FIRS(ε) = {ε} FIRS(A) = A α FIRS(α) for A α P FIRS(X 1 X 2 X k ) = if for all j = 1,, i-1 : ε FIRS(X j ) then add non-ε in FIRS(X i ) to FIRS(X 1 X 2 X k ) if for all j = 1,, k : ε FIRS(X j ) then add ε to FIRS(X 1 X 2 X k ) FOLLOW FOLLOW(A) = { the set of terminals that can immediately follow nonterminal A } FOLLOW(A) = for all (B α A β) P do add FIRS(β)\{ε} to FOLLOW(A) for all (B α A β) P and ε FIRS(β) do add FOLLOW(B) to FOLLOW(A) for all (B α A) P do add FOLLOW(B) to FOLLOW(A) if A is the start symbol S then add to FOLLOW(A) 23 LL(1) Grammar A grammar G is LL(1) if it is not left recursive and for each collection of productions A α 1 α 2 α n for nonterminal A the following holds: 1. FIRS(α i ) FIRS(α j ) = for all i j 2. if α i * ε then 2.a. α j * ε for all i j 2.b. FIRS(α j ) FOLLOW(A) = for all i j 24 8
25 Non-LL(1) xamples Grammar S S a a S a S a S a R ε R S ε S a R a R S ε Not LL(1) because: Left recursive FIRS(a S) FIRS(a) For R: S * ε and ε * ε For R: FIRS(S) FOLLOW(R) Recursive Descent Parsing (Recap) 26 Grammar must be LL(1) very nonterminal has one (recursive) procedure responsible for parsing the nonterminal s syntactic category of input tokens When a nonterminal has multiple productions, each production is implemented in a branch of a selection statement based on input look-ahead information Using FIRS and FOLLOW to Write a Recursive Descent Parser 27 expr term rest rest term rest - term rest ε term procedure rest(); begin if lookahead in FIRS( term rest) then match( ); term(); rest() else if lookahead in FIRS(- term rest) then match( - ); term(); rest() else if lookahead in FOLLOW(rest) then return else error() end; FIRS( term rest) = { } FIRS(- term rest) = { - } FOLLOW(rest) = { } 9
Non-Recursive Predictive Parsing: able-driven Parsing Given an LL(1) grammar G = (N,, P, S) construct a table M[A,a] for A N, a and use a driver program with a stack 28 input a b stack X Y Z Predictive parsing program (driver) Parsing table M output Constructing an LL(1) Predictive Parsing able 29 for each production A α do for each a FIRS(α) do add A α to M[A,a] enddo if ε FIRS(α) then for each b FOLLOW(A) do add A α to M[A,b] enddo endif enddo Mark each undefined entry in M error xample able R R ε R * F R ε F ( ) A α F ( ) FIRS(α) ε ( FOLLOW(A) ( ) R R ) ε ) ( ) R * F R * ) ) * ) * ) 30 * ( ) R R R R R * F R F F ( ) 10
Ambiguous grammar S i t S S R a S R e S ε b rror: duplicate table entry S S R a S a LL(1) Grammars are Unambiguous b b e S R ε S R e S A α S i t S S R S a S R e S S R ε b i S i t S S R FIRS(α) i a e ε b 31 FOLLOW(A) e e e e t t S R ε Predictive Parsing Program (Driver) push() push(s) a := lookahead repeat X := pop() if X is a terminal or X = then match(x) // moves to next token and a := lookahead else if M[X,a] = X Y 1 Y 2 Y k then push(y k, Y k-1,, Y 2, Y 1 ) // such that Y 1 is on top else endif until X = invoke actions and/or produce IR output error() 32 xample able-driven Parsing 33 Stack R R R F R R R R R R R R R F R R R R R R F* R R F R R R R R Input * * * * * * * * * * * * Production applied R R R * F R 11
34 Panic Mode Recovery Add ronizing actions to undefined entries based on FOLLOW Pro: Can be automated Cons: rror messages are needed FOLLOW() = { ) } FOLLOW( R ) = { ) } FOLLOW() = { ) } FOLLOW( R ) = { ) } FOLLOW(F) = { * ) } R R F R R * R * F R ( F ( ) ) : the driver pops current nonterminal A and skips input till token or skips input until one of FIRS(A) is found Phrase-Level Recovery 35 Change input stream by inserting missing tokens For example: is changed into * Pro: Can be automated Cons: Recovery not always intuitive Can then continue here R R F insert * R R * R * F R ( F ( ) ) insert *: driver inserts missing * and retries the production 36 rror Productions R R ε R * F R ε F ( ) Add error production : R F R to ignore missing *, e.g.: Pro: Powerful recovery method Cons: Cannot be automated R R F R F R R R * R * F R ( F ( ) ) 12