CSC 173                                            Tues. Nov 5, 2002

Read AU, Ch 11
----------

------------------------------------------------------------------------

First Sets

To distinguish two productions with the same non-terminal on the left hand side, we examine the First sets for their corresponding right hand sides. We do this in 3 steps:

  (1) figure out which non-terminals can generate epsilon
  (2) figure out FIRST sets for all non-terminals
  (3) figure out FIRST sets for right-hand sides

Steps (1) and (2) start with "obvious" facts from the grammar and iterate until they can't learn any more.

Consider step (1). If we have

    A --> epsilon
    B --> epsilon

then clearly A and B are symbols that can generate epsilon. These are the "obvious" facts. Then in a second pass over the grammar, if we have

    C --> A B

we can deduce that C is a symbol that can generate epsilon. If we have

    D --> C A B

then in a third pass we can deduce that D is a symbol that can generate epsilon. We continue this process until we make a complete pass over the grammar without learning anything.

Now consider step (2). If we have

    A --> b C D
    B --> c D e

then clearly b is an element of FIRST(A) and c is an element of FIRST(B). These are obvious facts. Then in a second pass if we have

    C --> B A d

clearly c is an element of FIRST(C), because it's an element of FIRST(B) and a C can start with a B. But if B can also generate epsilon, then b is an element of FIRST(C) as well, because we can erase the B and generate the b from A. In each pass over the grammar we work our way through each RHS, adding elements to the FIRST set of the LHS, until we find a symbol in the RHS that cannot generate epsilon, at which point we move on to the next production. As in step (1) we keep making passes until we don't learn anything new.

Finally, in step (3) we use our knowledge of FIRST sets for individual symbols to calculate FIRST sets for RHSs. Given the production

    A --> X1...Xm

we must determine First(X1...Xm). We first consider the leftmost symbol, X1.
  * If this is a terminal symbol, then First(X1...Xm) = {X1}.
  * If X1 is a non-terminal, then we compute the First sets for each right hand side corresponding to X1.

In our expression grammar above:

    First(E)       = First(T Etail)
    First(T Etail) = First(T)
    First(T)       = First(F Ttail)
    First(F Ttail) = First(F) = {(, num}

If X1 can generate epsilon, then X1 can (in effect) be erased, and First(X1...Xm) also depends on X2.

  * If X2 is a terminal, it is included in First(X1...Xm).
  * If X2 is a non-terminal, we compute the First sets for each of its corresponding right hand sides.

Similarly, if both X1 and X2 can produce epsilon, we consider X3, then X4, etc.

It is possible that X1, X2, ..., Xm can *all* produce epsilon. What then? The informal answer is that we should predict A --> X1...Xm if the lookahead symbol can come *after* an A in some line of the derivation. A formal treatment of this subject requires the notion of so-called Follow sets for symbols.

In practice, we don't generally have to know about Follow sets when building a recursive-descent parser. Suppose we have three productions for A:

    A --> B c D
    A --> e f
    A --> G H

where G and H can both generate epsilon. Our parsing routine then says:

    A() {
        switch (next_token) {
            case First(BcD):
                B(); match(c); D();
                break;
            case e:
                match(e); match(f);
                break;
            default:
                G(); H();
        }
    }

If next_token is not in First(BcD) U {e}, we assume we can use the third production. If it turns out that next_token is not in First(GH) U Follow(A) either, then this was a bad decision, but nothing catastrophic happens: the calls to G and H will go ahead and generate epsilon, we'll return, and our caller will announce a syntax error -- just a bit later than we could have.

---------------------------------------

LL(1) Grammars for Recursive-Descent Parsing

(also known as "top-down" or "predictive" parsing)

Recursive-descent parsing can only parse those CFG's that have disjoint predict sets for productions that share a common left hand side. CFG's that obey this restriction are called LL(1).
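
Checking the LL(1) restriction requires the First sets built by the three steps described earlier. The following is only a rough sketch of those steps in Python, run on the expression grammar; the names GRAMMAR, nullable_symbols, first_sets and first_of_sequence are inventions of this sketch, not part of the notes, and an empty tuple is assumed to stand for an epsilon right-hand side.

```python
# Grammar as (LHS, RHS) pairs; an empty RHS tuple is an epsilon production.
# (This encoding is an assumption made for the sketch.)
GRAMMAR = [
    ("E", ("T", "Etail")),
    ("Etail", ("+", "T", "Etail")),
    ("Etail", ("-", "T", "Etail")),
    ("Etail", ()),
    ("T", ("F", "Ttail")),
    ("Ttail", ("*", "F", "Ttail")),
    ("Ttail", ("/", "F", "Ttail")),
    ("Ttail", ()),
    ("F", ("(", "E", ")")),
    ("F", ("num",)),
]


def nullable_symbols(grammar):
    """Step (1): find the non-terminals that can generate epsilon."""
    nullable = set()
    changed = True
    while changed:  # keep making passes until nothing new is learned
        changed = False
        for lhs, rhs in grammar:
            # An empty RHS is trivially "all nullable".
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
    return nullable


def first_of_sequence(symbols, first, nullable):
    """Step (3): First set of a string of symbols (epsilon itself excluded)."""
    result = set()
    for s in symbols:
        if s not in first:        # a terminal: it starts the string, stop here
            result.add(s)
            return result
        result |= first[s]        # a non-terminal: take what we know so far
        if s not in nullable:     # cannot be erased, so later symbols are hidden
            return result
    return result


def first_sets(grammar, nullable):
    """Step (2): First sets for all non-terminals, by repeated passes."""
    first = {lhs: set() for lhs, _ in grammar}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in grammar:
            new = first_of_sequence(rhs, first, nullable)
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return first


nullable = nullable_symbols(GRAMMAR)
first = first_sets(GRAMMAR, nullable)
```

With this encoding, first["E"] should come out as the set containing "(" and "num", matching the hand calculation of First(E) above. Follow sets are deliberately left out, so the sketch says nothing about when to predict an epsilon production.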
From experience we know that it is usually possible to create an LL(1) CFG for a programming language. However, not all CFG's are LL(1), and a CFG that is not LL(1) may be parsable using some other (usually more complex) parsing technique.

Two common properties of grammars that cause trouble for top-down parsing are:

  * Left recursion: any grammar containing productions with left recursion, that is, productions of the form A --> A X1...Xm, cannot be LL(1). The problem is that any symbol that predicts this production the first time will, of necessity, continue to predict this production forever (and never be matched).

  * Common prefix: any grammar containing two productions for the same non-terminal that share a common prefix on the right hand side cannot be LL(1). The problem is that any symbol that predicts the first production must also predict the second; since the predict sets for the two productions are not disjoint, the grammar is not LL(1).

---------------------------------------

Creating an LL(1) Grammar

Consider the following grammar for expressions:

    1. E --> E + T
    2. E --> E - T
    3. E --> T
    4. T --> T * F
    5. T --> T / F
    6. T --> F
    7. F --> ( E )
    8. F --> number

This grammar has left recursion, and therefore cannot be LL(1). We can replace the use of left recursion with right recursion as follows:

    1. E --> T + E
    2. E --> T - E
    3. E --> T
    4. T --> F * T
    5. T --> F / T
    6. T --> F
    7. F --> ( E )
    8. F --> number

The resulting grammar is still not LL(1); productions 1-3 share a common prefix, as do productions 4-6. We can eliminate the common prefix by deferring the decision as to which production to pick until after seeing the common prefix. This technique is called factoring the common prefix.

    1. E     --> T Etail
    2. Etail --> + T Etail | - T Etail | epsilon
    3. T     --> F Ttail
    4. Ttail --> * F Ttail | / F Ttail | epsilon
    5. F     --> ( E ) | number

And this is, of course, our top-down grammar for expressions.
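
To see why the factored grammar suits top-down parsing, here is a minimal recursive-descent recognizer for it, sketched in Python. It has one routine per non-terminal, and the epsilon alternatives for Etail and Ttail are simply the "otherwise" case, as in the default branch discussed earlier. The tokenizer, the Parser class and the accepts helper are all assumptions of this sketch (the notes do not define a scanner), and numbers are assumed to be unsigned integers.

```python
import re

# One token per match: an unsigned integer or a single operator/parenthesis.
TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Turn text into a list of token names, ending with 'eof'."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        lexeme = m.group(1)
        tokens.append("number" if lexeme.isdigit() else lexeme)
        pos = m.end()
    tokens.append("eof")
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    @property
    def next_token(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if self.next_token != expected:
            raise SyntaxError(f"expected {expected}, saw {self.next_token}")
        self.pos += 1

    def E(self):                      # E --> T Etail
        self.T()
        self.Etail()

    def Etail(self):                  # Etail --> + T Etail | - T Etail | epsilon
        if self.next_token in ("+", "-"):
            self.match(self.next_token)
            self.T()
            self.Etail()
        # otherwise predict epsilon and let the caller detect any error

    def T(self):                      # T --> F Ttail
        self.F()
        self.Ttail()

    def Ttail(self):                  # Ttail --> * F Ttail | / F Ttail | epsilon
        if self.next_token in ("*", "/"):
            self.match(self.next_token)
            self.F()
            self.Ttail()

    def F(self):                      # F --> ( E ) | number
        if self.next_token == "(":
            self.match("(")
            self.E()
            self.match(")")
        else:
            self.match("number")


def accepts(text):
    """True if text is an expression followed by end of input."""
    try:
        parser = Parser(text)
        parser.E()
        parser.match("eof")
    except SyntaxError:
        return False
    return True
```

For example, accepts("1 + 2 * (3 - 4)") holds while accepts("1 + * 2") does not. Each routine needs only next_token to choose its alternative, which is exactly what the left-recursive and unfactored versions of the grammar could not offer.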

WARNING: while it is possible to mechanically eliminate left recursion and common prefixes from a grammar, this is not guaranteed to make the result LL(1). Some languages just can't be parsed top-down. Here's a grammar for one:

    G --> a B b
    G --> a C c
    B --> a B b
    B --> epsilon
    C --> a C c
    C --> epsilon

The language consists of all strings of a's followed by an equal number of b's or by an equal number of c's. Left factoring doesn't solve the problem (try it!).

------------------------------------------------------------------------

Table-Driven Parsing

In recursive-descent parsing, the decision as to which production to choose for a particular non-terminal is hard-coded into the procedure for the non-terminal. The problem with recursive-descent parsing is that it is inflexible; changes in the grammar can cause significant (and in some cases non-obvious) changes to the parser.

Since recursive-descent parsing uses an implicit stack of procedure calls, it is possible to replace the parsing procedures and implicit stack with an explicit stack and a single parsing procedure that manipulates the stack. In this scheme, we encode the actions the parsing procedure should take in a table. This table can be generated automatically (with the grammar as input), which is why this approach adapts more easily to changes in the grammar. (BTW, we could automatically generate a recursive-descent parser, but that's no easier, and it's likely to be a little slower, so nobody bothers.)

Note the analogy to scanning: table-driven top-down parsers are to recursive descent parsers as table-driven scanners are to nested-switch-statement scanners.

---------------------------------------

A Table-Driven Parser

The parse table encodes the choice of production as a function of the current non-terminal of interest and the lookahead symbol.

    T: Non-terminals x Terminals -> Productions U {Error}

The entry T[A,x] gives the production number to choose when A is the non-terminal of interest and x is the current input symbol. The table is a mapping from non-terminals x terminals to productions.

    T[A,x] == A -> X1..Xm    if x in Predict(A->X1..Xm)
    T[A,x] == Error          otherwise

The driver procedure is very simple. It stacks symbols that are to be matched or expanded. Terminal symbols on the stack must match an input symbol; non-terminal symbols are expanded according to the Predict sets, which are embedded in the parse table.

The Predict set for a given production is basically the First set for its RHS. The possible exception arises for epsilon productions (including productions that can generate epsilon indirectly): for these we can, if we want, use Follow sets (mentioned but not described above) to avoid predicting an epsilon production when it's certain to lead to an error later on.

---------------------------------------

Parse Table for Expressions

Here is an LL(1) expression grammar, augmented to include the end marker:

     1. S     --> E eof
     2. E     --> T Etail
     3. Etail --> + T Etail
     4. Etail --> - T Etail
     5. Etail --> epsilon
     6. T     --> F Ttail
     7. Ttail --> * F Ttail
     8. Ttail --> / F Ttail
     9. Ttail --> epsilon
    10. F     --> ( E )
    11. F     --> number

The table for this expression grammar is as follows, where a blank entry corresponds to an error:

             (     )     +     -     *     /     number   eof
      -------------------------------------------------------
      S      1                                   1
      -------------------------------------------------------
      E      2                                   2
      -------------------------------------------------------
      Etail        5     3     4                          5
      -------------------------------------------------------
      T      6                                   6
      -------------------------------------------------------
      Ttail        9     9     9     7     8              9
      -------------------------------------------------------
      F      10                                  11

This table is constructed from the Predict sets described earlier. It's basically the same as the labels on the switch statements in the recursive descent parser.
The only difference is that the tool used to generate the table has used Follow sets to distinguish between cases where predicting an epsilon production is a good idea and cases where predicting an epsilon production is certain to lead to an error later on. In effect, the recursive descent parser shown above has a '5' in every blank entry in the Etail row, and a '9' in every blank entry in the Ttail row.

If we wanted we could use Follow sets to get earlier error detection in the recursive descent parser, too. The Etail and Ttail routines would then look like this:

    procedure Etail
        switch next_token
            case +
                match(+)
                T()
                Etail()
            case -
                match(-)
                T()
                Etail()
            case ), eof
                return
            default
                error()

    procedure Ttail
        switch next_token
            case *
                match(*)
                F()
                Ttail()
            case /
                match(/)
                F()
                Ttail()
            case ), eof, +, -
                return
            default
                error()

---------------------------------------

Driver Procedure

Under table-driven parsing, there is a single procedure that "interprets" the parse table. This "driver" procedure takes the following form:

    next_token : symbol
    PS : stack of symbol                        // explicit parsing stack
    PT : array[symbol, token] of production     // parse table

    procedure parse
        PS.push(S)
        next_token := scan()
        while not PS.empty() do
            top : symbol := PS.top()
            if top is a nonterminal then
                prod : production := PT[top, next_token]
                if prod > 0 then            // blank (error) entries are assumed to be stored as 0
                    PS.pop()
                    for each symbol on RHS of prod, from right to left, do
                        PS.push(symbol)     // leftmost RHS symbol ends up on top
                else
                    error()
            else if next_token == top then
                PS.pop()                    // match terminal symbol in input
                next_token := scan()
            else
                error()

Note that the RHS must be pushed right to left: the stack is last-in first-out, so the leftmost symbol of the RHS has to be pushed last in order to be matched or expanded first.
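
As an illustration only, the driver can be sketched in Python using the eleven numbered productions and the parse table for expressions. The dictionary encoding of the table (missing keys play the role of blank entries), the names PRODUCTIONS, PARSE_TABLE and parse, and the choice to take a ready-made token list instead of calling a scanner are all assumptions of this sketch.

```python
# Productions numbered as in the augmented grammar; () is an epsilon RHS.
PRODUCTIONS = {
    1: ("S", ("E", "eof")),
    2: ("E", ("T", "Etail")),
    3: ("Etail", ("+", "T", "Etail")),
    4: ("Etail", ("-", "T", "Etail")),
    5: ("Etail", ()),
    6: ("T", ("F", "Ttail")),
    7: ("Ttail", ("*", "F", "Ttail")),
    8: ("Ttail", ("/", "F", "Ttail")),
    9: ("Ttail", ()),
    10: ("F", ("(", "E", ")")),
    11: ("F", ("number",)),
}

# Parse table keyed by (non-terminal, lookahead); absent keys are errors.
PARSE_TABLE = {
    ("S", "("): 1, ("S", "number"): 1,
    ("E", "("): 2, ("E", "number"): 2,
    ("Etail", ")"): 5, ("Etail", "+"): 3, ("Etail", "-"): 4, ("Etail", "eof"): 5,
    ("T", "("): 6, ("T", "number"): 6,
    ("Ttail", ")"): 9, ("Ttail", "+"): 9, ("Ttail", "-"): 9,
    ("Ttail", "*"): 7, ("Ttail", "/"): 8, ("Ttail", "eof"): 9,
    ("F", "("): 10, ("F", "number"): 11,
}

NONTERMINALS = {lhs for lhs, _ in PRODUCTIONS.values()}


def parse(tokens):
    """Run the driver over a list of token names that ends with 'eof'.

    Returns the production numbers in the order they were predicted,
    or raises SyntaxError at the first error entry or mismatch.
    """
    stack = ["S"]            # explicit parsing stack; top is the last element
    pos = 0
    predicted = []
    while stack:
        top = stack[-1]
        next_token = tokens[pos] if pos < len(tokens) else "eof"
        if top in NONTERMINALS:
            prod = PARSE_TABLE.get((top, next_token))
            if prod is None:
                raise SyntaxError(f"no production for {top} on {next_token}")
            stack.pop()
            # Push the RHS right to left so its leftmost symbol is on top.
            stack.extend(reversed(PRODUCTIONS[prod][1]))
            predicted.append(prod)
        elif top == next_token:
            stack.pop()      # match terminal symbol in input
            pos += 1
        else:
            raise SyntaxError(f"expected {top}, saw {next_token}")
    return predicted
```

Calling parse(["number", "+", "number", "eof"]) traces a leftmost derivation and returns the predictions in order. Swapping in a different grammar means changing only the two dictionaries, which is the flexibility argument made for table-driven parsing above.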