CSC 173
Tues. Nov 5, 2002

Read AU, Ch 11

------------------------------------------------------------------------

First Sets

To distinguish two productions with the same non-terminal on the left
hand side, we examine the First sets for their corresponding right hand
sides.

We do this in three steps:

(1) figure out which non-terminals can generate epsilon
(2) figure out FIRST sets for all non-terminals
(3) figure out FIRST sets for right-hand sides

Steps (1) and (2) start with "obvious" facts from the grammar and
iterate until they can't learn any more.  Consider step (1).  If
we have
    A --> epsilon
    B --> epsilon
then clearly A and B are symbols that can generate epsilon.  These are
the "obvious" facts.  Then in a second pass over the grammar, if we have
    C --> A B
we can deduce that C is a symbol that can generate epsilon.  If we have
    D --> C A B
then in a third pass we can deduce that D is a symbol that can generate
epsilon.  We continue this process until we make a complete pass over
the grammar without learning anything.
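
Here is a minimal Python sketch of step (1).  It assumes a grammar is
stored as a list of (LHS, RHS) pairs, with an empty tuple standing for
epsilon; that representation, the name nullable_symbols, and the extra
production for E are illustrative choices, not part of the notes.
Because it records each fact as soon as it is found, it may need fewer
passes than the hand trace above, but it stops under the same
condition: a full pass that learns nothing.

```python
# Grammar as (lhs, rhs) pairs; an empty rhs tuple stands for epsilon.
GRAMMAR = [
    ("A", ()),
    ("B", ()),
    ("C", ("A", "B")),
    ("D", ("C", "A", "B")),
    ("E", ("D", "x")),      # hypothetical: x is a terminal, so E is not nullable
]

def nullable_symbols(grammar):
    """Return the set of non-terminals that can generate epsilon."""
    nullable = set()
    changed = True
    while changed:                      # each iteration is one pass over the grammar
        changed = False
        for lhs, rhs in grammar:
            # all() over an empty rhs is True, which covers A --> epsilon.
            if lhs not in nullable and all(sym in nullable for sym in rhs):
                nullable.add(lhs)
                changed = True
    return nullable

print(sorted(nullable_symbols(GRAMMAR)))   # ['A', 'B', 'C', 'D']
```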

Now consider step (2).  If we have
    A --> b C D
    B --> c D e
then clearly b is an element of FIRST(A) and c is an element of FIRST(B).
These are obvious facts.  Then in a second pass if we have
    C --> B A d
clearly c is an element of FIRST(C), because it's an element of FIRST(B)
and a C can start with a B.  But if B can also generate epsilon (via
some other production), then b is also an element of FIRST(C), because
we can erase the B and generate the b from A.  In each pass over the
grammar we work our way through each
RHS, adding elements to the FIRST set of the LHS, until we find a
symbol in the RHS that cannot generate epsilon, at which point we
move on to the next production.  As in step (1) we keep making passes
until we don't learn anything new.
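
A minimal Python sketch of step (2), under stated assumptions: the
grammar is a list of (LHS, RHS) pairs, any symbol that never appears on
a left-hand side is treated as a terminal, and two productions not in
the example above (B --> epsilon and D --> f) are added purely so that
the example is complete.  The function names are illustrative.

```python
GRAMMAR = [
    ("A", ("b", "C", "D")),
    ("B", ("c", "D", "e")),
    ("B", ()),                  # assumption: B can also generate epsilon
    ("C", ("B", "A", "d")),
    ("D", ("f",)),              # assumption: hypothetical production for D
]

def compute_nullable(grammar):
    """Step (1): non-terminals that can generate epsilon."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in grammar:
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
    return nullable

def compute_first(grammar):
    """Step (2): FIRST sets for all non-terminals."""
    nullable = compute_nullable(grammar)
    first = {lhs: set() for lhs, _ in grammar}    # one entry per non-terminal
    changed = True
    while changed:                                # repeat until a pass learns nothing
        changed = False
        for lhs, rhs in grammar:
            for sym in rhs:
                # A terminal contributes itself; a non-terminal its FIRST set so far.
                new = first[sym] if sym in first else {sym}
                if not new <= first[lhs]:
                    first[lhs] |= new
                    changed = True
                if sym not in nullable:
                    break                         # later symbols cannot come first
    return first

FIRST = compute_first(GRAMMAR)
for nt in sorted(FIRST):
    print(nt, sorted(FIRST[nt]))
```

With the assumed epsilon production for B, FIRST(C) comes out as
{b, c}, matching the argument above.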

Finally, in step (3) we use our knowledge of FIRST sets for individual
symbols to calculate FIRST sets for RHSs.  Given the production
A --> X1...Xm we must determine First(X1...Xm).

We first consider the leftmost symbol, X1.

   * If this is a terminal symbol, then First(X1...Xm) = {X1}.
   * If X1 is a non-terminal, then we compute the First sets for each
     right hand side corresponding to X1.

In our expression grammar above:

    First(E) = First(T Etail)
    First(T Etail) = First(T)
    First(T) = First(F Ttail)
    First(F Ttail) = First(F) = {(,num}

If X1 can generate epsilon, then X1 can (in effect) be erased, and
First(X1...Xm) depends on X2.

   * If X2 is a terminal, it is included in First(X1...Xm).
   * If X2 is a non-terminal, we compute the First sets for each of its
     corresponding right hand sides.

Similarly, if both X1 and X2 can produce epsilon, we consider X3, then
X4, etc.
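
The left-to-right scan just described can be sketched in Python as
follows.  This assumes steps (1) and (2) have already been done; the
FIRST and NULLABLE values below are filled in by hand for the factored
expression grammar, and the function name first_of_sequence is an
illustrative choice.

```python
# Results of steps (1) and (2) for the expression grammar, entered by hand.
FIRST = {
    "E": {"(", "num"}, "T": {"(", "num"}, "F": {"(", "num"},
    "Etail": {"+", "-"}, "Ttail": {"*", "/"},
}
NULLABLE = {"Etail", "Ttail"}

def first_of_sequence(symbols, first=FIRST, nullable=NULLABLE):
    """Return (First(X1...Xm), True if X1...Xm can all generate epsilon)."""
    result = set()
    for sym in symbols:
        if sym not in first:            # terminal: it ends the scan
            result.add(sym)
            return result, False
        result |= first[sym]            # non-terminal: add its FIRST set
        if sym not in nullable:
            return result, False        # cannot be erased, so stop here
    return result, True                 # every symbol could be erased

print(first_of_sequence(["T", "Etail"]))
print(first_of_sequence(["Etail", ")"]))
print(first_of_sequence(["Ttail", "Etail"]))
```

The second element of the result flags the case discussed next, where
the whole right-hand side can vanish.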

It is possible that X1, X2, ..., Xm can *all* produce epsilon.  What
then?  The informal answer is that we should predict A --> X1...Xm if
the lookahead symbol can come *after* an A in some line of the
derivation.  A formal treatment of this subject requires the notion of
so-called Follow sets for symbols.  In practice, we don't generally have
to know about Follow sets when building a recursive-descent parser.
Suppose we have three productions for A:

    A --> B c D
    A --> e f
    A --> G H

where G and H can both generate epsilon.  Our parsing routine then says:

    A() {
        switch (next_token) {
            case First(BcD):    // any token in First(B c D)
                B()
                match(c)
                D()
                break
            case e:
                match(e)
                match(f)
                break
            default:            // assume A --> G H
                G()
                H()
        }
    }

If next_token is not in First(BcD) U {e}, we assume we can use the third
production.  If it turns out that next_token is not in First(GH) U
Follow(A) either, then this was a bad decision, but nothing catastrophic
happens: the calls to G and H will go ahead and generate epsilon, we'll
return, and our caller will announce a syntax error -- just a bit later
than we could have.
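
To see the late error report concretely, here is a small runnable
Python sketch.  The productions for B, D, G and H are not given in the
notes, so the ones used here (B --> b, D --> d, G --> g | epsilon,
H --> h | epsilon), the start rule S --> A eof, and all class and
function names are hypothetical.

```python
class SyntaxErr(Exception):
    pass

class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["eof"]
        self.pos = 0

    @property
    def next_token(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if self.next_token != expected:
            raise SyntaxErr(f"expected {expected!r}, saw {self.next_token!r}")
        self.pos += 1

    def S(self):                        # hypothetical start rule: S --> A eof
        self.A()
        self.match("eof")

    def A(self):
        if self.next_token == "b":      # First(B c D) = {b} under the assumed B
            self.B()
            self.match("c")
            self.D()
        elif self.next_token == "e":
            self.match("e")
            self.match("f")
        else:                           # default: predict A --> G H
            self.G()
            self.H()

    def B(self):
        self.match("b")

    def D(self):
        self.match("d")

    def G(self):                        # G --> g | epsilon
        if self.next_token == "g":
            self.match("g")

    def H(self):                        # H --> h | epsilon
        if self.next_token == "h":
            self.match("h")

def parse(tokens):
    Parser(tokens).S()
    return True

print(parse(["b", "c", "d"]), parse(["g", "h"]), parse([]))
try:
    parse(["x"])
except SyntaxErr as err:
    print("error reported by the caller of A:", err)
```

On the input x, A predicts G H, both generate epsilon, and the error
surfaces only when S tries to match eof.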

---------------------------------------

LL(1) Grammars for Recursive-Descent Parsing
(also known as "top-down" or "predictive" parsing)

Recursive-descent parsing can only parse those CFG's that have disjoint
predict sets for productions that share a common left hand side.  CFG's
that obey this restriction are called LL(1).

From experience we know that it is usually possible to create an LL(1)
CFG for a programming language. However, not all CFG's are LL(1) and a
CFG that is not LL(1) may be parsable using some other (usually more
complex) parsing technique.

Two common properties of grammars that produce trouble for top-down
parsing are:

   * Left recursion: any grammar containing productions with left
     recursion, that is, productions of the form A --> A X1...Xm, cannot
     be LL(1). The problem is that any symbol that predicts this
     production the first time will, of necessity, continue to predict
     this production forever (and never be matched).

   * Common prefix: any grammar containing two productions for the same
     non-terminal that share a common prefix on the right hand side
     cannot be LL(1). The problem is that any symbol that predicts the
     first production must also predict the second; since the predict
     sets for the two productions are not disjoint, the grammar is not
     LL(1).
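
Both properties are easy to spot mechanically in their simplest form.
The Python sketch below is a rough check under stated assumptions: it
only finds immediate left recursion (A --> A ...) and productions for
the same non-terminal that begin with the same symbol, not indirect
cases.  The function name and grammar encoding are illustrative.

```python
from collections import defaultdict

def find_ll1_trouble(grammar):
    """grammar: list of (lhs, rhs) pairs.  Returns (left_recursive, common_prefix)."""
    left_recursive = set()          # non-terminals with a production A --> A ...
    common_prefix = set()           # (lhs, symbol) where two RHSs start with symbol
    leading = defaultdict(set)      # lhs -> leading RHS symbols seen so far
    for lhs, rhs in grammar:
        if not rhs:
            continue                # epsilon production: nothing to compare
        if rhs[0] == lhs:
            left_recursive.add(lhs)
        if rhs[0] in leading[lhs]:
            common_prefix.add((lhs, rhs[0]))
        leading[lhs].add(rhs[0])
    return left_recursive, common_prefix

LEFT_RECURSIVE = [
    ("E", ("E", "+", "T")), ("E", ("E", "-", "T")), ("E", ("T",)),
    ("T", ("T", "*", "F")), ("T", ("T", "/", "F")), ("T", ("F",)),
    ("F", ("(", "E", ")")), ("F", ("number",)),
]
RIGHT_RECURSIVE = [
    ("E", ("T", "+", "E")), ("E", ("T", "-", "E")), ("E", ("T",)),
    ("T", ("F", "*", "T")), ("T", ("F", "/", "T")), ("T", ("F",)),
    ("F", ("(", "E", ")")), ("F", ("number",)),
]

print(find_ll1_trouble(LEFT_RECURSIVE))
print(find_ll1_trouble(RIGHT_RECURSIVE))
```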

---------------------------------------

Creating an LL(1) Grammar

Consider the following grammar for expressions:

  1. E --> E + T
  2. E --> E - T
  3. E --> T
  4. T --> T * F
  5. T --> T / F
  6. T --> F
  7. F --> ( E )
  8. F --> number

This grammar has left recursion, and therefore cannot be LL(1). We can
replace the use of left recursion with right recursion as follows:

  1. E --> T + E
  2. E --> T - E
  3. E --> T
  4. T --> F * T
  5. T --> F / T
  6. T --> F
  7. F --> ( E )
  8. F --> number

The resulting grammar is still not LL(1); productions 1-3 share a common
prefix, as do productions 4-6. We can eliminate the common prefix by
deferring the decision as to which production to pick until after seeing
the common prefix. This technique is called factoring the common prefix.

  1. E --> T Etail
  2. Etail --> + T Etail | - T Etail | epsilon
  3. T --> F Ttail
  4. Ttail --> * F Ttail | / F Ttail | epsilon
  5. F --> ( E ) | number

And this is, of course, our top-down grammar for expressions.
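
The notes reach this grammar in two steps (right recursion, then
factoring).  A common textbook shortcut, which is not the route taken
above, rewrites A --> A alpha | beta directly as A --> beta Atail and
Atail --> alpha Atail | epsilon.  Here is a minimal Python sketch of
that shortcut; it handles immediate left recursion only, and the
function name and the "tail" naming convention are assumptions.

```python
def eliminate_immediate_left_recursion(grammar):
    """grammar: dict mapping each non-terminal to a list of RHS tuples."""
    result = {}
    for lhs, alternatives in grammar.items():
        # alpha parts of A --> A alpha, and the remaining beta alternatives
        alphas = [rhs[1:] for rhs in alternatives if rhs and rhs[0] == lhs]
        betas = [rhs for rhs in alternatives if not rhs or rhs[0] != lhs]
        if not alphas:
            result[lhs] = list(alternatives)    # nothing to do
            continue
        tail = lhs + "tail"                     # e.g. E gets a helper Etail
        result[lhs] = [beta + (tail,) for beta in betas]
        result[tail] = [alpha + (tail,) for alpha in alphas] + [()]
    return result

LEFT_RECURSIVE = {
    "E": [("E", "+", "T"), ("E", "-", "T"), ("T",)],
    "T": [("T", "*", "F"), ("T", "/", "F"), ("F",)],
    "F": [("(", "E", ")"), ("number",)],
}

for lhs, alternatives in eliminate_immediate_left_recursion(LEFT_RECURSIVE).items():
    shown = [" ".join(rhs) if rhs else "epsilon" for rhs in alternatives]
    print(lhs, "-->", " | ".join(shown))
```

For this example the output is the same five-line grammar shown above.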

WARNING: while it is possible to mechanically eliminate left recursion
and common prefixes from a grammar, this is not guaranteed to make the
result LL(1).  Some languages just can't be parsed top-down.  Here's a
grammar for one:

    G --> a B b
    G --> a C c
    B --> a B b
    B -->
    C --> a C c
    C -->

The language consists of all strings of a's followed by an equal number
of b's or by an equal number of c's.  Left factoring doesn't solve the
problem (try it!).

------------------------------------------------------------------------

Table-Driven Parsing

In recursive-descent parsing, the decision as to which production to
choose for a particular non-terminal is hard-coded into the procedure
for the non-terminal.

The problem with recursive-descent parsing is that it is inflexible;
changes in the grammar can cause significant (and in some cases
non-obvious) changes to the parser.

Since recursive-descent parsing uses an implicit stack of procedure
calls, it is possible to replace the parsing procedures and implicit
stack with an explicit stack and a single parsing procedure that
manipulates the stack.

In this scheme, we encode the actions the parsing procedure should take
in a table. This table can be generated automatically (with the grammar
as input), which is why this approach adapts more easily to changes in
the grammar.  (BTW, we could automatically generate a recursive-descent
parser, but that's no easier, and it's likely to be a little slower, so
nobody bothers.)

Note the analogy to scanning: table-driven top-down parsers are to
recursive descent parsers as table-driven scanners are to
nested-switch-statement scanners.

---------------------------------------

A Table-Driven Parser

The parse table encodes the choice of production as a function of the
current non-terminal of interest and the lookahead symbol.

T: Non-terminals x Terminals -> Productions U {Error}

The entry T[A,x] gives the number of the production to choose when A is
the non-terminal of interest and x is the current input symbol:

T[A,x] == A -> X1..Xm if x in Predict(A->X1..Xm)
otherwise T[A,x] == Error

The driver procedure is very simple. It stacks symbols that are to be
matched or expanded. Terminal symbols on the stack must match an input
symbol; non-terminal symbols are expanded according to the Predict sets,
which are embedded in the parse table.  The Predict set for a given
production is basically the First set for its RHS.  The possible
exception arises for epsilon productions (including productions that can
generate epsilon indirectly): for these we can, if we want, use Follow
sets (mentioned but not described above) to avoid predicting an epsilon
production when it's certain to lead to an error later on.

---------------------------------------

Parse Table for Expressions

Here is an LL(1) expression grammar, augmented to include the end
marker:

  1. S --> E eof
  2. E --> T Etail
  3. Etail --> + T Etail
  4. Etail --> - T Etail
  5. Etail --> epsilon
  6. T --> F Ttail
  7. Ttail --> * F Ttail
  8. Ttail --> / F Ttail
  9. Ttail --> epsilon
 10. F --> ( E )
 11. F --> number

The table for this expression grammar is as follows, where a blank entry
corresponds to an error:

           (   )   +   -   *   /   number   eof
  -----------------------------------------------
  S        1                         1
  -----------------------------------------------
  E        2                         2
  -----------------------------------------------
  Etail        5   3   4                     5
  -----------------------------------------------
  T        6                         6
  -----------------------------------------------
  Ttail        9   9   9   7   8             9
  -----------------------------------------------
  F        10                        11

This table is constructed from the Predict sets described earlier.  It's
basically the same as the labels on the switch statements in the
recursive descent parser.  The only difference is that the tool used to
generate the table has used Follow sets to distinguish between cases
where predicting an epsilon production is a good idea and cases where
predicting an epsilon production is certain to lead to an error later on.
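
Follow sets are not defined in these notes, so the following Python
sketch should be read as one possible way a table generator might work,
not as the tool actually used.  It assumes the usual definition
(Follow(A) is the set of terminals that can appear immediately after A
in some derivation) and takes Predict of a production to be First of
its RHS, plus Follow of its LHS when the RHS can generate epsilon.  All
names are illustrative.

```python
PRODUCTIONS = {
    1: ("S", ("E", "eof")),
    2: ("E", ("T", "Etail")),
    3: ("Etail", ("+", "T", "Etail")),
    4: ("Etail", ("-", "T", "Etail")),
    5: ("Etail", ()),
    6: ("T", ("F", "Ttail")),
    7: ("Ttail", ("*", "F", "Ttail")),
    8: ("Ttail", ("/", "F", "Ttail")),
    9: ("Ttail", ()),
    10: ("F", ("(", "E", ")")),
    11: ("F", ("number",)),
}
NONTERMINALS = {lhs for lhs, _ in PRODUCTIONS.values()}

def first_of(symbols, first, nullable):
    """First of a symbol string, plus whether the whole string can vanish."""
    out = set()
    for sym in symbols:
        if sym not in NONTERMINALS:
            out.add(sym)
            return out, False
        out |= first[sym]
        if sym not in nullable:
            return out, False
    return out, True

def build_table(productions):
    nullable = set()
    first = {nt: set() for nt in NONTERMINALS}
    follow = {nt: set() for nt in NONTERMINALS}
    changed = True
    while changed:                  # iterate until a pass learns nothing new
        changed = False
        for lhs, rhs in productions.values():
            f, vanishes = first_of(rhs, first, nullable)
            if vanishes and lhs not in nullable:
                nullable.add(lhs)
                changed = True
            if not f <= first[lhs]:
                first[lhs] |= f
                changed = True
            for i, sym in enumerate(rhs):
                if sym in NONTERMINALS:
                    # What can follow sym: First of the rest of the RHS, plus
                    # Follow(lhs) if that rest can vanish.
                    f, vanishes = first_of(rhs[i + 1:], first, nullable)
                    new = f | (follow[lhs] if vanishes else set())
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    table = {}
    for number, (lhs, rhs) in productions.items():
        f, vanishes = first_of(rhs, first, nullable)
        predict = f | (follow[lhs] if vanishes else set())
        for terminal in predict:
            assert (lhs, terminal) not in table, "grammar is not LL(1)"
            table[lhs, terminal] = number
    return table

TABLE = build_table(PRODUCTIONS)
for nt in ("Etail", "Ttail"):
    print(nt, {t: n for (a, t), n in sorted(TABLE.items()) if a == nt})
```

For this grammar the sketch yields the same entries as the table above;
any (non-terminal, terminal) pair missing from TABLE corresponds to a
blank (error) entry.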

In effect, the recursive descent parser shown above has a '5' in every
blank entry in the Etail row, and a '9' in every blank entry in the
Ttail row.  If we wanted we could use Follow sets to get earlier error
detection in the recursive descent parser, too.  The Etail and Ttail
routines would then look like this:

    procedure Etail
        switch next_token
            case +
                match(+)
                T()
                Etail()
            case -
                match(-)
                T()
                Etail()
            case ), eof
                return
            default
                error()

    procedure Ttail
        switch next_token
            case *
                match(*)
                F()
                Ttail()
            case /
                match(/)
                F()
                Ttail()
            case ), eof, +, -
                return
            default
                error()
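
For reference, here is a runnable Python rendering of a complete
recursive-descent recognizer with these two routines.  It is a sketch
under stated assumptions: the input is already a list of token names
such as "number" and "+", the remaining routines follow the grammar
directly, and the class, method and exception names are illustrative.

```python
class ParseError(Exception):
    pass

class ExprParser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["eof"]
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if self.peek() != expected:
            raise ParseError(f"expected {expected!r}, saw {self.peek()!r}")
        self.pos += 1

    def S(self):
        self.E()
        self.match("eof")

    def E(self):
        self.T()
        self.Etail()

    def Etail(self):
        tok = self.peek()
        if tok in ("+", "-"):
            self.match(tok)
            self.T()
            self.Etail()
        elif tok in (")", "eof"):       # Follow(Etail): predict epsilon
            return
        else:
            raise ParseError(f"Etail: unexpected {tok!r}")

    def T(self):
        self.F()
        self.Ttail()

    def Ttail(self):
        tok = self.peek()
        if tok in ("*", "/"):
            self.match(tok)
            self.F()
            self.Ttail()
        elif tok in (")", "eof", "+", "-"):   # Follow(Ttail): predict epsilon
            return
        else:
            raise ParseError(f"Ttail: unexpected {tok!r}")

    def F(self):
        if self.peek() == "(":
            self.match("(")
            self.E()
            self.match(")")
        else:
            self.match("number")

def parse(tokens):
    ExprParser(tokens).S()
    return True

print(parse(["number", "+", "number", "*", "(", "number", ")"]))
try:
    parse(["number", "number"])
except ParseError as err:
    print(err)      # raised inside Ttail, before returning to any caller
```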

---------------------------------------

Driver Procedure

Under table-driven parsing, there is a single procedure that
"interprets" the parse table. This "driver" procedure takes the
following form:

next_token : symbol
PS : stack of symbol                        // explicit parsing stack
PT : array[symbol, token] of production     // parse table

procedure parse
    PS.push(S)
    next_token := scan()
    while not PS.empty() do
        top : symbol := PS.top()
        if top is a nonterminal then
            prod : production := PT[top, next_token]
            if prod > 0 then
                PS.pop()
                // push right to left, so the leftmost RHS symbol ends up on top
                for each symbol on RHS of prod, in reverse order, do
                    PS.push(symbol)
            else
                error()
        else if next_token == top then
            PS.pop()      // match terminal symbol in input
            next_token := scan()
        else
            error()
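
A minimal runnable Python version of the driver, using the expression
grammar and parse table given earlier.  It assumes the input is already
a list of token names, represents blank (error) entries by their
absence from a dictionary, and pushes each RHS right to left so that
its leftmost symbol is on top of the stack.  The names TABLE, RHS and
parse, and the choice to return the list of predicted productions, are
illustrative.

```python
class ParseError(Exception):
    pass

RHS = {
    1: ("E", "eof"), 2: ("T", "Etail"),
    3: ("+", "T", "Etail"), 4: ("-", "T", "Etail"), 5: (),
    6: ("F", "Ttail"),
    7: ("*", "F", "Ttail"), 8: ("/", "F", "Ttail"), 9: (),
    10: ("(", "E", ")"), 11: ("number",),
}
NONTERMINALS = {"S", "E", "Etail", "T", "Ttail", "F"}
TABLE = {
    ("S", "("): 1, ("S", "number"): 1,
    ("E", "("): 2, ("E", "number"): 2,
    ("Etail", ")"): 5, ("Etail", "+"): 3, ("Etail", "-"): 4, ("Etail", "eof"): 5,
    ("T", "("): 6, ("T", "number"): 6,
    ("Ttail", ")"): 9, ("Ttail", "+"): 9, ("Ttail", "-"): 9,
    ("Ttail", "*"): 7, ("Ttail", "/"): 8, ("Ttail", "eof"): 9,
    ("F", "("): 10, ("F", "number"): 11,
}

def parse(tokens):
    """Return the production numbers predicted, in order, or raise ParseError."""
    stream = iter(list(tokens) + ["eof"])
    stack = ["S"]                           # explicit parsing stack
    next_token = next(stream)
    used = []
    while stack:
        top = stack[-1]
        if top in NONTERMINALS:
            prod = TABLE.get((top, next_token))
            if prod is None:                # blank table entry
                raise ParseError(f"no production for {top} on {next_token!r}")
            stack.pop()
            stack.extend(reversed(RHS[prod]))   # leftmost symbol ends up on top
            used.append(prod)
        elif top == next_token:
            stack.pop()                     # match terminal symbol in input
            next_token = next(stream, None)
        else:
            raise ParseError(f"expected {top!r}, saw {next_token!r}")
    return used

print(parse(["number", "+", "number"]))   # [1, 2, 6, 11, 9, 3, 6, 11, 9, 5]
```

The returned list is a leftmost derivation of the input: each number
names the production used to expand the leftmost remaining
non-terminal.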