The compilation process #2

§4. Methods of parsing. We now embark, without reservation, on the actual quest of finding this mystical parse tree. To recognize context-free languages, we make use of a DFA in addition to a stack (a last-in, first-out structure of unbounded size); this combination is known as a pushdown automaton. There are two different approaches to finding the parse tree:

  • Top-down. In this approach we attempt to deduce, in forward order, the productions which are required to get from the start symbol S to the given string w.
  • Bottom-up. In this approach we start with w and try to reconstruct, in reverse order, the productions required to get to w from S.

In general, we will for simplicity’s sake parse what are called augmented grammars. A grammar G is augmented as follows. If S is the start symbol for G, then we add two new terminals \lhd (beginning of file) and \rhd (end of file). We then add a new nonterminal S', and use this as the new start symbol, adding a production S' \to \lhd S \rhd.
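For a tiny concrete example (one of my own, not a grammar we’ve seen before): if G consists of the productions S \to aSb and S \to \epsilon, the augmented grammar keeps both of these, adds S' \to \lhd S \rhd, and uses S' as its new start symbol. Every parse then begins by matching \lhd and ends by matching \rhd, which conveniently pins down exactly where the input begins and ends.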

It should also be noted that we’re not going to know a priori whether our string w actually lies in the language described by the grammar.

§5. First attempt at top-down parsing. Let’s get right into this and see if we can flesh out, at least crudely, what a top-down parsing algorithm might look like. We use the terminology “TOS” to refer to the top of the stack. Here’s the first attempt:

  • When we start, the stack will consist only of the start symbol.
  • At each step, we look at TOS:
    • If it’s a terminal, pop it off and match it against the leftmost input symbol; if they don’t match, the parse fails. (After a successful match, of course, we chop this character off the input.)
    • If it’s a nonterminal A, pop it off the stack and push \alpha onto the stack where A \to \alpha is a grammar rule.
  • Accept when the stack and input are both empty.

In this sense, we’re traversing the input string w from left to right and performing a series of stack pushes, stack pops, and matches until we’ve exhausted all of its symbols.
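To make this recipe concrete, here is a minimal sketch of the stack machine in Python (my own illustration, assuming a grammar represented as a dict from nonterminals to lists of right-hand-side tuples, and a hypothetical helper choose_production that somehow picks a rule A \to \alpha; picking it well is precisely the part we haven’t pinned down yet). One detail glossed over above: \alpha has to be pushed in reverse, so that its leftmost symbol ends up on top of the stack.

def naive_parse(grammar, start, w, choose_production):
    # grammar: dict mapping each nonterminal to a list of right-hand sides,
    #          where a right-hand side is a tuple of symbols.
    # choose_production(A): hypothetical oracle returning some alpha with A -> alpha.
    stack = [start]
    i = 0                                  # index of the leftmost unread input symbol
    while stack:
        top = stack.pop()
        if top in grammar:                 # nonterminal: replace it by some alpha
            alpha = choose_production(top)
            stack.extend(reversed(alpha))  # reversed, so alpha's first symbol is on top
        elif i < len(w) and w[i] == top:   # terminal: match it against the input
            i += 1                         # chop the matched character off the input
        else:
            return False                   # mismatch: this sequence of choices failed
    return i == len(w)                     # accept when stack and input are both empty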

This works great for very nice (OK, OK, unrealistically nice) grammars. What is the major flaw? Well, it has to do with the grammar rule A \to \alpha mentioned — there may be multiple productions with A on the left hand side. So which one do we choose? Certainly we don’t want to choose to use A \to \alpha if \alpha is going to lead us to a “dead end” in trying to construct our string w.

We don’t want to simply try all the possible productions, since it could take quite a few steps before we realize we chose the wrong one — and since we’ll be running into this dilemma quite often along the way, you can probably imagine how the number of possibilities would explode in your face, and the efficiency of the algorithm would go to hell in a handbasket. As you might have guessed, it will help to look ahead in the input string, to “see what’s coming”, so that we can make a somewhat informed decision as to which production to use. Even with this in hand, there are many grammars that could singlehandedly screw us over. There are, however, enough grammars for which a little lookahead suffices to make the approach well worth studying.

§6. The predictor table. We therefore now know that whenever TOS is a nonterminal, we’ll have to look ahead (that is, look at the first character of what remains of w). Given a nonterminal A on TOS and a character a, the predictor table is going to tell us which productions we could possibly use. That is, the predictor table has columns corresponding to terminals, and rows corresponding to nonterminals. The cells are sets of productions. The notation used to refer to the desired cell, in this case, is \mathrm{Predict}(A,a). If \mathrm{Predict}(A,a) is the empty set, we are in trouble — we’ve observed an unexpected situation, and the parsing fails.

I mentioned above that \mathrm{Predict}(A,a) is always a set of productions. In the ideal case, we are hoping for each cell in the predictor table to be either an empty set or a singleton set. When this occurs, we say that the grammar G is an LL(1) grammar, standing for Left-to-right, Leftmost derivation, 1 symbol of lookahead (if you think about it, the algorithm I described in §5 will yield a leftmost derivation). In other words, given a nonterminal A on TOS and a character a, we don’t want multiple possible productions. As a sidenote, if we allow ourselves k symbols of lookahead (so the table is indexed by nonterminals and strings of k terminals) and every cell is still at most a singleton, we use the term LL(k) grammar.
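To see how a grammar can fail to be LL(1), consider (again, a toy example of mine) a grammar containing both A \to ab and A \to ac. With A on TOS and lookahead a, both productions land in \mathrm{Predict}(A,a), so one symbol of lookahead cannot tell them apart, although two symbols could.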

The predictor table also brings with it the gift of detailed parsing error messages. If we happen to look at \mathrm{Predict}(A,a) and it’s empty, then we can return a message like “Parse error at (row, column): expecting one of __, __, __”, where the blanks are filled in with characters for which the current TOS has nonempty entries in the table.
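Putting §5 and §6 together, a sketch of the LL(1) driver might look as follows (my own illustration again, continuing the representation from the sketch in §5; I’m assuming the predictor table is a dict mapping (nonterminal, terminal) pairs to sets of right-hand sides, and for brevity I track a flat symbol index rather than a (row, column) position):

def ll1_parse(predict, nonterminals, start, w):
    # predict: dict mapping (nonterminal, terminal) pairs to sets of right-hand
    #          sides; absent keys are treated as empty cells.
    # w: the input, already augmented with the \lhd and \rhd markers.
    stack = [start]
    i = 0
    while stack:
        top = stack.pop()
        if top in nonterminals:
            a = w[i] if i < len(w) else None           # 1 symbol of lookahead
            cell = predict.get((top, a), set())
            if not cell:                               # empty cell: unexpected situation
                expected = sorted(t for (A, t) in predict if A == top)
                raise SyntaxError(f"parse error at symbol {i}: expecting one of {expected}")
            (alpha,) = cell                            # LL(1): the cell is a singleton
            stack.extend(reversed(alpha))
        elif i < len(w) and w[i] == top:               # terminal: match and consume
            i += 1
        else:
            raise SyntaxError(f"parse error at symbol {i}: expecting {top!r}")
    if i != len(w):
        raise SyntaxError("input left over after the stack emptied")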

§7. Construction of the predictor table. Anyways, this predictor table thing sounds pretty awesome, but how do we actually go about constructing it? Just because a production A \to \alpha can be used to derive a string abcdef doesn’t mean that the first symbol of \alpha is necessarily an a. There are a lot of complications going on here.

As mentioned before, we only want to include a production A \to \alpha in the set \mathrm{Predict}(A,a) if \alpha can derive a string starting with a. That is, let \mathrm{First}(\alpha) be the set of all terminals a such that \alpha derives a string of the form a\gamma, for some (possibly empty) \gamma. Then our intuition suggests defining

\displaystyle \mathrm{Predict}(A,a) := \{ A \to \alpha \mid a \in \mathrm{First}(\alpha) \}

But this is wrong! What if, in the derivation of our string w, the right-hand side \alpha derives \epsilon and disappears into nothing? To compensate for this we have to redefine \mathrm{Predict}(A,a) to also include the set

\displaystyle \{ A \to \alpha \mid \mathrm{Nullable}(\alpha), a \in \mathrm{Follow}(A) \}.

Of course, \mathrm{Nullable}(\alpha) is true if and only if \alpha derives the empty string. The set \mathrm{Follow}(A) consists of the terminal characters which can come immediately after A in a derivation from the start symbol.
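For concreteness, here is one way to compute \mathrm{Nullable}, \mathrm{First}, \mathrm{Follow} and finally the predictor table, by iterating each definition to a fixed point (a sketch under the same representation assumptions as the driver above, written for clarity rather than efficiency):

def build_predict(grammar):
    # grammar: dict mapping each nonterminal of the (augmented) grammar to a
    # list of right-hand-side tuples. Cells of the returned table are sets of
    # right-hand sides alpha, standing in for the productions A -> alpha.
    nonterminals = set(grammar)

    # Nullable(A): A derives the empty string. Iterate to a fixed point.
    nullable = set()
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            if A not in nullable and any(all(x in nullable for x in alpha) for alpha in rhss):
                nullable.add(A)
                changed = True

    def seq_nullable(alpha):
        return all(x in nullable for x in alpha)

    first = {A: set() for A in nonterminals}

    def first_of_seq(alpha):
        # Terminals that can begin a string derived from the sequence alpha.
        out = set()
        for x in alpha:
            if x not in nonterminals:      # terminal: it starts the string, stop here
                out.add(x)
                break
            out |= first[x]
            if x not in nullable:          # x cannot vanish, so nothing later contributes
                break
        return out

    # First(A): fixed point over all productions A -> alpha.
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            for alpha in rhss:
                new = first_of_seq(alpha) - first[A]
                if new:
                    first[A] |= new
                    changed = True

    # Follow(A): terminals that can appear immediately after A in a derivation.
    follow = {A: set() for A in nonterminals}
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            for alpha in rhss:
                for i, x in enumerate(alpha):
                    if x not in nonterminals:
                        continue
                    rest = alpha[i + 1:]
                    new = first_of_seq(rest)
                    if seq_nullable(rest):  # everything after x can vanish
                        new = new | follow[A]
                    new -= follow[x]
                    if new:
                        follow[x] |= new
                        changed = True

    # Predict(A, a) = { A -> alpha : a in First(alpha) }
    #                 union { A -> alpha : Nullable(alpha) and a in Follow(A) }
    predict = {}
    for A, rhss in grammar.items():
        for alpha in rhss:
            for a in first_of_seq(alpha):
                predict.setdefault((A, a), set()).add(alpha)
            if seq_nullable(alpha):
                for a in follow[A]:
                    predict.setdefault((A, a), set()).add(alpha)
    return predict

With this in hand, something like ll1_parse(build_predict(G), set(G), start_symbol, w) strings the whole pipeline together, provided every cell of the table really does turn out to be at most a singleton.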
