## The compilation process #2

§4. Methods of parsing. We now embark, without reservation, on the actual quest of finding this mystical parse tree. To recognize context-free languages, we make use of a DFA in addition to a stack (a last-in first-out store of unbounded size); this combination is known as a pushdown automaton. There are two different approaches to finding the parse tree:

• Top-down. In this approach we attempt to deduce, in forward order, the productions which are required to get from the start symbol $S$ to the given string $w$.
• Bottom-up. In this approach we start with $w$ and try to manufacture, in reverse order, the productions required to get there from $S$.

In general, we will for simplicity’s sake parse what are called augmented grammars. A grammar $G$ is augmented as follows. If $S$ is the start symbol for $G$, then we add two new terminals $\lhd$ (beginning of file) and $\rhd$ (end of file). We then add a new nonterminal $S'$, and use this as the new start symbol, adding a production $S' \to \lhd S \rhd$.
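As a quick sketch (in Python, with made-up markers `<BOF>` and `<EOF>` standing in for $\lhd$ and $\rhd$, and with grammars represented as lists of productions purely for illustration), augmentation is a one-line transformation:

```python
# A grammar as a list of (lhs, rhs) productions; each rhs is a tuple of
# symbols. Hypothetical toy grammar: S -> a S b | epsilon.
grammar = [("S", ("a", "S", "b")), ("S", ())]

def augment(grammar, start, new_start="S'", bof="<BOF>", eof="<EOF>"):
    """Return the augmented grammar, adding the production S' -> <BOF> S <EOF>."""
    return [(new_start, (bof, start, eof))] + grammar

aug = augment(grammar, "S")
assert aug[0] == ("S'", ("<BOF>", "S", "<EOF>"))
```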

It should also be noted that we’re not going to know a priori whether our string $w$ actually lies in the language described by the grammar.

§5. First attempt at top-down parsing. Let’s get right into this and see if we can flesh out, at least crudely, what a top-down parsing algorithm might act like. We use the terminology “TOS” to refer to the top of the stack. Here’s the first attempt:

• When we start, the stack will consist only of the start symbol.
• At each step, we look at TOS:
• If it’s a terminal, pop it off and match it against the leftmost input symbol; if the two don’t match, the parse fails. (After a successful match, of course, we chop this character off the input.)
• If it’s a nonterminal $A$, pop it off the stack and push $\alpha$ onto the stack where $A \to \alpha$ is a grammar rule.
• Accept when the stack and input are both empty.

In this sense, we’re traversing the input string $w$ from left to right and performing a series of stack pushes, stack pops, and matches until we’ve exhausted all of its symbols.
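To make the procedure concrete, here is a minimal Python sketch of the stack machine above, run on a hypothetical toy grammar in which every nonterminal has exactly one production, so the question of which rule to pick never arises (the function and grammar are my own illustration):

```python
def parse(productions, start, tokens):
    """Naive top-down parse. productions maps each nonterminal to its single
    right-hand side (a tuple of symbols); anything not in the map is a terminal."""
    stack = [start]
    pos = 0
    while stack:
        top = stack.pop()
        if top in productions:
            # Nonterminal: replace it by its right-hand side. We push the
            # symbols in reverse so the leftmost one ends up on TOS.
            stack.extend(reversed(productions[top]))
        else:
            # Terminal: match it against the leftmost remaining input symbol.
            if pos >= len(tokens) or tokens[pos] != top:
                return False
            pos += 1
    # Accept when the stack and the input are both exhausted.
    return pos == len(tokens)

# Hypothetical grammar: S -> a B, B -> b c
prods = {"S": ("a", "B"), "B": ("b", "c")}
assert parse(prods, "S", list("abc"))
assert not parse(prods, "S", list("ab"))
```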

This works great for very nice (OK, OK, unrealistically nice) grammars. What is the major flaw? Well, it has to do with the grammar rule $A \to \alpha$ mentioned above — there may be multiple productions with $A$ on the left hand side. So which one do we choose? Certainly we don’t want to use $A \to \alpha$ if $\alpha$ is going to lead us to a “dead end” in trying to construct our string $w$.

We don’t want to simply try all the possible productions, since it could take quite a few steps before we realize we chose the wrong one — and since we’ll be running into this dilemma quite often along the way, you can probably imagine how the number of possibilities would explode in your face, and the efficiency of the algorithm would go to hell in a handbasket. As you might have guessed, it will help to look ahead in the input string, to “see what’s coming”, so that we can make a somewhat informed decision as to which production to use. Even with this in hand, there are many grammars that could singlehandedly screw us over. There are, however, enough situations where one symbol of lookahead suffices to warrant studying the approach.

§6. The predictor table. We therefore now know that whenever TOS is a nonterminal, we’ll have to look ahead (that is, look at the first character of what remains of $w$). Given a nonterminal $A$ on TOS and a character $a$, the predictor table is going to tell us which productions we could possibly use. That is, the predictor table has columns corresponding to terminals, and rows corresponding to nonterminals. The cells are sets of productions. The notation used to refer to the desired cell, in this case, is $\mathrm{Predict}(A,a)$. If $\mathrm{Predict}(A,a)$ is the empty set, we are in trouble — we’ve observed an unexpected situation, and the parsing fails.

I mentioned above that $\mathrm{Predict}(A,a)$ is always a set of productions. In the ideal case, we are hoping for each cell in the predictor table to either be an empty set, or a singleton set. When this occurs, we say that the grammar $G$ is an LL(1) grammar, standing for Left-to-right, Leftmost derivation, 1 symbol lookahead (if you think about it, the algorithm I described in §5 will yield a leftmost derivation). In other words, given a nonterminal $A$ on TOS and a character $a$, we don’t want multiple possible productions. As a sidenote, if the parser needs to look ahead $k$ symbols (rather than just 1) before every cell becomes a singleton, we use the term LL(k) grammar.
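For a concrete picture of how an LL(1) parser consumes the table, here is a hedged Python sketch. The grammar $S \to aSb \mid \epsilon$ is a standard example; since every cell holds at most one production, each cell can be stored directly as the right-hand side to push, with an absent key playing the role of the empty set (the representation and function name are my own):

```python
def ll1_parse(predict, start, tokens):
    """Table-driven LL(1) parse. predict maps (nonterminal, lookahead)
    to the unique right-hand side to use; terminals are matched directly."""
    tokens = list(tokens) + ["$"]           # "$" plays the end-of-file marker
    stack = ["$", start]
    pos = 0
    nonterminals = {A for (A, _) in predict}
    while stack:
        top = stack.pop()
        if top in nonterminals:
            rhs = predict.get((top, tokens[pos]))
            if rhs is None:                 # empty cell: unexpected lookahead
                return False
            stack.extend(reversed(rhs))     # leftmost symbol goes on TOS
        else:
            if tokens[pos] != top:          # terminal mismatch
                return False
            pos += 1
    return pos == len(tokens)

# Hypothetical LL(1) grammar: S -> a S b | epsilon
predict = {("S", "a"): ("a", "S", "b"),
           ("S", "b"): (),                 # b is in Follow(S): use S -> epsilon
           ("S", "$"): ()}
assert ll1_parse(predict, "S", "aabb")
assert not ll1_parse(predict, "S", "aab")
```

Note the small subtlety: an epsilon production is stored as the empty tuple `()`, which is a perfectly valid cell entry and is distinct from a missing key.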

The predictor table also brings with it the gift of detailed parsing error messages. If we happen to look at $\mathrm{Predict}(A,a)$ and it’s empty, then we can return a message like “Parse error at (row, column): expecting one of __, __, __”, where the blanks are filled in with characters for which the current TOS has nonempty entries in the table.
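A sketch of how such a message might be assembled from the table (the function name and message format are my own invention):

```python
def parse_error_message(predict, nonterminal, row, col):
    """Build an error message listing the terminals the parser could have
    accepted here: those a for which Predict(nonterminal, a) is nonempty."""
    expected = sorted(a for (A, a) in predict if A == nonterminal)
    return f"Parse error at ({row}, {col}): expecting one of {', '.join(expected)}"

# Same toy table as before: S -> a S b | epsilon, with "$" as end of input.
predict = {("S", "a"): ("a", "S", "b"), ("S", "b"): (), ("S", "$"): ()}
assert parse_error_message(predict, "S", 3, 7) == \
    "Parse error at (3, 7): expecting one of $, a, b"
```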

§7. Construction of the predictor table. Anyways, this predictor table thing sounds pretty awesome, but how do we actually go about constructing it? Just because a production $A \to \alpha$ can be used to derive a string $abcdef$ doesn’t mean that the first symbol of $\alpha$ is necessarily an $a$. There are a lot of complications going on here.

As mentioned before, we only want to include a production $A \to \alpha$ in the set $\mathrm{Predict}(A,a)$ if $\alpha$ can derive a string starting with $a$. That is, let $\mathrm{First}(\alpha)$ be the set of all characters $a$ such that $\alpha$ derives a string of the form $a \gamma$, for some (possibly empty) $\gamma$. Then our intuition suggests defining

$\displaystyle \mathrm{Predict}(A,a) := \{ A \to \alpha \mid a \in \mathrm{First}(\alpha) \}$

But this is wrong! What if, in the derivation of our string $w$, the string $\alpha$ derives $\epsilon$ and disappears into nothing? In that case the lookahead character $a$ is produced by whatever comes after $A$ in the derivation. To compensate for this we have to redefine $\mathrm{Predict}(A,a)$ to also include the set

$\displaystyle \{ A \to \alpha \mid \mathrm{Nullable}(\alpha), a \in \mathrm{Follow}(A) \}.$

Of course, $\mathrm{Nullable}(\alpha)$ is true if and only if $\alpha$ derives the empty string. The set $\mathrm{Follow}(A)$ consists of the terminal characters which can come immediately after $A$ in a derivation from the start symbol.
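Putting §7 together: $\mathrm{Nullable}$, $\mathrm{First}$, and $\mathrm{Follow}$ can all be computed by one fixed-point iteration over the productions, after which the predictor table falls out directly. Here is a Python sketch under the same illustrative grammar representation as before (a list of (lhs, rhs) pairs, which is my own convention, not a standard one):

```python
def build_predict(productions):
    """Fixed-point computation of Nullable, First, Follow, and the predictor
    table for a grammar given as a list of (lhs, rhs) productions."""
    nonterminals = {A for A, _ in productions}
    nullable = set()                            # nonterminals deriving epsilon
    first = {A: set() for A in nonterminals}
    follow = {A: set() for A in nonterminals}

    def seq_nullable(alpha):
        # A string of symbols is nullable iff every symbol in it is.
        return all(X in nullable for X in alpha)

    def seq_first(alpha):
        # First of a string: union Firsts left to right, stopping at the
        # first symbol that cannot vanish.
        out = set()
        for X in alpha:
            out |= first[X] if X in nonterminals else {X}
            if X not in nullable:
                break
        return out

    changed = True
    while changed:                              # iterate to a fixed point
        changed = False
        for A, alpha in productions:
            if seq_nullable(alpha) and A not in nullable:
                nullable.add(A); changed = True
            f = seq_first(alpha)
            if not f <= first[A]:
                first[A] |= f; changed = True
            for i, X in enumerate(alpha):       # propagate Follow sets
                if X in nonterminals:
                    rest = alpha[i + 1:]
                    f2 = seq_first(rest)
                    if seq_nullable(rest):      # X can end A's expansion
                        f2 |= follow[A]
                    if not f2 <= follow[X]:
                        follow[X] |= f2; changed = True

    predict = {}
    for A, alpha in productions:
        keys = set(seq_first(alpha))
        if seq_nullable(alpha):                 # the epsilon case from above
            keys |= follow[A]
        for a in keys:
            predict.setdefault((A, a), set()).add((A, alpha))
    return nullable, first, follow, predict

# Toy grammar S -> a S b | epsilon. (With the augmented grammar of §4, the
# end-of-file marker would land in Follow(S) automatically.)
prods = [("S", ("a", "S", "b")), ("S", ())]
nullable, first, follow, predict = build_predict(prods)
assert nullable == {"S"} and first["S"] == {"a"} and follow["S"] == {"b"}
```

On this grammar every cell of the resulting table is a singleton, which is exactly the LL(1) condition from §6.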