## The compilation process #1

As part of my review for CS 241, in this post I’ll be giving the first part of my detailed summary of a typical compilation process,  the details of which will make use of the theory of computation (particularly, context-free grammars). In this article, “compilation” refers to the translation of a C-like language into MIPS assembly language.

§1. Overview of compilation. The process of compilation is broken down into the following components:

• Lexical analysis or scanning. In this stage, we take the raw source code as input, and output a sequence of tokens. This is usually done through the use of a DFA (deterministic finite automaton) whose states and transitions encode the possibilities for what tokens are allowed to “look like” in the language. If the source code contains misspelled keywords, unexpected characters and other similarly “crude” problems, the scanner will probably fail to convert them into tokens properly, and hence cause an error. Maximal munch is a common approach for tokenization.
• Syntactic analysis or parsing. In this stage, we take the sequence of tokens as input, and (hopefully) output a parse tree. There are two approaches to pulling this off, and they both require quite a bit of work. This is where context-free grammars come in: the parse tree is a detailed representation of how the grammar rules are used to derive the string (token sequence).
• Semantic analysis. In this stage, we look at the parse tree and start making sure things “make sense”, that is, performing type checks, lvalue checks and so on. We also start attaching some semantic information to the parse tree which will be used later on.
• Code generation. This is the final stage of compilation. We now have a parse tree as well as a symbol table to work with, so we can use a recursive code generation procedure to nicely generate the required assembly code.

§2. Context-free grammars. The first thing we want to talk about is a mathematical model of computation known as a context-free grammar. I mentioned DFAs above, which recognize the regular languages, but we are interested instead in CFGs for reasons which will become obvious shortly (also, they recognize a strictly larger class of languages than DFAs). Of course, the price you pay is increased complexity.

A context-free grammar $G$ consists formally of the following:

• An alphabet of so-called terminal symbols, usually denoted $\Sigma$.
• A finite, nonempty set $N$, disjoint from $\Sigma$, of so-called nonterminals.
• A finite set $P$ of productions, which take the form $A \to B$ where $A \in N$ and $B \in V^*$, where the * is the Kleene star. The vocabulary $V$ is defined to be $N \cup \Sigma$.
• An element $S \in N$ called the start symbol.

We write $\alpha A \beta \Rightarrow \alpha \gamma \beta$ if there is a production $A \to \gamma$ and say that the right-hand side immediately derives the left-hand side. Similarly, we write $\alpha A \beta \rightharpoonup \alpha \gamma \beta$ if there is a sequence of immediate derivations $\alpha A \beta \Rightarrow \delta_1 \Rightarrow \ldots \Rightarrow \delta_k \Rightarrow \alpha \gamma \beta$, and say that the right-hand side derives the left-hand side. A sequence of immediate derivations is called a derivation.

The notation $L(G)$ is used to denote the language of strings recognized by the grammar $G$, that is, the subset of $\Sigma^*$ consisting of the strings $w$ which are eventually derivable from $S$.

For example, the following grammar recognizes precisely the language of palindromes (strings invariant under reversal) over $\{ a, b, c \}$:

$\displaystyle S \to aSa \mid bSb \mid cSc \mid M$
$\displaystyle M \to a \mid b \mid c \mid \epsilon$

This is a rather compact description, but it should be clear from context that $N = \{ S, M \}$, $\Sigma = \{ a, b, c \}$, and so on. Also, $\epsilon$ as usual represents the empty string.

§3. Leftmost/rightmost derivations and ambiguity. A parse tree is a tree representation of a derivation, since a derivation is a sequence of immediate derivations. The root corresponds to the start symbol, and every subtree represents an application of a production. We want to talk about certain “standard” kinds of derivations; these are known as leftmost and rightmost derivations. The explanation is as follows. In a leftmost derivation, you always “expand” (that is, apply a production to) the leftmost nonterminal in the string as you’re going along (rightmost derivations are analogous). However, not surprisingly, there may sometimes be some “choice” when it comes to which production we apply to a nonterminal.

The definition of a leftmost derivation seems quite stringent, but due to this possibility of “choosing” different sequences of productions, we obtain the surprising result that some grammars contain strings for which there exist multiple leftmost derivations. Such grammars are called ambiguous grammars. As you probably guessed, the structure of the parse tree has a lot to do with how a string might be interpreted (for example, order of operations in an algebraic expression), so this is quite bad. We therefore often seek unambiguous grammars for languages — but it turns out that certain languages can be described only by ambiguous grammars. These awful languages are called ambiguous languages.

As an example of ambiguity, consider the following two leftmost derivations of $a+b*c$ (note that I’ve compacted several steps into one, and also left out the actual grammar description):

S => S OP S => S OP S OP S => a + b OP S => a + b * c
S => S OP S => a + S OP S => a + b * c

You can see the difference in how the variables are paired with one another. The structure of the parse tree makes this even more obvious, but I don’t have time to upload a picture of it at the moment. In the first derivation, $a+b$ is first formed, and the final step is multiplying by $c$, so this gives us the idea of $(a+b)*c$. In the second derivation, $a$ is formed on the left and then we add $b * c$. Clearly, the second derivation is the one that respects the usual algebraic order of operations: multiplication is supposed to bind more tightly than addition.