Lecture 4: Conditionals and A-Normal Form
Our previous compiler could increment and decrement numbers, as well as handle let-bound identifiers. This is completely straight-line code; there are no decisions to make that would affect code execution. We need to support conditionals to incorporate such choices. Also, we’d like to be able to support compound expressions like binary, infix operators (or eventually, function calls), and to do that we’ll need some more careful management of data.
Let’s start with conditionals, and move on to compound expressions second.
1 Growing the language: adding conditionals
1.1 The new concrete syntax
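We extend our grammar of expressions with a branching form, matching the syntax used in the examples below:

‹expr›: ... | if ‹expr› : ‹expr› else: ‹expr›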
1.2 Examples and semantics
Currently our language includes only integers as its values. We’ll therefore define conditionals to match C’s behavior: if the condition evaluates to a nonzero value, the then-branch will execute, and if the condition evaluates to zero, the else-branch will execute. It is never the case that both branches should execute.
Concrete Syntax       | Answer
if 1: 6 else: 7       | 6
if 0: 6 else: 7       | 7
if sub1(1): 6 else: 7 | 7
Unlike C, though, if-expressions are indeed expressions: they evaluate to a value, which means they can be composed freely with the other expression forms in our language.
Do Now!
Construct larger examples, combining if-expressions with each other or with let-bindings, and show their evaluation.
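For instance, one such combination, and its evaluation:

let x = (if sub1(1): 6 else: 7) in if x: add1(x) else: 0
  ==> let x = 7 in if x: add1(x) else: 0
  ==> if 7: add1(7) else: 0
  ==> add1(7)
  ==> 8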
1.3 The new abstract syntax
type expr = ...
| EIf of expr * expr * expr (* condition, then branch, else branch *)
1.4 Enhancing the transformations: Jumping around
1.4.1 Comparisons and jumps
To compile conditionals, we need to add new assembly instructions that allow us
to change the default control flow of our program: rather than proceeding
sequentially from one instruction to the next, we need jumps to
immediately go to an instruction of our choosing. The simplest such form is
just jmp SOME_LABEL
, which unconditionally jumps to the named label in
our program. We’ve seen only one label so far, namely
our_code_starts_here
, but we can freely add more labels to our program to
indicate targets of jumps. More interesting are conditional jumps,
which only jump based on some test; otherwise, they simply fall through to the
next instruction.
To trigger a conditional jump, we need to have some sort of comparison.
The instruction cmp arg1, arg2
compares its two arguments (by computing arg1 - arg2), and sets
various flags whose values are used by the conditional jump instructions:
Instruction | Jump if ...
je LABEL    | ... the two compared values are equal
jne LABEL   | ... the two compared values are not equal
jl LABEL    | ... the first value is less than the second
jle LABEL   | ... the first value is less than or equal to the second
jg LABEL    | ... the first value is greater than the second
jge LABEL   | ... the first value is greater than or equal to the second
jb LABEL    | ... the first value is less than the second, when treated as unsigned
jbe LABEL   | ... the first value is less than or equal to the second, when treated as unsigned
Some conditional jumps are triggered by arithmetic operations, instead:
Instruction | Jump if ...
jz LABEL    | ... the last arithmetic result is zero
jnz LABEL   | ... the last arithmetic result is non-zero
jo LABEL    | ... the last arithmetic result overflowed
jno LABEL   | ... the last arithmetic result did not overflow
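For example, this snippet (a minimal illustration) jumps to the label taken, skipping the mov in between:

  mov eax, 5
  cmp eax, 5        ; compares 5 with 5, setting the equality flag
  je taken          ; taken, since the two compared values are equal
  mov eax, 0        ; never executed
taken: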
Do Now!
Consider the examples of if-expressions above. Translate them manually to assembly.
Let’s examine the last example above: if sub1(1): 6 else: 7.
Which of the following could be valid translations of this expression?

Version 1:
  mov eax, 1
  sub eax, 1
  cmp eax, 0
  je if_false
  mov eax, 6
  jmp done
if_false:
  mov eax, 7
done:

Version 2:
  mov eax, 1
  sub eax, 1
  cmp eax, 0
  je if_false
  mov eax, 6
if_false:
  mov eax, 7
done:

Version 3:
  mov eax, 1
  sub eax, 1
  cmp eax, 0
  jne if_true
if_true:
  mov eax, 6
  jmp done
  mov eax, 7
done:

Version 4:
  mov eax, 1
  sub eax, 1
  cmp eax, 0
  jne if_true
  mov eax, 7
  jmp done
if_true:
  mov eax, 6
done:
The first two follow the structure of the original expression most closely, but the second has a fatal flaw: once the then-branch finishes executing, control falls through into the else-branch when it shouldn’t. The third version flips the condition and the target of the jump, but tracing carefully through it reveals there is no way for control to reach the else-branch. Likewise, tracing carefully through the first and last versions reveal they could both be valid translations of the original expression.
Working through these examples should give a reasonable intuition for how to compile if-expressions more generally: we compile the condition, check whether it is zero and if so jump to the else branch and fall through to the then branch. Both branches are then compiled as normal. The then-branch, however, needs an unconditional jump to the instruction just after the end of the else-branch, so that execution dodges the unwanted branch.
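Schematically, every if-expression compiles to the following shape (the labels must eventually be made unique, as we'll see below):

  ;; ...compiled condition, leaving its value in EAX...
  cmp eax, 0
  je if_false
  ;; ...compiled then-branch...
  jmp done
if_false:
  ;; ...compiled else-branch...
done: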
Do Now!
Work through the initial examples, and the examples you created earlier. Does this strategy work for all of them?
Let’s try this strategy on a few examples. For clarity, we repeat the previous example below, so that the formatting is more apparent.

Original expression:
  if sub1(1): 6 else: 7
Compiled assembly:
  mov eax, 1
  sub eax, 1
  cmp eax, 0
  je if_false
if_true:
  mov eax, 6
  jmp done
if_false:
  mov eax, 7
done:

Original expression:
  let x = (if sub1(1): 6 else: 7) in if x: 8 else: 9
Compiled assembly:
  mov eax, 1
  sub eax, 1
  cmp eax, 0
  je if_false
if_true:
  mov eax, 6
  jmp done
if_false:
  mov eax, 7
done:
  mov [esp-4], eax
  mov eax, [esp-4]
  cmp eax, 0
  je if_false
if_true:
  mov eax, 8
  jmp done
if_false:
  mov eax, 9
done:
The last example is broken: the various labels used in the two if-expressions are duplicated, which leads to illegal assembly:
$ nasm -f aout -o output/test1.o output/test1.s
output/test1.s:20: error: symbol `if_true' redefined
output/test1.s:23: error: symbol `if_false' redefined
output/test1.s:25: error: symbol `done' redefined
We need to generate unique labels for each expression.
1.4.2 Approach 1: Gensym
One common approach is to write a simple function that generates unique symbols every time it’s called, by keeping track of a mutable counter:
let gensym =
  let counter = ref 0 in
  (fun basename ->
    counter := !counter + 1;
    Printf.sprintf "%s_%d" basename !counter);;
Note that counter is declared in a let-expression scoped within the binding of gensym, so nothing else in the compiler can read or reset it. This approach works, and is simple to implement and simple to understand. However, it
does have a readability drawback: the generated names bear no connection to the
expressions that produced them, making it hard to trace backwards from the
generated output to the relevant source expressions. Additionally, it assumes
that only one stream of names is ever needed in the compiler. Worse, it
complicates using gensym in testing, as the precise numbers it generates depend
on the entire history of calls to gensym,
which makes writing tests very brittle.
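For example, the name attached to any one expression depends on every call that came before it:

gensym "if_false";;  (* "if_false_1" *)
gensym "done";;      (* "done_2" *)
gensym "if_false";;  (* "if_false_3", not "if_false_2" *)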
1.4.3 Approach 2: Tagging
In the last assignment, recall that our definition of expr
was slightly
more complicated than that presented above: it was parameterized by an
arbitrary type, allowing us to stash any data we wanted at the nodes of our
AST:
type 'a expr =
| ENumber of int * 'a
| EId of string * 'a
| ELet of (string * 'a expr * 'a) list * 'a expr * 'a
| EPrim1 of prim1 * 'a expr * 'a
...
We originally used this flexibility to tag every expression with its source location information, so that we could give precisely-located error messages when scoping problems arose. But this parameter is more flexible than that: we might consider walking the expression and giving every node a unique identifier:
type tag = int
let tag (e : 'a expr) : tag expr =
let rec help (e : 'a expr) (cur : tag) : (tag expr * tag) =
match e with
| EPrim1(op, e, _) ->
let (tag_e, next_tag) = help e (cur + 1) in
(EPrim1(op, tag_e, cur), next_tag)
| ...
in
let (tagged, _) = help e 1 in tagged;;
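The elided cases all follow the same pattern: use cur as the tag for the current node, and thread the incremented counter through the subexpressions. For instance, the number and if-expression cases might look like this (a sketch):

| ENumber(n, _) -> (ENumber(n, cur), cur + 1)
| EIf(cond, thn, els, _) ->
  let (tag_c, next1) = help cond (cur + 1) in
  let (tag_t, next2) = help thn next1 in
  let (tag_e, next3) = help els next2 in
  (EIf(tag_c, tag_t, tag_e, cur), next3)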
This function is completely determined by its input, without relying on mutable state, making it much easier to work with in the context of testing. It also implicitly resets counting every time it’s called, making the successive phases of the compiler more readable and independent. Lastly, if we use these ids as the basis for our generated names, then our generated names are easily traceable back to the expressions that created them, making debugging much easier.
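For instance, assuming prim1 includes an Add1 constructor and the ENumber case sketched above, tagging the same expression always produces the same ids:

tag (EPrim1(Add1, ENumber(5, ()), ()));;
(* - : tag expr = EPrim1 (Add1, ENumber (5, 2), 1) *)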
1.4.4 Putting it together: compiling if-expressions
If we use our decorated 'a expr
definition and our tag
function
above, then compiling if-expressions becomes:
let rec compile_expr (e : tag expr) (si : int) (env : (string * int) list) =
match e with
...
| EIf(cond, thn, els, tag) ->
    let else_label = Printf.sprintf "if_false_%d" tag in
    let done_label = Printf.sprintf "done_%d" tag in
    (compile_expr cond si env) @
[
ICmp(Reg(EAX), Const(0));
IJe(else_label)
]
@ (compile_expr thn si env)
@ [ IJmp(done_label); ILabel(else_label) ]
@ (compile_expr els si env)
@ [ ILabel(done_label) ]
let compile e =
let tagged = tag e in
let compiled = compile_expr tagged 1 [] in
(* ... surround compiled with prelude as needed ... *)
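For the earlier example if sub1(1): 6 else: 7, this now generates distinct labels derived from the if-expression's tag (here assumed to be 1, per the tagging scheme above):

  mov eax, 1
  sub eax, 1
  cmp eax, 0
  je if_false_1
  mov eax, 6
  jmp done_1
if_false_1:
  mov eax, 7
done_1: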
1.5 Testing
As always, we must test our enhancements. Properly testing if-expressions is slightly tricky right now: we need to confirm that
We always generate valid assembly
If-expressions compose properly with each other, and with other expressions in the language.
The generated assembly only ever executes one of the two branches of an if-expression
Testing the first property amounts to testing the tag
function, to confirm
that it never generates duplicate ids in a given expression. Testing the next
one can be done by writing a suite of programs in this language and confirming
that they produce the correct answers. Testing the last requirement is
hardest: we don’t yet have a way to signal errors in our programs (for example,
the compiled equivalent of failwith "This branch shouldn't run!"). For
now, the best we can do is manually inspect the generated output and confirm
that it is correct-by-construction, but this won’t suffice forever.
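Returning to the first property: a uniqueness check over tags might look like this (a sketch over the expression forms we have so far, using List.concat_map from OCaml 4.10+):

(* Collect every tag appearing anywhere in an expression. *)
let rec all_tags (e : tag expr) : tag list =
  match e with
  | ENumber(_, t) | EId(_, t) -> [t]
  | EPrim1(_, e1, t) -> t :: all_tags e1
  | ELet(binds, body, t) ->
    (t :: List.concat_map (fun (_, e, t') -> t' :: all_tags e) binds) @ all_tags body
  | EIf(c, thn, els, t) -> (t :: all_tags c) @ all_tags thn @ all_tags els

(* Tags are unique iff sorting then deduplicating changes nothing. *)
let tags_unique (e : tag expr) : bool =
  let ts = List.sort compare (all_tags e) in
  ts = List.sort_uniq compare ts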
Exercise
Add a new
Prim1
operator to the language, that you can recognize and deliberately compile into invalid assembly that crashes the compiled program. Use this side-effect to confirm that the compilation of if-expressions only ever executes one branch of the expression.
2 Growing the language: adding infix operators
2.1 The new concrete syntax
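We extend the grammar with binary infix operators, matching the syntax used in the examples below (parentheses indicate grouping):

‹expr›: ... | ‹expr› + ‹expr› | ‹expr› - ‹expr› | ‹expr› * ‹expr› | ( ‹expr› )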
2.2 Examples and semantics
These new expression forms should be familiar from standard arithmetic
notation. Note that there is no notion of operator precedence; instead, we use
the tree structure to indicate grouping. For this language, we will decide
that the order of evaluation should be leftmost-innermost: that is, in the
expression (2 - 3) + (4 * 5)
, the evaluation order should step through
(2 - 3) + (4 * 5)
==> -1 + (4 * 5)
==> -1 + 20
==> 19
rather than the possible alternative of doing the multiplication first.
2.3 Enhancing the abstract syntax
type prim2 =
| Plus
| Minus
| Times
type expr = ...
| EPrim2 of prim2 * expr * expr
We simply add a new constructor describing our primitive binary operations, and an enumeration of what those operations might be.
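For example, the expression (2 - 3) + (4 * 5) from above corresponds to the tree (writing ENumber n for number literals):

EPrim2(Plus,
       EPrim2(Minus, ENumber 2, ENumber 3),
       EPrim2(Times, ENumber 4, ENumber 5))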
2.4 Enhancing the transformations: Normalization
Exercise
What goes wrong with our current naive transformations? How can we fix them?
Let’s try manually “compiling” some simple binary-operator expressions to assembly:
Original expression:
  2 + 3
Compiled assembly:
  mov eax, 2
  add eax, 3

Original expression:
  2 - 3
Compiled assembly:
  mov eax, 2
  sub eax, 3

Original expression:
  4 * 5
Compiled assembly:
  mov eax, 4
  imul eax, 5

Original expression:
  (2 - 3) + (4 * 5)
Compiled assembly:
  mov eax, 2
  sub eax, 3
  ;; ...now what? We need EAX for the multiplication,
  ;; but it currently holds the result of the subtraction
Do Now!
Convince yourself that using a let-bound variable in place of any of these constants will work just as well.
So far, our compiler has only ever had to deal with a single active expression
at a time: it moves the result into EAX
, increments or decrements it, and
then potentially moves it somewhere onto the stack, for retrieval and later
use. But with our new compound expression forms, that won’t suffice: the
execution of (2 - 3) + (4 * 5)
above clearly must stash the result of
(2 - 3)
somewhere, to make room in EAX
for the subsequent
multiplication. We might try to use another register (EBX
, maybe?), but
clearly this approach won’t scale up, since there are only a handful of
registers available. What to do?
2.4.1 Immediate expressions
Do Now!
Why did the first few expressions compile successfully?
Notice that for the first few expressions, all the arguments to the operators were immediately ready: they required no further computation, because they were either constants or variables whose values could be read directly off the stack.
Perhaps we can salvage the final program by transforming it somehow, such that all its operations are on immediate values, too.
Do Now!
Try to do this: Find a program that computes the same answer, in the same order of operations, but where every operator is applied only to immediate values.
Note that, conceptually, our last program is equivalent to the following:
let first = 2 - 3 in
let second = 4 * 5 in
first + second
This program has decomposed the compound addition expression into the sum of two let-bound variables, each of which is a single operation on immediate values. We can easily compile each individual operation, and we already know how to save results to the stack and restore them for later use, which means we can compile this transformed program to assembly successfully.
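Each piece of this program is now within our compiler's reach. By hand, using our existing stack-slot convention (a sketch, assuming first lands in [esp-4] and second in [esp-8]):

  mov eax, 2
  sub eax, 3
  mov [esp-4], eax    ; let first = 2 - 3
  mov eax, 4
  imul eax, 5
  mov [esp-8], eax    ; let second = 4 * 5
  mov eax, [esp-4]    ; load first
  add eax, [esp-8]    ; first + second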
This transformation can be generalized and systematized, and thereby make the rest of compilation succeed where currently it would fail. Let’s examine it more carefully.
3 A-Normal Form
Our goal is to transform our program such that every operator is applied only to immediate values. We will call such a form Administrative Normal Form, or A-Normal Form or ANF for short. It’s worth writing a predicate to check this property for us, to formalize what we mean:
let rec is_anf (e : 'a expr) : bool =
match e with
| EPrim1(_, e, _) -> is_imm e
| EPrim2(_, e1, e2, _) -> is_imm e1 && is_imm e2
| ELet(binds, body, _) ->
List.for_all (fun (_, e, _) -> is_anf e) binds
&& is_anf body
| EIf(cond, thn, els, _) -> is_imm cond && is_anf thn && is_anf els
| _ -> is_imm e
and is_imm e =
match e with
| ENumber _ -> true
| EId _ -> true
| _ -> false
;;
(We will say that a program is "in ANF" when it satisfies this property; we will also use the word "ANF" colloquially as a verb, meaning to convert a program into A-normal form.)
Most of this function is straightforward. The definition of is_imm
captures our notion of which expressions are immediate: only numbers and
identifiers. The first two cases of is_anf
ensure that the operands to
our primitive operators are immediate. The next ensures that the bound
expressions and body of a let-binding are all in A-normal form. The definition for
if-expressions, though, is slightly surprising: it requires that the condition
of the if be immediate, and not just in ANF. This makes some intuitive
sense if we think of the “check if zero and branch” behavior of the
if-expression as an odd kind of Prim1 operation: an if-expression in ANF should
not also be responsible for evaluating its condition down to a value.
3.1 Converting a program to A-Normal Form
Exercise
Try to systematically define a conversion function
anf : tag expr -> unit expr
such that the resulting expression satisfies is_anf.
There are several different ways to define “the” conversion into
ANF.1 We present a simple one first, then refine it.
The central idea is that to convert some expression e1 + e2
(or any other
operator), we must somehow obtain an immediate value describing the answer
of e1
and another describing the answer of e2
, which we can then
use for the addition. Those immediate values might be constants, in which case
we’re in luck. But if either of them is a variable, then we clearly need some
context to supply its definition. That context is
going to be a list of variable bindings: in effect, the bindings of a let-expression wrapped around the final answer.
As a first guess, we might try to design our function as follows:
(* The result is a pair of an answer and a context.
The answer must be an immediate, and the context must be a list of bindings
that are all in ANF. *)
let rec anf_1 (e : tag expr) : (unit expr * (string * unit expr) list) =
  match e with
  ...
  | EPrim2(Plus, left, right, tag) ->
    let (left_ans, left_context) = anf_1 left in
    let (right_ans, right_context) = anf_1 right in
    let temp = Printf.sprintf "plus_%d" tag in
    (EId(temp, ()), (* the answer *)
     left_context @ (* the context needed for the left answer to make sense *)
     right_context @ (* the context needed for the right answer to make sense *)
     [(temp, EPrim2(Plus, left_ans, right_ans, ()))]) (* definition of the answer *)
This is definitely on the right track, but it has the wrong signature: we want
to return an actual ANF expression, not an expression paired with a context.
Fortunately, it is very easy to convert the output of anf_1
to the desired
form.
Do Now!
Do this. Be careful to preserve the appropriate order of the context bindings.
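One way to do it (a sketch): fold the context into nested let-bindings around the answer, outermost binding first, so that earlier bindings are in scope for later ones:

let anf (e : tag expr) : unit expr =
  let (ans, context) = anf_1 e in
  List.fold_right
    (fun (name, value) body -> ELet([(name, value, ())], body, ()))
    context
    ans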
Exercise
Define the case for if-expressions. How should the branches be handled? Be careful to ensure that only one branch gets executed —
including any relevant context!
This ANF conversion is somewhat sloppy: it will generate far too many temporary variables.
Do Now!
Find a simple expression that need not generate any extra variables, but for which
anf_1
generates at least one unneeded variable.
Exercise
Refine the
anf_1
function into two helper functions,anf_C
that can produce an answer that is any ANF expression, andanf_I
that can produce only immediate values as answers.
3.2 An alternate approach: Just use the stack!
One could make the argument that converting to ANF is a complicated waste of
effort. We could simply walk the tree of EPrim2
expressions, evaluate
their left arguments and push them onto the stack, then evaluate their right arguments and pop the saved left results back off when the operation needs them.
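For example, (2 - 3) + (4 * 5) might compile to something like:

  mov eax, 2
  sub eax, 3
  push eax          ; stash the left result on the stack
  mov eax, 4
  imul eax, 5
  pop ebx           ; recover the left result
  add eax, ebx      ; combine the two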
On the face of it, it is indeed simpler. But as we’ll see later, this will
cause some additional headaches, because it entails that our stack frames are
of dynamic size, growing and shrinking depending on the complexity of the
expression being evaluated. This isn’t inherently a bad thing, but it makes it much harder to reason statically about how much stack space our programs need.
Additionally, though it isn’t apparent so far, having code in A-normal form actually enables some subsequent compiler passes, like optimizations, that would be incredibly difficult to pull off otherwise. The advantages of keeping the compiler phases less tightly coupled, along with the later benefits of having code in a normalized form, tend to make ANF the winning engineering tradeoff.
3.3 Testing
Do Now!
Run the given source programs through our compiler pipeline. It should give us exactly the handwritten assembly we intend. If not, debug the compiler until it does.
1Which brings up the obvious complaint: if it is not unique, then how
is it a normal form? The original definition of ANF was not given
algorithmically, but rather via a set of axioms for reasoning about the equivalence of
programs. When those axioms are applied systematically, the result is indeed
unique, and hence “normal”. In practice, the theoretically pristine
definition of ANF is prohibitively expensive, due to if-expressions: it
says that let x = (if c: t else: f) in body should be
converted into if c: (let x = t in body) else: (let x = f in body), which
causes body to be duplicated. If body contains another
if-expression, then that one is duplicated twice; the general case leads to
exponential code size.
So engineering compromises are made. The essence of ANF, though, remains the
same as the pristine version: every operation in the resulting program must
operate only on immediate values.