Lecture 11.1: Compilers I

From interpreters to compilers

For the past two weeks, we have been focusing on interpreters to model a program's semantics. While interpreters are very instructive, they are also an oversimplification of what is happening when we actually run a program.

In practice, most languages are actually performing some amount of compilation. Compilers move us closer to what is actually happening when code runs on hardware: instead of immediately executing code, a compiler will translate code into an intermediate representation (IR) or bytecode that is easier for a machine to run.

For instance, if we took the following OCaml program:

(* example.ml *)
let x = 1 + 2

We can have ocamlc show us the bytecode instructions that will run on the OCaml virtual machine using the -dinstr flag:

$ ocamlc -dinstr -c example.ml
    const 1
    offsetint 2
    push
    acc 0
    makeblock 1, 0
    pop 1
    setglobal Example!

Compilers are everywhere:

OCaml can produce bytecode (with ocamlc) or native code (ocamlopt)
Java source is compiled into bytecode that is run on the Java Virtual Machine.
Many languages have backends for WebAssembly, which can be run in the browser.
Many implementations of Python will produce bytecode in the form of .pyc files, a common artifact when running a Python script from the command line. Some implementations will also perform just-in-time bytecode compilation (as in the case of PyPy or Numba).
There is also the LLVM, a compiler framework that provides an IR that generalizes many hardware targets. C/C++, Rust, Haskell, Julia, and more all have LLVM-based backends.

Compilers consist of a series of functions that perform translations between source languages and target languages or intermediate representations. It's not uncommon for there to be many compiler passes in a single compiler. Each pass is responsible for producing an IR that makes programs easier to analyze or run. A simple compiler could perform the following passes:

Parse → Desugar → Type Inference/Checking → Code Generation → Optimization

Parse: translates concrete syntax into an AST
Desugar: translate some higher-level operations into lower-level ones.
Type Inference/Checking: annotates expressions with types and checks that types are correct.
Code Generation: converts our AST representation into a representation that more closely resembles bytecode, IR, or native assembly
Optimization: Makes the code more efficient to compute.

This week, we will be looking into compilers and learning how to specify that the compilers we write are correct.

A Stack-based language

Our plan is to compile our calculator language into a stack-based target language. This target language will take inspiration from real-world virtual machines like the JVM and the OCaml VM.

The model of computation in this new language is that we will apply a number of instructions operations_, each of which affects a stack of values and an environment (similar to the environment in our calculator language's semantics). Let us now introduce the instructions:

type value = int
type op = | Add | Mul | And | Or | LT | EQ
type instr =
  (* Language 1.0: push / primapp / get / set *)
  | Push of value
  | PrimApp of op
  | Set of string
  | Get of string

  (* 2.0: control flow via jumps *)
  | Jump of int
  | JumpIfNotOne of int

The instructions may be of the form:

Push v, which pushes the constant v onto the top of the stack;
PrimApp o, which pops two elements off the top of the stack, applies o to those arguments, and pushes the result of applying o onto the stack;
Set s, which pops a value off the top of the stack, and updates the environment accordingly;
Get s, which grabs the value of s in the environment, and pushes its value onto the top of the stack;
and the two control-flow operators Jump i and JumpIfNotOne i, which we will discuss later this week.

A program in our stack language is then simply a list of instructions (to be applied left to right):

type program = instr list

Let's look at some programs from our calculator language and show the equivalent stack-based program. First, let's just look at a calc program that is the constant 2. The equivalent stack program will push the number 2 onto the stack.

Calc: 2
StackLang: [Push 2]

If we wanted to add two values, we would need to push two constants onto the stack and then apply the add operator:

Calc: 2 < 4
StackLang: [Push 2; Push 4; PrimApp LT]

If we wanted to use a let-binder, we would use the following list of instructions:

Calc: let x = 6 in x 
StackLang: [Push 6; Set "x"; Get "x"]

Evaluating StackLang Programs

Let's look at the Calc expression 2 < 4, which corresponds to the following StackLang program:

[Push 2; Push 4; PrimApp LT]

In our stack language, there are several pieces of state we need to keep track of:

the list of instructions to run;
the stack (of values), which is initialized as empty; and
a heap, which is a StrMap from identifiers to values.

We put these in a tuple and call this a configuration. Evaluating the program step-by-step, we can see how the program's configuration changes:

step 0: ⟨[Push 2; Push 4; PrimApp LT],       [], {}⟩
step 1: ⟨        [Push 4; PrimApp LT],    2::[], {}⟩
step 2: ⟨                [PrimApp LT], 4::2::[], {}⟩
step 3: ⟨                          [],    1::[], {}⟩

As Push operations are executed, we add to the stack as expected, but notice that applying arguments to operations will be sensitive to the stack order. It's also important to note that, in this language, Booleans will be represented as 0 or 1 (true and false, respectively).

Let's look at one more program that is the equivalent of a let-binding let x = 6 in x from our calculator language:

stack: [Push 6; Set "x"; Get "x"]

This time we will use the heap:

step 0: ⟨[Push 6; Set "x"; Get "x"],       [], {}⟩
step 1: ⟨        [Set "x"; Get "x"],    6::[], {}⟩
step 2: ⟨                 [Get "x"],       [], {x ↦ 6}⟩
step 3: ⟨                        [],    6::[], {x ↦ 6}⟩

When a value is bound to some variable, in this language we will move it from the stack to the heap. Using Get will copy the value from the heap back onto our stack.

Notice that our stack language is more expressive than our calculator language, and there are many more ways in which we can have both good and bad behavior. For instance, our language can go wrong in many more ways. For example, if we call [PrimApp Add] without pushing onto the stack, this will result in a runtime error, since Add will expect two arguments on the stack.

Evaluating a Stack-Based Language

To define our stack-based language, we first need to evaluate operations. This will be the same as how this operation worked in Calc:

let eval_op (o : op) (vs : value list) : value =
  match o, vs with
  | Add, [n1; n2] -> (n1 + n2)
  | Mul, [n1; n2] -> (n1 * n2)
  | LT,  [n1; n2] -> (if n1 < n2 then 1 else 0)
  | EQ,  [n1; n2] -> (if n1 = n2 then 1 else 0)
  | And, [n2; n1] -> (if n1 = 1 && n2 = 1 then 1 else 0)
  | Or,  [n2; n1] -> (if n1 = 1 || n2 = 1 then 1 else 0)
  | _ -> failwith "eval_op: incorrect arguments"

As we discussed before, Booleans will be represented as 1 and 0, as opposed to true and false. (For simplicity, let's only consider binary operations for now.)

Stepping from configuration to configuration

When we run a program, we need to keep track of a configuration that consists of:

the instr list that we need to execute,
the stack of values, and
a heap of assigned identifiers.

We will represent the heap using a value StrMap.t and define our config as follows:

module StrMap = Map.Make(String)
type config = (instr list * value list * value StrMap.t)

Previously, in our calc interpreter, we wrote a recursive eval function that directly produced values. For stack programs, we will instead define a non-recursive function that computes individual steps:

let step (cfg : config) : config = ...

The base case of this function will be when we have no more instructions to evaluate. At this point, we will return to the same configuration. In all other cases, we will take the next instruction in the program and perform some additional work:

fun step (cfg : config) : config =
  match cfg with
  | ([],  vs, env) -> cfg
  | (hd::tl, vs, env) -> ...

The first few cases are straightforward: - Push v will add a value v to our stack, - Set x is the equivalent of an assignment and will move a value from the stack to the heap at x. If there is no value currently on the stack, we will encounter a runtime error, and - Get x will retrieve the heap value at x, copying this value back onto the stack.

let step (cfg : config) : config =
  match cfg with
  | ([],  vs, env) -> cfg
  | (hd::tl, vs, env) ->
    (match hd with
    | Push v -> (tl, v::vs, env)
    | Set s ->
      (match vs with
      | [] -> failwith "saw a malformed program"
      | v :: vs -> (tl, vs, StrMap.add s v env))
    | Get s ->
      (match StrMap.find_opt s env with
      | None -> failwith ("unbound identifier " ^ s)
      | Some v -> (tl, v::vs, env))
    (* ... *)

When applying operations with PrimApp, we want to use our eval_op function and place the resulting value onto the stack. eval_op will return a runtime error if the language-level assumptions do not hold, but if there are not enough arguments on the stack, we will report a stack underflow error.

    (* ... *) 
    | PrimApp op ->
      (match vs with
      | [] | [_] -> failwith "stack underflow"
      | v1::v0::vs ->
        let v = eval_op op [v0; v1] in
        (tl, v::vs, env))
    (* ... *)

Notice that the ordering of v1 and v2 is flipped. This change in ordering happens because of stack-based evaluation: we will cons new values onto a list, representing our stack, so earlier arguments will appear at the right side of our list (ie, bottom of the stack). This is an important detail when performing comparisons, specifically when writing down the definition for LT. By making the change here, we are able to keep our eval_op written in a way that is more natural and aligns with our calculator language.

Finally, we will include the Jump* statements, which we will leave until Wednesday.

    | Jump n -> failwith "part of lang2"
    | JumpIfNotOne n -> failwith "part of lang2")

Examples

Let's run a few programs in utop and observe what happens. First, the program that corresponds to the Calc expression 2 + 4:

utop[0]> step ([Push 2; Push 4; PrimApp Add], [], StrMap.empty);;
- : config = ([Push 4; PrimApp Add], [2], <abstr>)

utop[1]> step (step ([Push 2; Push 4; PrimApp Add], [], StrMap.empty));;
- : config = ([PrimApp Add], [4; 2], <abstr>)

utop[2]> step (step (step ([Push 2; Push 4; PrimApp Add], [], StrMap.empty)));;
- : config = ([], [6], <abstr>)

Unfortunately, the StrMap will not be rendered in utop, but we can see that it performs the expected operations. Now, let's see let x = 6 in x from Calc:

utop[3]> step ([Push 6; Set "x"; Get "x"], [], StrMap.empty);;
- : config = ([Set "x"; Get "x"], [6], <abstr>)

utop[4]> step (step ([Push 6; Set "x"; Get "x"], [], StrMap.empty));;
- : config = ([Get "x"], [], <abstr>)

utop[5]> step (step (step ([Push 6; Set "x"; Get "x"], [], StrMap.empty)));;
- : config = ([], [6], <abstr>)

To run the stepping function until we return a value, we can define a recursive loop function that repeatedly runs our stepper until the program is finished, and an execute function to return the final value on the top of the stack:

let rec loop (c : config) : config =
  let cfg' = step c in 
  match cfg' with 
  | ([], _, _) -> cfg' (* No more operations to execute *)
  | _ -> loop cfg' (* Otherwise, keep running *)

let execute (p : program) : value =
  match loop (p, [], StrMap.empty) with
  | (_,  [], _) -> failwith "program did not return a value."
  | (_, v :: _, _) -> v