
Optimization

Last week, we compiled our calculator language into StackLang and saw a more faithful representation of what actually happens when we run our programs. We showed that our compiler was correct using the property:

∀ (e : expr) (v : calc_value),
  eval_opt StrMap.empty e = Some v ⇒ execute (compile e) = conv v

In this lecture, we will look at how a compiler can rewrite our programs to make them more efficient, as well as how to specify that these optimizations are correct.

Our Calc language from 11.3
type op = | Add | Mul | And | Or | LT | EQ
module StrMap = Map.Make(String)

type value =
  | Nat  of int
  | Bool of bool

type expr =
  | Value of value
  | Var   of string
  | Let   of string * expr * expr
  | App   of op * expr list
  | ITE   of expr * expr * expr

let eval_calc_op (op : op) (args : value list) : value =
  match (op, args) with
  | (Add, [Nat n1; Nat n2]) -> Nat (n1 + n2)
  | (Mul, [Nat n1; Nat n2]) -> Nat (n1 * n2)
  | (And, [Bool b1; Bool b2]) -> Bool (b1 && b2)
  | (Or , [Bool b1; Bool b2]) -> Bool (b1 || b2)
  | (LT , [Nat n1; Nat n2]) -> Bool (n1 < n2)
  | (EQ , [Nat e1; Nat e2]) -> Bool (e1 = e2)
  | (EQ , [Bool e1; Bool e2]) -> Bool (e1 = e2)
  | _ -> failwith "runtime error: arity mismatch or unexpected argument"

let rec eval (env : value StrMap.t) (e : expr) : value =
  match e with
  | Value v -> v
  | Var x ->
    (match StrMap.find_opt x env with
    | Some v -> v
    | None   -> failwith ("Unbound Variable: " ^ x))
  | Let (x, e1, e2) ->
    let v1 = eval env e1 in
    let env = StrMap.add x v1 env in
    eval env e2
  | ITE (g, t, f) ->
    (match eval env g with
    | Bool true  -> eval env t
    | Bool false -> eval env f
    | _ -> failwith "guard is not a bool")
  | App (op, args) -> eval_calc_op op (List.map (eval env) args)

What is Optimization?

Optimization has been around since the early days of computing, as early as IBM's FORTRAN H compiler in the late 1960s. At a high level, optimizations rewrite programs into "equivalent" programs that execute faster while preserving the semantics of the original program. To make this happen, we can either write a pure program-to-program transformation (as we will do today), or construct an intermediate representation that is easier to optimize before producing machine instructions.

One example of this is in OCaml, where we have the ocamlopt command line tool. In contrast to the ocamlc binary, which generates code to be run on the OCaml VM, ocamlopt generates native machine code. Let's write a simple example file, one that we saw before in lecture 11.1:

(* example.ml *)
let x = 1 + 2

When using the native OCaml compiler ocamlopt, we can see stages of how OCaml is translated to machine code:

$ ocamlopt -dlambda -c example.ml
(seq
  (let (x/270 =[int] (+ 1 2))
    (setfield_ptr(root-init) 0 (global Example!) x/270))
  0)

This looks a little bit like our calculator language. When we move to an IR that is closer to machine code using the -dcmm flag, we get optimized output in which the call (+ 1 2) has been compiled into the instruction (extcall "caml_initialize" "camlExample" 7 ->unit).

$ ocamlopt  -dcmm -c example.ml
cmm:
(data)
(data int 1792 global "camlExample" "camlExample": int 1)
(data
 global "camlExample.gc_roots"
 "camlExample.gc_roots":
 addr "camlExample"
 int 0)
(function camlExample.entry ()
 (extcall "caml_initialize" "camlExample" 7 ->unit) 1)

(data)

You can see this reflected in the fact that changing our example.ml from 1 + 2 to 100 + 200 will result in (extcall "caml_initialize" "camlExample" 601 ->unit). The odd constants come from OCaml's runtime representation of integers: an int n is stored as the tagged machine word 2n + 1, so 3 appears as 7 and 300 as 601. There is also ocamlopt -S available if you want to see the final machine instructions on your local machine.
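
To make the tagging concrete, here is a small sketch; encode_int is a hypothetical helper of ours, not part of the OCaml toolchain:

(* OCaml's native-code runtime stores an int n as the machine word 2 * n + 1;
   the low bit set to 1 distinguishes immediate ints from pointers *)
let encode_int (n : int) : int = (2 * n) + 1

let () =
  assert (encode_int 3 = 7);    (* 1 + 2 folds to 3, which is stored as 7 *)
  assert (encode_int 300 = 601) (* 100 + 200 folds to 300, stored as 601 *)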

We can make sense of these examples through the lens of superoptimization.

  • Alexia Massalin, in 1987, sought to understand how to get truly optimal code.
  • Her idea was to find the shortest sequence of instructions to compute a given function.
  • In contrast to most compiler optimizations, which aim merely to improve performance.

These optimizations happen at the very lowest level of a compiler, rewriting numeric operations into sequences of instructions that can be counter-intuitive, even baffling.

More specifically, superoptimization seeks to replace loop-free sequences of instructions with equivalent sequences that are somehow "better" — usually shorter and, as a result, faster.

The resulting programs often rely on somewhat complex interactions between different instructions, and are sometimes "startling" programs that depend on strange overlaps between the representations of different values. Classic examples replace code with branches by equivalent code that doesn't use branches; for our program, it is perhaps counterintuitive that 1 + 2 would be replaced with a register manipulation involving the number 7.
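
For a flavor of branch elimination, here is a hypothetical OCaml sketch (not drawn from Massalin's work; it assumes a 64-bit platform, where OCaml's int has 63 bits):

(* absolute value without a conditional: m = x asr 62 is 0 when x >= 0
   and -1 when x < 0, so (x lxor m) - m negates x exactly when x is negative *)
let abs_branchless (x : int) : int =
  let m = x asr 62 in
  (x lxor m) - m

let () = assert (abs_branchless (-5) = 5 && abs_branchless 5 = 5)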

Optimizing Calc with Constant Propagation

While superoptimization is an interesting problem and broad in scope, today we will look at optimizing Calc, and we will show that the programs before and after an optimization are equivalent. For example, our simple addition from before:

let x = 1 + 2 in x

Can also be said to be equivalent to let x = 3 in x or, more directly, the program 3.

The prevailing notion is that we want to reduce a program to the smallest possible equivalent value.

When we write an optimization pass, we want to ensure that our optimizations compute equivalent values. In this setting, "equivalent" means that both programs evaluate to the same value. To this end, we can state a simple specification of our optimizations as:

let prop_calc_verify_optimization (original : expr) (optimized : expr) : bool = ... 

Given this context, how should we write this specification?

let prop_calc_verify_optimization (original : expr) (optimized : expr) : bool =
    let vor = eval_opt StrMap.empty original in
    let vop = eval_opt StrMap.empty optimized in
    vor = vop
using val eval_opt : value StrMap.t -> expr -> value option.
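
As a quick sanity check (a sketch, assuming the eval_opt above is in scope), we can hand the specification our addition example and a hand-optimized form of it:

let original  = Let ("x", App (Add, [Value (Nat 1); Value (Nat 2)]), Var "x")
let optimized = Value (Nat 3)

(* both sides evaluate to Some (Nat 3), so the check passes *)
let () = assert (prop_calc_verify_optimization original optimized)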

Constant folding is the idea of replacing operations whose arguments are statically known with the result of evaluating them directly. Crucially, these evaluations happen at optimization time as opposed to run time, producing more efficient programs after compilation.

For our addition instruction above:

Concrete Syntax: let x = 1 + 2 in x
Abstract Syntax: Let("x", App(Add, [Value (Nat 1); Value (Nat 2)]), Var "x")

We would first map over the tree, find the App expression, observe that Add is being applied to two values, and replace the entire expression with Value (Nat 3), resulting in the program:

Let("x", Value (Nat 3), Var "x")

That said, this simple optimization is blocked by let-binders and if-then-else expressions: constant folding alone will not compile away the let binding of "x". Our constant folding optimization performs a tree-map, folding applications wherever their arguments are already values. This means that if we see the following program:

Concrete Syntax: let x = 1 + (2 + 3) in x
Abstract Syntax: Let("x", App(Add, [Value (Nat 1); App(Add, [Value (Nat 2); Value (Nat 3)])]), Var "x")

We will only produce:

Let("x", App(Add, [Value (Nat 1); Value (Nat 5)]), Var "x")

In order to fully optimize our programs, we will need to re-run this optimization until we reach a fixpoint.

We can write down the above constant folding optimization directly in OCaml. Unfortunately, writing a definition that mirrors our eval_calc_op from before requires a lot more pattern matching.

(* does not work *)
let rec constfold_eval_op (o : op) (es : expr list) : expr =
    match o, es with
    | Add, [Value (Nat i); Value (Nat j)] -> Value (Nat (i + j))
    | Mul, [Value (Nat i); Value (Nat j)] -> Value (Nat (i * j))
    | _ -> ... (* recursion requires a different `o` for each call *)
    (* the failure case differs *)

Instead, we can simply write a top-level tree-map, adding in the additional cases for different operations we would like to evaluate:

let rec constfold_simple (e : expr) : expr = 
    match e with
    | Value v -> Value v
    | Var s -> Var s
    | Let (s, e1, e2)  -> Let (s, constfold_simple e1, constfold_simple e2)
    | ITE (e1, e2, e3) -> ITE (constfold_simple e1, constfold_simple e2, constfold_simple e3)
    | App (o, es) ->
        (match o, es with
        | Add, [Value (Nat i); Value (Nat j)] -> Value (Nat (i + j))
        | Mul, [Value (Nat i); Value (Nat j)] -> Value (Nat (i * j))
        (* fold EQ only when both arguments have the same type of value,
           mirroring eval_calc_op *)
        | EQ,  [Value (Nat i); Value (Nat j)] -> Value (Bool (i = j))
        | EQ,  [Value (Bool i); Value (Bool j)] -> Value (Bool (i = j))
        | LT,  [Value (Nat i); Value (Nat j)] -> Value (Bool (i < j))
        | And, [Value (Bool i); Value (Bool j)] -> Value (Bool (i && j))
        | Or,  [Value (Bool i); Value (Bool j)] -> Value (Bool (i || j))
        | _ -> App (o, List.map constfold_simple es))

To run our optimization until we reach our fixpoint, we recurse until our program stops getting updated:

let rec optimize_with (opt : expr -> expr) (prog : expr) : expr =
  let prog' = opt prog in
  if prog' = prog
    then prog'
    else optimize_with opt prog'
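
To see why the fixpoint matters, consider the nested addition from above (a small sketch using the definitions so far):

(* one pass only folds the innermost application *)
let once = constfold_simple (App (Add, [Value (Nat 1); App (Add, [Value (Nat 2); Value (Nat 3)])]))
(* once = App (Add, [Value (Nat 1); Value (Nat 5)]) *)

(* iterating to a fixpoint folds it all the way down *)
let fully = optimize_with constfold_simple (App (Add, [Value (Nat 1); App (Add, [Value (Nat 2); Value (Nat 3)])]))
(* fully = Value (Nat 6) *)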

Constant folding also allows us to prune simple branching statements:

Concrete: if false then 7 else 10
Optimized: 10

We can add this to our constant folding pass as follows:

let rec constfold_simple (e : expr) : expr = 
    match e with
    ...
    | ITE (e1, e2, e3) -> 
      (match e1 with 
      | Value (Bool true) -> constfold_simple e2 
      | Value (Bool false) -> constfold_simple e3 
      | _ -> ITE (constfold_simple e1, constfold_simple e2, constfold_simple e3))
    ...

As before, the prevailing notion is that we want to reduce each program to the smallest possible equivalent value. So what do the resulting programs look like after this optimization? Just let bindings and values, as we saw in the first example.
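
For instance, here is roughly what a utop session would look like (hypothetical output; the exact printer format may differ):

utop # optimize_with constfold_simple (Let ("x", App (Add, [Value (Nat 1); App (Add, [Value (Nat 2); Value (Nat 3)])]), Var "x"));;
- : expr = Let ("x", Value (Nat 6), Var "x")

utop # optimize_with constfold_simple (ITE (Value (Bool false), Value (Nat 7), Value (Nat 10)));;
- : expr = Value (Nat 10)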

Constant propagation with simple substitution

To get rid of the remaining let bindings, we can propagate constants by substitution: whenever a let binds a variable to a value (or to another variable), we replace every occurrence of that variable in the body.

let rec substitute (x : string) (e : expr) (e' : expr) : expr = 
    match e' with
    | Value v -> Value v
    | Var s -> if s = x then e else (Var s)
    | Let (s, e1', e2') ->
        (* the binder s shadows x in e2', but e1' is still in the outer scope *)
        let e1'' = substitute x e e1' in
        if x = s then Let (s, e1'', e2')
        else Let (s, e1'', substitute x e e2')
    | ITE (e1', e2', e3') ->
        ITE (substitute x e e1', substitute x e e2', substitute x e e3')
    | App (o, es) -> App (o, List.map (substitute x e) es)
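
A quick check that shadowing behaves as intended (a hypothetical example): in let x = x in x, the x on the right-hand side refers to the outer binding and is rewritten, while the occurrence under the binder is left alone.

let () =
  assert (substitute "x" (Value (Nat 3)) (Let ("x", Var "x", Var "x"))
          = Let ("x", Value (Nat 3), Var "x"))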

let rec constfold_simple (e : expr) : expr = 
    match e with
    | Value v -> Value v
    | Var s -> Var s
    | Let (s, e1, e2) -> 
        let e1' = constfold_simple e1 in 
        let should_substitute (e : expr) : bool = 
            match e with 
            | Var _ | Value _ -> true
            | _ -> false
        in
        if should_substitute e1' 
        then constfold_simple (substitute s e1' e2)
        else Let (s, e1', constfold_simple e2)
    | ITE (e1, e2, e3) -> 
      (match e1 with 
      | Value (Bool true) -> constfold_simple e2 
      | Value (Bool false) -> constfold_simple e3 
      | _ -> ITE (constfold_simple e1, constfold_simple e2, constfold_simple e3))
    | App (o, es) ->
        (match o, es with
        | Add, [Value (Nat i); Value (Nat j)] -> Value (Nat (i + j))
        | Mul, [Value (Nat i); Value (Nat j)] -> Value (Nat (i * j))
        (* ... the remaining operator cases are as in constfold_simple above ... *)
        | _ -> App (o, List.map constfold_simple es))

With substitution in place, our running examples now optimize all the way down to values.
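
Here is roughly what we would see in utop (hypothetical output; the exact printer format may differ):

utop # optimize_with constfold_simple (Let ("x", App (Add, [Value (Nat 1); Value (Nat 2)]), Var "x"));;
- : expr = Value (Nat 3)

utop # optimize_with constfold_simple (Let ("x", App (Add, [Value (Nat 1); App (Add, [Value (Nat 2); Value (Nat 3)])]), Var "x"));;
- : expr = Value (Nat 6)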

Constant folding in StackLang

It might be tempting to ask if we can do the same thing with StackLang. We can!

Of course, we have been working with more than just arithmetic: we also have Boolean and comparison operations, as well as Jump instructions. Let's now write a more general const_fold that will optimize these operations:

let rec const_fold (prog : instr list) : instr list =
  match prog with
  | Push a :: Push b :: AppInstr Add :: rest -> const_fold (Push (a + b) :: rest)
  | Push a :: Push b :: AppInstr Mul :: rest -> const_fold (Push (a * b) :: rest)
  | Push a :: Push b :: AppInstr LT  :: rest -> const_fold (Push (if a < b then 1 else 0) :: rest)
  | Push a :: Push b :: AppInstr EQ  :: rest -> const_fold (Push (if a = b then 1 else 0) :: rest)
  | Push a :: Push b :: AppInstr And :: rest -> const_fold (Push (if a = 1 && b = 1 then 1 else 0) :: rest)
  | Push a :: Push b :: AppInstr Or  :: rest -> const_fold (Push (if a = 1 || b = 1 then 1 else 0) :: rest)
  ...

Notice that, at this point, we have rewritten the functionality of our eval_calc_op function! That said, there is more to do: we can also inline our branching statements. When we encounter a Push of 1 or 0 (true or false) just before a conditional jump, we know that we can eliminate the jump instructions by manipulating the rest of the program. For instance, when we see the program from 11.3:

CalcLang: if true then 10 else 20
StackLang: [Push 1; JumpIfNotOne 2; Push 10; Jump 1; Push 20]

Statically, we know that we can always optimize this program to Push 10. If the program instead started with a Push 0 instruction, we would know that we could always rewrite the program as Push 20. We can generalize this into the high-level heuristic, "an if-expression with a constant true guard will only ever execute the true branch, while one with a constant false guard will only ever execute the false branch." In terms of instruction lists, this is a little simpler.

Given a (pseudocode) pattern of [Push a; JumpIfNotOne n] @ body @ (Jump m :: rest), where body has length n - 1, so that JumpIfNotOne n would skip past both the body and the Jump:

  • if a = 1, then we only need to execute body @ (List.drop m rest).
  • if a <> 1, then we only need to run rest.

In the first rule, we take advantage of the fact that the Jump at the end of the true branch exists only to skip over the false branch: once the guard is known to be 1, we keep the body but drop both the Jump and the m instructions it would jump over. The a <> 1 case is more straightforward: the machine would jump straight past the body and the Jump, so only rest remains.

In OCaml, we can update our const_fold function to the following:

let rec const_fold (prog : instr list) : instr list =
  match prog with
  | Push a :: Push b :: AppInstr Add :: rest -> const_fold (Push (a + b) :: rest)
  | Push a :: Push b :: AppInstr Mul :: rest -> const_fold (Push (a * b) :: rest)
  | Push a :: Push b :: AppInstr LT  :: rest -> const_fold (Push (if a < b then 1 else 0) :: rest)
  | Push a :: Push b :: AppInstr EQ  :: rest -> const_fold (Push (if a = b then 1 else 0) :: rest)
  | Push a :: Push b :: AppInstr And :: rest -> const_fold (Push (if a = 1 && b = 1 then 1 else 0) :: rest)
  | Push a :: Push b :: AppInstr Or  :: rest -> const_fold (Push (if a = 1 || b = 1 then 1 else 0) :: rest)
  | Push 1 :: JumpIfNotOne n :: rest ->
    (* the guard is true: keep the n - 1 instructions of the true branch,
       then drop the final `Jump m` together with the m instructions it skips *)
    (match List.drop (n - 1) rest with
     | Jump m :: rest' -> const_fold (List.take (n - 1) rest @ List.drop m rest')
     | _ -> Push 1 :: JumpIfNotOne n :: const_fold rest)
  | Push _ :: JumpIfNotOne n :: rest ->
    (* the guard is not 1: skip straight past the true branch and its `Jump` *)
    const_fold (List.drop n rest)
  | hd :: tl -> hd :: const_fold tl
  | [] -> []

We can now see our more general const_fold at work:

utop # optimize_with const_fold [Push 1; JumpIfNotOne 2; Push 10; Jump 1; Push 20];;
- : instr list = [Push 10]

utop # optimize_with const_fold [Push 1; JumpIfNotOne 4; Push 10; Push 20; AppInstr Add; Jump 1; Push 20];;
- : instr list = [Push 30]

utop # optimize_with const_fold [Push 0; JumpIfNotOne 4; Push 10; Push 20; AppInstr Add; Jump 1; Push 20];;
- : instr list = [Push 20]