Lecture 8: Revisiting ANF:   Encoding A-Normal Form with Types
1 Defining ANF expressions
2 Defining anf using these types
2.1 help  A:   obtaining full expressions
2.2 help  C:   obtaining compound expressions
2.3 help  I:   obtaining immediate expressions
3 Benefits of these types
7.5

Lecture 8: Revisiting ANF: Encoding A-Normal Form with Types

Our prior implementations of ANF conversion were functions of the form anf : tag expr -> unit expr, that produced a value of the same expr data type, but one that happened to satisfy the is_anf predicate we defined. This was certainly expedient:

But it has downsides, primarily that we lose a lot of the power and guarantees that ML’s type system could provide us. If we define our types correctly, ML can assure us that

Let’s try to define these types, then revisit the transformation functions.

1 Defining ANF expressions

Recall our definition of is_anf:

let rec is_anf (e : 'a expr) : bool =
  match e with
  | EPrim1(_, e, _) -> is_imm e
  | EPrim2(_, e1, e2, _) -> is_imm e1 && is_imm e2
  | ELet(binds, body, _) ->
     List.for_all (fun (_, e, _) -> is_anf e) binds
     && is_anf body
  | EIf(cond, thn, els, _) -> is_imm cond && is_anf thn && is_anf els
  | _ -> is_imm e
and is_imm e =
  match e with
  | ENumber _ -> true
  | EId _ -> true
  | EBool _ -> true
  | _ -> false
;;

The goal was to ensure that every primitive operation only received immediate operands, which were anything whose value was available without any further computation: namely, constants and identifiers. Now, if we try to define a new expression type for ANF expressions, we’ll want to encode that invariant in the types.

Exercise

Try to define a new expression type, 'a aexpr, that describes this ANF invariant.

It’s easiest to start from the immediate values:

type 'a immexpr = (* immediate expressions *)
  | ImmNum of int * 'a
  | ImmBool of bool * 'a
  | ImmId of string * 'a

The next level up from immediate expressions are compound expressions, which will include basically everything else:

(* Attempt #1 *)
type 'a comp_expr =
  | CImmExpr of 'a immexpr                          (* base cases *)
  | CPrim1 of prim1 * 'a immexpr * 'a               (* argument must be immediate *)
  | CPrim2 of prim2 * 'a immexpr * 'a immexpr * 'a  (* arguments must be immediate *)
  | CIf of 'a immexpr * 'a comp_expr * 'a comp_expr (* condition must be immediate *)
  | CLet of (string * 'a comp_expr) list * 'a comp_expr * 'a    (* ugh. *)

This data definition correctly encodes our stated ANF property, but that CLet definition is still pretty unwieldy. In particular, it permits multiple bindings within a single CLet (because of the list), and it allows let-bindings to appear in the binding-expressions themselves. This in turn means that our binding expressions can be arbitrarily complex, which would be nice to avoid, if possible.

We can correct the multiple bindings part just by removing the list, but correcting the other flaw requires a separation of comp_expr into two distinct, mutually recursive types:

type 'a cexpr = (* compound expressions *)
  | CIf of 'a immexpr * 'a aexpr * 'a aexpr * 'a
  | CPrim1 of prim1 * 'a immexpr * 'a
  | CPrim2 of prim2 * 'a immexpr * 'a immexpr * 'a
  | CImmExpr of 'a immexpr
and 'a aexpr = (* anf expressions *)
  | ALet of string * 'a cexpr * 'a aexpr * 'a
  | ACExpr of 'a cexpr

Note in particular that ALet accepts binding expressions that are restricted to cexpr, which cannot itself be a let-binding. Also notice that CIf expressions contain branches that are aexprs: the branches might well need let-bindings, but we don’t want to execute any of the code for the branch that isn’t taken. So we can’t lift all the let-bindings out of the if-expressions; we need to permit them to remain inside its branches.

2 Defining anf using these types

Exercise

Translate your definition of anf to produce values of type unit aexpr, instead of unit expr. Hint: one of the two versions will be a much easier starting point than the other...

From Lecture 4: Conditionals and A-Normal Form, we have an attempted signature of

let anf_without_types (e : tag expr) : unit expr =
  let rec help (e : tag expr) : (unit expr * (string * unit expr) list) =
    ...
  in ...

The intuition behind the return value of help was to return an answer expression that was an immediate, followed by a binding-list of ANF’ed expressions that formed the context in which the answer made sense. We can translate this into our new types as

let anf (e : tag expr) : unit aexpr =
  let rec help (e : tag expr) : (unit immexpr * (string * unit aexpr) list) =
    ...
  in ...

But this type is overly strict. Consider the complete program 3 + 5. In order to call help on this expression, we have to return an unit immexpr, which means we need to effectively produce a translated program of let temp = 3 + 5 in temp. But nothing in our definition of ANF requires that the final expression of our program be an immediate! We merely require that all operators have immediate operands. So we need at least one helper with a less-stringent signature, to allow us to return non-immediate answers. Fortunately, we now have an intermediate type, between “all ANF expressions” ('a aexpr) and “all immediate expressions” ('a immexpr), namely compound expressions ('a cexpr).

Note that similar analyses holds for the branches of a conditional, and the right-hand side of a let-binding. These inform our choices of where to use 'a aexpr and where to use 'a cexpr in our type definitions above.

Exercise

Complete this analysis, and justify to yourself why the data definitions above have the desired properties.

So, now that we have three types for our ANF expressions instead of just one, we’ll need three mutually recursive functions:

let anf (e : tag expr) : unit aexpr =
  let rec helpA (e : tag expr) : unit aexpr = ...
  and     helpC (e : tag expr) : (unit cexpr * (string * unit cexpr) list) = ...
  and     helpI (e : tag expr) : (unit immexpr * (string * unit cexpr) list) = ...
  in
  helpA e
;;
Again, our transformation kept track of a pair of the “answer” expression we wanted, and a list of “setup bindings” that were needed for the answer to make sense. Accordingly, helpC and helpI will have similar signatures, but their answers will be compound or immediate, respectively. By contrast, helpA produces a complete ANF expression, so its return type has no need for a setup context.

2.1 helpA: obtaining full expressions

The helpA function will be the only one that actually needs to collapse the setup lists into nested ALet expressions. To obtain such a setup list, it will need to call either helpC or helpI on its argument. A single example suffices to show which choice is better:

(* The highlighted expressions are pretty-printed exprs, not ML *)
helpI ~hl:1:s~(1 + 2)~hl:1:e~ ===> ~hl:1:s~let temp = 1 + 2 in temp~hl:1:e~

helpC ~hl:1:s~(1 + 2)~hl:1:e~ ===> ~hl:1:s~1 + 2~hl:1:e~

Since any cexpr can be treated directly as an aexpr (by using ACExpr) without introducing any unnecessary temp variables, we’d rather have helpA call helpC. Once it has done so, it obtains an answer and a setup list. We need to choose an ordering on that list that we’ll use to define order of execution. The easiest choice to read is to let the first item in the list be the first binding to execute. Accordingly, helpA needs to fold its setup list around the answer, starting from the right:

let rec anfA e : unit aexpr =
  let (ans, ans_setup) = helpC e in
  List.fold_right                                        (* from right to left, *)
    (fun (bind, exp) body -> ALet(bind, exp, body, ()))  (* wrap each binding around the body *)
    ans_setup                                            (* from the list of setup bindings *)
    (ACExpr ans)                                         (* starting from the answer *)

Do Now!

Confirm that this definition of helpA will produce the claimed answer above. Confirm that using helpI instead would produce the other answer above.

2.2 helpC: obtaining compound expressions

We sketch just some of the cases for helpC; complete the others as an exercise.

and helpC (e : tag expr) : (unit cexpr * (string * unit cexpr) list) =
  match e with
  | EPrim1(op, arg, _) ->
     (* Convert the argument to an immediate *)
     let (arg_imm, arg_setup) = helpI arg in
     (* Produce the appropriate cexpr, along with its context *)
     (CPrim1(op, arg_imm, ()), arg_setup)
  | EPrim2 _
  | EIf _
  | ELet _ -> (* complete these *)
  | _ -> let (imm, setup) = helpI in (CImmExpr imm, setup) (* just wrap the immediate value *)

2.3 helpI: obtaining immediate expressions

We sketch just some of the cases for helpI; complete the others as an exercise.

and helpI (e : tag expr) : (unit immexpr * (string * unit cexpr) list) =
  match e with
  | ENumber(n, _) -> (ImmNum(n, ()), [])  (* Immediate values are immediate *)
  | EBool(b, _) -> (ImmBool(b, ()), [])
  | EId(name, _) -> (ImmId(name, ()), [])

  | EPrim1(op, arg, tag) ->
     (* We can't just return CPrim, so create a temp variable *)
     let tmp = sprintf "unary_%d" tag in
     (* Convert the argument *)
     let (arg_imm, arg_setup) = helpI arg in
     (* Return the **temp variable**, but include the CPrim as a new setup binding *)
     (ImmId(tmp, ()), arg_setup @ [(tmp, CPrim1(op, arg_imm, ()))])
  | EPrim2 _
  | EIf _
  | ELet _ -> (* complete these *)

3 Benefits of these types

Simply by compiling this code, ML’s totality checking will guarantee for us that we’ve covered all the expr cases, in each match expression of helpA, helpC and helpI, which means our translation didn’t miss any expression forms. This was true of the simpler transformation last time, too. However, all subsequent compilation functions that now work over aexprs also get guarantees that we haven’t missed any cases. With the previous code, our compile function would look something like

let rec compile (e : tag expr) si env : instruction list =
 match e with
 | ... (* various cases *)
 | _ -> failwith "Not in ANF"

This last catch-all case basically precludes ML from telling us we missed a valid ANF expression, which means we must resort to a thorough test suite to check whether we ever hit that case unexpectedly. By contrast, every expression of type aexpr is, by definition, a valid ANF expression, so we can avoid this catch-all, and get compile-time checking back. For a language as small as ours currently is, this benefit is admittedly small, but as languages grow to dozens of expression forms or more, having this totality checking is a very useful safety net.