Lecture 14: Tuples and Memory Allocation
1 Computing with mutable data
2 Mutating pairs in our language
2.1 Syntax and examples
2.2 Compilation
2.3 Type checking mutation
2.3.1 Hazards
2.3.2 Type-checking Approach
2.3.3 Type-inference Approach
3 Generalizing to tuples
4 Sequencing
4.1 Approach 1:   explicit support
4.2 Approach 2:   desugaring
5 Recursive types, nil, and “real” programs
7.9

Lecture 14: Tuples and Memory Allocation

1 Computing with mutable data

Standard reminder: Every time we enhance our source language, we need to consider several things:

  1. Its impact on the concrete syntax of the language

  2. Examples using the new enhancements, so we build intuition of them

  3. Its impact on the abstract syntax and semantics of the language

  4. Any new or changed transformations needed to process the new forms

  5. Executable tests to confirm the enhancement works as intended

2 Mutating pairs in our language

2.1 Syntax and examples

We’ll write pairs as

‹expr›: ... | ( ‹expr› , ‹expr› ) | fst ‹expr› | snd ‹expr› | set-fst ‹expr› ‹expr› | set-snd ‹expr› ‹expr›

As before, the first expression creates a pair, and the next two access the first or second elements of that pair. The two new expression forms allow us to modify the first or second item of the pair given by the first subexpression, and set it to the value of the second subexpression. We need to decide on what these expressions will return; let’s choose to make them return the tuple itself.

let x = (4, 5) in set-fst x 10             # ==> (10, 5)
let y = (3, 2) in set-fst (set-snd y 8) 6  # ==> (6, 8)

Do Now!

Flesh out the semantics of these two new expressions. What new error conditions might there be?

We can add these expression forms to our AST easily enough. This time, the new expressions will be binary primitives:

type expr = ...
  | EPair of expr * expr
type prim1 = ... | Fst | Snd
type prim2 = ... | Setfst | Setsnd

2.2 Compilation

Given our representation of pairs (from Adding pairs to our language), it’s relatively straightforward to compile these two new primitives. We’ll focus on set-fst; the other is nearly identical. We first compile both subexpressions, and load them into two registers RAX and RDX. We ensure that the first argument is a pair by checking its tag, and if so then we untag the value by subtracting the tag. Now the content of the pair begins at [RAX], and the first item of the pair is located at offset zero from RAX: we simply move the second value into that location. However, our intended semantics is to return the pair value, which means we need to restore RAX to point to the pair: we must tag the value again.

mov RAX, <the pair>
mov RDX, <the new first value>
mov RCX, RAX           ;; \
and RCX, 0x7           ;; | check to ensure
cmp RCX, 0x1           ;; | RAX is a pair
jne not_a_pair         ;; /
sub RAX, 0x1           ;; untag the value into a raw pointer
mov [RAX + 8 * 0], RDX ;; perform the mutation
add RAX, 0x1           ;; tag the pointer back into a value

To compile set-snd, we do the same thing except use an offset address of [RAX + 8 * 1].

2.3 Type checking mutation

2.3.1 Hazards

To ensure that our programs continue to make sense even in the presence of mutation, we need to make sure that our updates do not modify the types of any components of pairs. Permitting this would lead to glaring type safety failures, as in the following simple example:

let x = (1, 2), # x : (Int * Int)
    y = x,      # y : (Int * Int)
    also_x = set-fst x true in
(fst y) + (snd y) # fine -- after all, y : Int * Int, right??

Even if we kept track of the fact that we’ve modified the type of x, it’s pretty challenging to perfectly keep track of the fact that y is aliased to x and therefore that its type should have changed as well.1This is known as a may-alias problem, and aliasing analyses of various precision are at the core of most optimizing compilers. (Some languages, such as Rust, track this aliasing information as part of their type system; programming in such languages is necessarily trickier—because the compiler demands more from the programmer—but provides benefits as well.) If we could somehow keep track of the fact that no-one currently aliases our value, we might be able to perform a strong update that changes the type of our pair, but for now, the effort simply isn’t worth it.

Another complication arises in languages with relationships between types. Consider declaring a variable at some interface type in Java, and then mutating that variable to hold an object of some class that implements the interface, and then mutating it again to hold an object of some other class. Java’s type system ensures that we can only access the methods declared on the interface, and so the mutations do not break type soundness. These subtyping relationships between types can be added to our type system, but we do not currently have any types that could be related in this way.

We do however have polymorphic types; as soon as we have higher-order functions, this becomes a point worth revisiting... Unsurprisingly, polymorphism and mutation do not play nicely together either; this has led to common compromises such as the value restriction to prevent references from unsafely being made polymorphic. (Note that in ML, tuples are considered values, but ref cells are not: in our language, now that pairs are mutable, we cannot consider them to be values either.)

2.3.2 Type-checking Approach

Our type checker has enough information to type-check this set-fst expression. Since the result of set-fst e1 e2 is the result of e1, we simply type-check that expression with our current desired type. If it passes, we double-check that it is indeed a pair type, and then check e2 with the first component of that type. Type-checking set-snd is analogous:

\begin{equation*}\dfrac{\Gamma \vdash e_1 : \mathsf{\tau} \quad \tau = \alpha * \beta \quad \Gamma \vdash e_2 : \mathsf{\alpha}}{\Gamma \vdash (\operatorname{set-fst} e_1 e_2) : \mathsf{\tau}} \qquad\qquad \dfrac{\Gamma \vdash e_1 : \mathsf{\tau} \quad \tau = \alpha * \beta \quad \Gamma \vdash e_2 : \mathsf{\beta}}{\Gamma \vdash (\operatorname{set-snd} e_1 e_2) : \mathsf{\tau}}\end{equation*}

2.3.3 Type-inference Approach

The last two typing rules above work directly for our inference algorithm. To infer a type for set-fst e1 e2, we infer some type \(\tau\) for the pair e1 and unify it with the pair type \(\alpha * \beta\) (for two fresh type variables), infer a type \(\gamma\) for e2, and unify \(\alpha\) with \(\gamma\). Then the result of type-inference is the pair-type \(\tau = \alpha * \beta\).

3 Generalizing to tuples

Do Now!

What goes wrong for type-checking and/or type-inference of mutable general tuples?

The core ideas above carry through unchanged, in that tuples will have product types with more components in them than mere pairs, and the same problems arise as with immutable tuples: we don’t know the tuple sizes in advance. Therefore our tuple-setting expressions also must include the size of the tuple, just as tuple-accesses did.

4 Sequencing

Now that we have mutation, we need to reconsider the ergonomics of our language. It’s rare that assigning to a field of a tuple should be the only thing we want to compute: we likely want to mutate a field and keep going with our computation. These mutations therefore fit better into our language as statements to be executed for their side-effects, rather than as expressions to be evaluated for their answer. To achieve this, we might want to express the sequencing of multiple expressions, such that our program evaluates them all in order, but only returns the final result. We can add such concrete syntax easily:

‹expr›: ... | ‹expr› ; ‹expr›

Do Now!

How might we implement support for this feature? Which phases of the compiler necessarily change, and which could we avoid changing?

We’ll start by adding an ESeq constructor to our expressions. We have two options for how to proceed from here.

4.1 Approach 1: explicit support

Since we said that our language should evaluate both subexpressions and then evaluate to the result of the second one,

e1 ; e2

intuitively should mean the same thing as

let DONT_CARE = e1 in e2

Let’s try to push this expression through ANFing. We might start by copying the code from ELet, and then just “leaving out” the binding for the intermediate variable:

let anf (p : tag program) : unit aprogram =
  let rec helpC (e : tag expr) : (unit cexpr * (string * unit cexpr) list) =
    match e with
    | ESeq(e1, e2, tag) ->
      let (e1_ans, e1_setup) = helpC e1 in
      let (e2_ans, e2_setup) = helpC e2 in
      (e2_ans, e1_setup @ [(* skip binding of DONT_CARE to e1_ans *)] @ e2_setup)
    ...

But unfortunately this translation is wrong: consider the program print(1); print(2). Since the arguments to both calls are immediate, there is no setup needed, so e1_setup = e2_setup = []. Then we return just the answer of e2, and have lost the context for evaluating e1! The solution is to require that e1 be translated via helpI, rather than helpC: by requiring that our answer be immediate, we guarantee that the expression has been fully evaluated and the answer is placed into RAX; we can then neglect to save that value anywhere.

4.2 Approach 2: desugaring

Rather than create an explicit expression form, perhaps we could reuse the existing ELet form and use our intuitive definition as the actual definition. We can enhance our desugaring function from last time to look for uses of ESeq and rewrite them into uses of ELet. This approach says that sequencing is merely syntactic sugar, a convenience that makes programs nicer to write, but that doesn’t add any new expressive power.

let desugar (p : tag program) : tag program =
  let rec helpE (e : tag expr) =
    match e with
    | ESeq(e1, e2, tag) ->
      let newname = gensym "DONT_CARE" in
      ELet([(newname, helpE e1, tag)], helpE e2, tag)
    | ...

This pass must recur throughout the program, as all of our passes do, and rewrite any sequence expressions it encounters. The advantage of this approach is that our subsequent passes can completely ignore the new sequencing expression form (they can throw an InternalCompilerException instead), and therefore no other part of our language is affected.

We can clean this up still further, by adding a new kind of binding form, BBlank, that acts like the underscore binding in ML: it indicates that we want to evaluate the right-hand expression, but not bother to bind it to any name.

type 'a bind =
  | BName of string * 'a typ * 'a
  | BTuple of 'a bind list * 'a
  | BBlank of 'a typ * 'a
type 'a binding = ('a bind * 'a expr * 'a)

Our desugaring now will generate a BBlank binding:

let desugar (p : tag program) : tag program =
  let rec helpE (e : tag expr) =
    match e with
    | ESeq(e1, e2, tag) ->
      ELet([(BBlank(TyBlank tag, tag), helpE e1, tag)], helpE e2, tag)
    | ...

Lastly, we need to add support to ANF to handle these new kinds of bindings, taking note of the difficulty we encountered above and being careful to use helpI as needed.

let anf (p : tag program) : unit aprogram =
  let rec helpC (e : tag expr) : (unit cexpr * (string * unit cexpr) list) =
    match e with
    | ELet((BName(name, _, _), exp, _)::rest, body, pos) ->
      let (exp_ans, exp_setup) = helpC exp in
      let (body_ans, body_setup) = helpC(ELet(rest, body, pos)) in
      (body_ans, exp_setup @ [(name, exp_ans)] @ body_setup)
    | ELet((BBlank _, exp, _)::rest, body, pos) ->
      let (exp_ans, exp_setup) = helpI exp in (* NOTE: use of helpI *)
      let (body_ans, body_setup) = helpC(ELet(rest, body, pos)) in
      (body_ans, exp_setup @ body_setup) (* NOTE: no binding for exp_ans *)
    | ...
  and helpI (e : tag expr) : (unit immexpr * (stirng * unit cexpr) list) =
    match e with
    | ELet((BName(bind, _, _), exp, _)::rest, body, pos) ->
       let (exp_ans, exp_setup) = helpC exp in
       let (body_ans, body_setup) = helpI (ELet(rest, body, pos)) in
       (body_ans, exp_setup @ [(bind, exp_ans)] @ body_setup)
    | ELet((BBlank(_, _), exp, _)::rest, body, pos) ->
       let (exp_ans, exp_setup) = helpI exp in (* NOTE: use of helpI *)
       let (body_ans, body_setup) = helpI (ELet(rest, body, pos)) in
       (body_ans, exp_setup @ body_setup) (* NOTE: no binding for exp_ans *)
    | ...
  ...

Do Now!

Go back through the standard recipe for adding a feature to our language, and add support for underscore-bindings as needed. Where else might underscores appear, if we treat them as bindings?

This effectively combines the two approaches above: we desugar ESeq to ELet using a BBlank binding, then our general-purpose ANF transformation handles the ignorable underscore-binding and produces exactly the results we want, in a general manner that allows us to use underscores in a flexible manner throughout our code.

Exercise

Unfortunately, using either of these two approaches, the rest of our compiler is unchanged, and is correct, yet programs that use sequencing might still not compile. Why? What might be done about this? Is it a failure in the design of sequencing, or a consequence of multiple language features interacting?

5 Recursive types, nil, and “real” programs

Now that we have mutable tuples, and tuples can contain other tuples, we have the ability to work with arbitrarily large data. However, our static type system is too weak to allow us to do this! Suppose we wanted to encode a simple linked list of integers. We need some value to indicate the end of the list, but what should its type be?

In the untyped (or dynamically-typed) case, we could encode a linked list as pairs whose first items are the elements of the list, whose second items are the rest of the list, and where the end of the list is false. For instance:

def length(l):
  if istuple(l): 1 + length(l[1 of 2])
  else 0

But there currently is no way for us to type-check this function. We would have to specify the type of l as a nested tuple type, and then that type would implicitly have a finite size. We need a recursive type to indicate arbitrarily long lists. But such types need a base case, and false already is a boolean, so we need some new value to indicate our base.

We will resolve this for our languages by introducing two features at once: recursive type synonyms, and a unique new value nil. By way of examples:

type intlist = (int * intlist)

def length(l : intlist):
  if l == (nil : intlist): 0
  else: 1 + length(l[1 of 2])

We’ll define type synonyms as giving a name to a particular tuple type, where the name itself is in scope and permitted to be used within that tuple. We’ll collect these type synonyms in an environment, for use during type checking or type inference.

Additionally, we will add a typing rule that says nil is permitted to have any tuple type at all. (However, to keep type checking and type inference easier, we will mandate that all uses of nil are syntactically annotated with their intended type.)

To represent nil, we will simply choose a NULL pointer from C, but tag it as a tuple. This will allow us to easily evaluate istuple(nil) == true, while also ensuring that nil is guaranteed to be distinct from any valid tuple value in our language.

To use type synonyms in our language, we will simply add a case to unify: if we encounted a TyCon during unification, we look it up in our environment of type synonyms, and continue unification with the expanded version of the type. This rule is suspiciously simple, and in a more complicated system than ours, it would not suffice (and might cause the type inference algorithm to loop forever), but for our purposes it works just fine.

Do Now!

Now that we’ve added nil, what new problems do we have to handle in our compilation of tuples everywhere? Do it.

We know that nil is a very expensive mistake, and that better solutions exist. A full treatment of algebraic data types, such as OCaml’s type definitions, would allow us to support recursive type definitions without resorting to introducing nil values. But that approach is beyond the scope of this course.

Exercise

Now that we have recursive, mutable data, write a few examples of programs that do something interesting, like sorting a list in place, or building a binary tree, etc. Rejoice in your newfound mastery over simple programs! What features should we add to our language next?

1This is known as a may-alias problem, and aliasing analyses of various precision are at the core of most optimizing compilers.