Lecture 14: Tuples and Memory Allocation
1 Computing with mutable data
Standard reminder: Every time we enhance our source language, we need to consider several things:
Its impact on the concrete syntax of the language
Examples using the new enhancements, so we build intuition of them
Its impact on the abstract syntax and semantics of the language
Any new or changed transformations needed to process the new forms
Executable tests to confirm the enhancement works as intended
2 Mutating pairs in our language
2.1 Syntax and examples
We’ll write pairs as
‹expr› ...
  | ( ‹expr› , ‹expr› )
  | fst ‹expr›
  | snd ‹expr›
  | set-fst ‹expr› ‹expr›
  | set-snd ‹expr› ‹expr›
As before, the first expression creates a pair, and the next two access the first or second element of that pair. The two new expression forms modify the first or second item of the pair given by their first subexpression, setting it to the value of the second subexpression. We need to decide what these expressions should return; let’s choose to make them return the pair itself.
let x = (4, 5) in set-fst x 10 # ==> (10, 5)
let y = (3, 2) in set-fst (set-snd y 8) 6 # ==> (6, 8)
Do Now!
Flesh out the semantics of these two new expressions. What new error conditions might there be?
We can add these expression forms to our AST easily enough. This time, the new expressions will be binary primitives:
type expr = ...
| EPair of expr * expr
type prim1 = ... | Fst | Snd
type prim2 = ... | Setfst | Setsnd
2.2 Compilation
Given our representation of pairs (from Adding pairs to our language), it’s relatively straightforward to compile these two new primitives. We’ll focus on set-fst; the other is nearly identical. We first compile both subexpressions, and load them into two registers, RAX and RDX. We ensure that the first argument is a pair by checking its tag, and if so, we untag the value by subtracting the tag. Now the content of the pair begins at [RAX], and the first item of the pair is located at offset zero from RAX: we simply move the second value into that location. However, our intended semantics is to return the pair value, which means we need to restore RAX to point to the pair: we must tag the value again.
mov RAX, <the pair>
mov RDX, <the new first value>
mov RCX, RAX ;; \
and RCX, 0x7 ;; | check to ensure
cmp RCX, 0x1 ;; | RAX is a pair
jne not_a_pair ;; /
sub RAX, 0x1 ;; untag the value into a raw pointer
mov [RAX + 8 * 0], RDX ;; perform the mutation
add RAX, 0x1 ;; tag the pointer back into a value
To compile set-snd, we do the same thing, except we use the offset address [RAX + 8 * 1].
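For concreteness, the only instruction that changes is the store:

mov [RAX + 8 * 1], RDX ;; mutate the second item of the pair instead of the first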
2.3 Type checking mutation
2.3.1 Hazards
To ensure that our programs continue to make sense even in the presence of mutation, we need to make sure that our updates do not modify the types of any components of pairs. Permitting this would lead to glaring type safety failures, as in the following simple example:
let x = (1, 2), # x : (Int * Int)
y = x, # y : (Int * Int)
also_x = set-fst x true in
(fst y) + (snd y) # fine -- after all, y : Int * Int, right??
Even if we kept track of the fact that we’ve modified the type of x, it’s pretty challenging to perfectly keep track of the fact that y is aliased to x and therefore that its type should have changed as well.1 (Some languages, such as Rust, track this aliasing information as part of their type system; programming in such languages is necessarily trickier.)
Another complication arises in languages with relationships between types. Consider declaring a variable at some interface type in Java, mutating that variable to hold an object of one class that implements the interface, and then mutating it again to hold an object of some other class that does too. Java’s type system ensures that we can only access the methods declared on the interface, and so the mutations do not break type soundness. These subtyping relationships between types can be added to our type system, but we do not currently have any types that could be related in this way.
We do, however, have polymorphic types; as soon as we have higher-order functions, this becomes a point worth revisiting... Unsurprisingly, polymorphism and mutation do not play nicely together either; this has led to common compromises such as the value restriction, which prevents references from unsafely being made polymorphic. (Note that in ML, tuples are considered values, but ref cells are not: in our language, now that pairs are mutable, we cannot consider them to be values either.)
2.3.2 Type-checking Approach
Our type checker has enough information to type-check this set-fst expression. Since the result of set-fst e1 e2 is the pair e1 itself, we simply type-check that expression against our current desired type. If it passes, we double-check that it is indeed a pair type, and then check e2 against the first component of that type. Type-checking set-snd is analogous:
\begin{equation*}\dfrac{\Gamma \vdash e_1 : \mathsf{\tau} \quad \tau = \alpha * \beta \quad \Gamma \vdash e_2 : \mathsf{\alpha}}{\Gamma \vdash (\operatorname{set-fst} e_1 e_2) : \mathsf{\tau}} \qquad\qquad \dfrac{\Gamma \vdash e_1 : \mathsf{\tau} \quad \tau = \alpha * \beta \quad \Gamma \vdash e_2 : \mathsf{\beta}}{\Gamma \vdash (\operatorname{set-snd} e_1 e_2) : \mathsf{\tau}}\end{equation*}
2.3.3 Type-inference Approach
The two typing rules above carry over directly to our inference algorithm. To infer a type for set-fst e1 e2, we infer some type \(\tau\) for the pair e1 and unify it with the pair type \(\alpha * \beta\) (for two fresh type variables), infer a type \(\gamma\) for e2, and unify \(\alpha\) with \(\gamma\). The result of type-inference is then the pair type \(\tau = \alpha * \beta\).
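As a rough OCaml sketch of that case (the constructor and helper names EPrim2, TyVar, TyTup, infer, unify, apply_subst, and compose_subst are illustrative here, not necessarily our exact signatures):

| EPrim2(Setfst, e1, e2, _) ->
  let (s1, t_pair) = infer env e1 in                       (* infer some type τ for the pair e1 *)
  let alpha = TyVar(gensym "alpha") in                     (* two fresh type variables *)
  let beta  = TyVar(gensym "beta") in
  let s2 = unify t_pair (TyTup [alpha; beta]) in           (* unify τ with α * β *)
  let (s3, t_new) = infer (apply_subst s2 env) e2 in       (* infer some type γ for e2 *)
  let s4 = unify (apply_subst s3 alpha) t_new in           (* unify α with γ *)
  (compose_subst [s1; s2; s3; s4], apply_subst s4 t_pair)  (* the result is the pair type τ *)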
3 Generalizing to tuples
Do Now!
What goes wrong for type-checking and/or type-inference of mutable general tuples?
The core ideas above carry through unchanged: tuples simply have product types with more components than mere pairs. The same problem arises as with immutable tuples, though: we don’t know the tuple sizes in advance. Therefore our tuple-setting expressions must also include the size of the tuple, just as tuple accesses did.
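One possible shape for the AST, purely as a sketch (the constructor names here are illustrative, not fixed): both the access and the mutation carry the index and the expected tuple size.

type expr = ...
  | ETuple of expr list                  (* (e1, e2, ..., en) *)
  | EGetItem of expr * int * int         (* tuple expression, index, tuple size *)
  | ESetItem of expr * int * int * expr  (* tuple expression, index, tuple size, new value *)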
4 Sequencing
Now that we have mutation, we need to reconsider the ergonomics of our language. It’s rare that assigning to a field of a tuple is the only thing we want to compute: more likely we want to mutate a field and then keep going with our computation. These mutations therefore fit better into our language as statements to be executed for their side effects, rather than as expressions to be evaluated for their answer. To achieve this, we might want to express the sequencing of multiple expressions, such that our program evaluates them all in order, but only returns the final result. We can add such concrete syntax easily:

‹expr› ...
  | ‹expr› ; ‹expr›
Do Now!
How might we implement support for this feature? Which phases of the compiler necessarily change, and which could we avoid changing?
We’ll start by adding an ESeq
constructor to our expressions. We have two
options for how to proceed from here.
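A minimal sketch of that constructor, with the shape suggested by its uses below (two subexpressions plus a tag):

type 'a expr = ...
  | ESeq of 'a expr * 'a expr * 'a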
4.1 Approach 1: explicit support
Since we said that our language should evaluate both subexpressions and then evaluate to the result of the second one,
e1 ; e2
intuitively should mean the same thing as
let DONT_CARE = e1 in e2
Let’s try to push this expression through ANFing. We might start by copying the code from ELet, and then just “leaving out” the binding for the intermediate variable:
let anf (p : tag program) : unit aprogram =
let rec helpC (e : tag expr) : (unit cexpr * (string * unit cexpr) list) =
match e with
| ESeq(e1, e2, tag) ->
let (e1_ans, e1_setup) = helpC e1 in
let (e2_ans, e2_setup) = helpC e2 in
(e2_ans, e1_setup @ [(* skip binding of DONT_CARE to e1_ans *)] @ e2_setup)
...
But unfortunately this translation is wrong: consider the program print(1); print(2). Since the arguments to both calls are immediate, there is no setup needed, so e1_setup = e2_setup = []. Then we return just the answer of e2, and have lost the context for evaluating e1! The solution is to require that e1 be translated via helpI, rather than helpC: by requiring that its answer be immediate, we guarantee that whatever work e1 needs is captured in its setup list, and we can then simply discard its immediate answer without binding it to anything.
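Here is one way the corrected case could look, in the style of the code above:

| ESeq(e1, e2, _) ->
  let (_e1_ans, e1_setup) = helpI e1 in (* NOTE: helpI, so e1's evaluation is captured in e1_setup *)
  let (e2_ans, e2_setup) = helpC e2 in
  (e2_ans, e1_setup @ e2_setup)         (* no binding needed: the immediate answer is simply dropped *)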
4.2 Approach 2: desugaring
Rather than create an explicit expression form, perhaps we could reuse the existing ELet form and use our intuitive definition as the actual definition. We can enhance our desugaring function from last time to look for uses of ESeq and rewrite them into uses of ELet. This approach says that sequencing is merely syntactic sugar, a convenience that makes programs nicer to write, but one that doesn’t add any new expressive power.
let desugar (p : tag program) : tag program =
let rec helpE (e : tag expr) =
match e with
| ESeq(e1, e2, tag) ->
let newname = gensym "DONT_CARE" in
ELet([(newname, helpE e1, tag)], helpE e2, tag)
| ...
This pass must recur throughout the program, as all of our passes do, and
rewrite any sequence expressions it encounters. The advantage of this approach
is that our subsequent passes can completely ignore the new sequencing
expression form (they can throw an InternalCompilerException
instead), and
therefore no other part of our language is affected.
We can clean this up still further, by adding a new kind of binding form, BBlank, that acts like the underscore binding in ML: it indicates that we want to evaluate the right-hand expression, but not bother to bind it to any name.
type 'a bind =
| BName of string * 'a typ * 'a
| BTuple of 'a bind list * 'a
| BBlank of 'a typ * 'a
type 'a binding = ('a bind * 'a expr * 'a)
Our desugaring now will generate a BBlank
binding:
let desugar (p : tag program) : tag program =
let rec helpE (e : tag expr) =
match e with
| ESeq(e1, e2, tag) ->
ELet([(BBlank(TyBlank tag, tag), helpE e1, tag)], helpE e2, tag)
| ...
Lastly, we need to add support to ANF to handle these new kinds of bindings,
taking note of the difficulty we encountered above and being careful to use
helpI
as needed.
let anf (p : tag program) : unit aprogram =
let rec helpC (e : tag expr) : (unit cexpr * (string * unit cexpr) list) =
match e with
| ELet((BName(name, _, _), exp, _)::rest, body, pos) ->
let (exp_ans, exp_setup) = helpC exp in
let (body_ans, body_setup) = helpC(ELet(rest, body, pos)) in
(body_ans, exp_setup @ [(name, exp_ans)] @ body_setup)
| ELet((BBlank _, exp, _)::rest, body, pos) ->
let (exp_ans, exp_setup) = helpI exp in (* NOTE: use of helpI *)
let (body_ans, body_setup) = helpC(ELet(rest, body, pos)) in
(body_ans, exp_setup @ body_setup) (* NOTE: no binding for exp_ans *)
| ...
and helpI (e : tag expr) : (unit immexpr * (string * unit cexpr) list) =
match e with
| ELet((BName(bind, _, _), exp, _)::rest, body, pos) ->
let (exp_ans, exp_setup) = helpC exp in
let (body_ans, body_setup) = helpI (ELet(rest, body, pos)) in
(body_ans, exp_setup @ [(bind, exp_ans)] @ body_setup)
| ELet((BBlank(_, _), exp, _)::rest, body, pos) ->
let (exp_ans, exp_setup) = helpI exp in (* NOTE: use of helpI *)
let (body_ans, body_setup) = helpI (ELet(rest, body, pos)) in
(body_ans, exp_setup @ body_setup) (* NOTE: no binding for exp_ans *)
| ...
...
Do Now!
Go back through the standard recipe for adding a feature to our language, and add support for underscore-bindings as needed. Where else might underscores appear, if we treat them as bindings?
This effectively combines the two approaches above: we desugar ESeq to ELet using a BBlank binding, and then our general-purpose ANF transformation handles the ignorable underscore-binding and produces exactly the results we want, while letting us use underscores flexibly throughout our code.
Exercise
Unfortunately, using either of these two approaches, the rest of our compiler is unchanged, and is correct, yet programs that use sequencing might still not compile. Why? What might be done about this? Is it a failure in the design of sequencing, or a consequence of multiple language features interacting?
5 Recursive types, nil, and “real” programs
Now that we have mutable tuples, and tuples can contain other tuples, we have the ability to work with arbitrarily large data. However, our static type system is too weak to allow us to do this! Suppose we wanted to encode a simple linked list of integers. We need some value to indicate the end of the list, but what should its type be?
In the untyped (or dynamically-typed) case, we could encode a linked list as pairs whose first items are the elements of the list, whose second items are the rest of the list, and where the end of the list is false. For instance:
def length(l):
  if istuple(l): 1 + length(l[1 of 2])
  else: 0
But there is currently no way for us to type-check this function. We would have to specify the type of l as a nested tuple type, and then that type would implicitly have a finite size. We need a recursive type to describe arbitrarily long lists. But such types need a base case, and false is already a boolean, so we need some new value to mark our base.
We will resolve this for our language by introducing two features at once: recursive type synonyms, and a unique new value nil. By way of example:
type intlist = (int * intlist)
def length(l : intlist):
  if l == (nil : intlist): 0
  else: 1 + length(l[1 of 2])
We’ll define type synonyms as giving a name to a particular tuple type, where the name itself is in scope and permitted to be used within that tuple. We’ll collect these type synonyms in an environment, for use during type checking or type inference.
Additionally, we will add a typing rule that says nil is permitted to have any tuple type at all. (However, to keep type checking and type inference easier, we will mandate that all uses of nil are syntactically annotated with their intended type.)
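Written as a rule in the style above, where the annotation is part of the expression:

\begin{equation*}\dfrac{\tau\ \text{is a tuple type}}{\Gamma \vdash (\mathsf{nil} : \tau) : \tau}\end{equation*}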
To represent nil, we will simply use C’s NULL pointer, but tag it as a tuple. This will allow us to easily evaluate istuple(nil) == true, while also ensuring that nil is guaranteed to be distinct from any valid tuple value in our language.
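Concretely, with the tuple tag 0x1 used above, nil is just the machine word 0x1 (the NULL pointer 0x0 with the tag added), so every tuple access now needs a check roughly like the following before untagging (the error label here is illustrative):

cmp RAX, 0x1          ;; is this value exactly nil, i.e. NULL tagged as a tuple?
je  error_nil_deref   ;; if so, report an error rather than dereference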
To use type synonyms in our language, we will simply add a case to unify: if we encounter a TyCon during unification, we look it up in our environment of type synonyms, and continue unification with the expanded version of the type. This rule is suspiciously simple, and in a more complicated system than ours it would not suffice (and might cause the type inference algorithm to loop forever), but for our purposes it works just fine.
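As a rough OCaml sketch of that extra case (assuming, purely for illustration, that TyCon carries a name and a tag, and that the synonym environment is an association list):

let rec unify (t1 : 'a typ) (t2 : 'a typ) (synonyms : (string * 'a typ) list) =
  match (t1, t2) with
  | (TyCon(name, _), other) | (other, TyCon(name, _)) when List.mem_assoc name synonyms ->
    (* expand the synonym and continue unifying against its definition *)
    unify (List.assoc name synonyms) other synonyms
  | ... (* all the existing cases, unchanged *)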
Do Now!
Now that we’ve added nil, what new problems do we have to handle in our compilation of tuples everywhere? Do it.
We know that nil is a very expensive mistake, and that better solutions exist. A full treatment of algebraic data types, such as OCaml’s type definitions, would allow us to support recursive type definitions without resorting to introducing nil values. But that approach is beyond the scope of this course.
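For comparison, an OCaml-style datatype needs no nil, because the definition itself supplies the base case:

type intlist =
  | Empty                    (* the base case, playing the role of nil *)
  | Link of int * intlist    (* one element followed by the rest of the list *)

let rec length (l : intlist) : int =
  match l with
  | Empty -> 0
  | Link(_, rest) -> 1 + length rest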
Exercise
Now that we have recursive, mutable data, write a few examples of programs that do something interesting, like sorting a list in place, or building a binary tree, etc. Rejoice in your newfound mastery over simple programs! What features should we add to our language next?
1This is known as a may-alias problem, and aliasing analyses of various precision are at the core of most optimizing compilers.