Assignment 8: Egg-eater:   Tuples
1 Language
1.1 Syntax Additions and Semantics
1.2 Desugaring away unnecessary complexity
1.3 Semantics and Representation of Tuples
1.3.1 Tuple Heap Layout
1.3.2 Accessing Tuple Contents
1.3.3 General Heap Layout
1.3.4 Interaction with Existing Features
2 Approaching Reality
3 Recommended TODO List
4 List of Deliverables
5 Grading Standards
6 Submission
7.2

Assignment 8: Egg-eater: Tuples

Due: Wed 03/20 at 8:59pm

git clone

In this assignment you’ll extend Taipan to implement mutable tuples, whose syntax looks vaguely egg-shaped, if you don’t think about it too much...

(egg)

1 Language

Egg-eater starts with the same semantics as Taipan, and adds support for

The runtime system must add support for

This is a large assignment, and its pieces are tightly interconnected. Read through the whole assignment below carefully, then take note of the recommended TODO list at the bottom for a suggested order to tackle these pieces.

You will enhance whichever type system you started in Taipan to include support for tuples. Other than that preexisting distinction, this assigment is identical for CS4410 and CS6410 students.

1.1 Syntax Additions and Semantics

The main addition in Egg-eater is tuple expressions, along with accessor expressions for getting or setting the contents of tuples, and a unary primitive for checking if a value is a tuple. Tuple expressions are a series of zero or more comma-separated expressions enclosed in parentheses. (The syntax of one-element tuples is slightly odd, and requires a trailing comma; this is to avoid confusion with our existing use of parentheses for grouping infix expressions.) A tuple access expression is an expression followed by an integer literal enclosed in square brakcets, which expresses which field to be accessed. Finally, istuple is a primitive (like isnum and isbool) that checks for tuple-ness.

‹expr›: ... | ‹tuple› | ‹tuple-get› | ‹tuple-set› | istuple ( ‹expr› ) | ‹expr› ; ‹expr› ‹tuple›: | ( ) | ( ‹expr› , ) | ( ‹expr› , ‹expr› , ‹expr› , ... ‹expr› ) ‹tuple-get›: | ID [ NUM OF NUM ] | ( ID : ‹typ› ) [ NUM OF NUM ] | ‹tuple-get› [ NUM OF NUM ] ‹tuple-set›: | ID [ NUM OF NUM := ‹expr› ] | ( ID : ‹typ› ) [ NUM OF NUM := ‹expr› ] | ‹tuple-set› [ NUM OF NUM := ‹expr› ] ‹bind›: ... | _ | ( ‹binds› ) ‹binds›: | ‹bind› | ‹bind› , ‹binds› ‹typ›: ...

For example, we can create three tuples and access their fields:

let unit = () in
let one = (1,) in
let three = (3, 4, 5) in
three[0 of 3]

A tuple-set expression evaluates both arguments, updates the tuple at the appropriate index, and returns the entire tuple value as its result. We can therefore chain tuple-set expressions together, and write

let three = (0, 0, 0) in
three[0 of 3 := 1][1 of 3 := 2][2 of 3 := 3]
# Now three equals (1, 2, 3)

let pair = (0, 0) in
pair[0 of 2 := three[1 of 3 := 10]]
# Now three equals (1, 10, 3) and pair equals (1, three, 3)

We can also actively destructure tuples when we bind them:

let t = (3, ((4, true), 5)) in
let (x, (y, z)) = t
x + y[0 of 2] + z

Notice that we can destructure tuples “all the way down”, or “stop early”: in this case, y is bound to the tuple (4, true), or said another way, y == t[1 of 2][0 of 2].

In the expr datatype, these are represented as:

type 'a expr =
  ...
  | ETuple of 'a expr list * 'a
  | EGetItem of 'a expr * int * int * 'a
  | ESetItem of 'a expr * int * int * 'a expr * 'a
  | ESeq of 'a expr * 'a expr * 'a

type prim1 =
  ...
  | IsTuple

Important: The syntax for EGetItem and ESetItem are quite odd-looking. The restrictions here are added to make your task of type-checking and type-inference easier:

The parser will ensure that these well-formedness constraints are satisfied by user programs; if you choose to create programs that do not satisfy the constraint, then you’ll need to support that generality in your type system instead.

In ANF syntax, these expressions are represented as cexprs, with immexpr components:

type 'a cexpr =
  ...
  | CTuple of 'a immexpr list * 'a
  | CGetItem of 'a immexpr * int * 'a
  | CSetItem of 'a immexpr * int * 'a immexpr * 'a

We’re requiring integer literals in the index position, just as the concrete syntax did. Note that these expressions are all cexprs, and not immexprs – the allocation of a tuple counts as a “step” of execution, and so they are not themselves already values.

To make the bindings work in our AST, we need to enhance our representation of binding positions:

type 'a bind =
  | BName of string * 'a
  | BTuple of 'a bind list * 'a typ * 'a
  | BBlank of 'a typ * 'a

type 'a binding = ('a bind * 'a expr * 'a)

type 'a expr = ...
  | ELet of 'a binding list * 'a expr ' a

type 'a decl =
  | DFun of string * 'a bind list * 'a scheme * 'a expr * 'a

Let-bindings now can take an arbitrary, deeply-structured binding, rather than just simple names. Further, because we have mutation of tuples, these act more like statements than expressions, and so we may need to sequence multiple expressions together. Further still, sequencing of expressions acts just like let-binding the first expression and then ignoring its result, before executing the second expression. In other words, e1 ; e2 means the same thing as let _ = e1 in e2. To avoid making up a new name for the ignored value, we introduce BBlank bindings that indicate the programmer doesn’t need to store this value in a variable. Lastly, now that we have richer binding structure, we’re going to use it everywhere, including in function definitions:

def add-pairs((x1, y1), (x2, y2)):
  (x1 + x2, y1 + y2)

This should be treated as syntactic sugar for a similar function

def add_pairs(p1, p2):
  let (x1, y1) = p1, (x2, y2) = p2 in
  (x1 + x2, y1 + y2)

(Your solution will likely generate different names than p1 or p2.)

1.2 Desugaring away unnecessary complexity

The introduction of destructuring let-bindings, sequencing, and blanks all make the rest of compliation complicated. ANF’ing, stack-slot allocation, and compilation all are affected. We can translate this mess away, though, and avoid dealing with it further.

Nested let-bindings: Given a binding

let (b1, b2, ..., bn) = e in body

we can replace this entire expression with the simpler but more verbose

let temp_name1 = e,
    b1 = temp_name1[0 of n],
    b2 = temp_name1[1 of n],
    ...,
    bn = temp_name1[(n-1) of n]
in body

(Note that the (n-1) in the last binding is not an expression, but a compile-time constant literal integer, deduced solely from the length of the original binding expression.)

Nested function-argument bindings: Function arguments can now be nested bindings as well. The desugaring above almost works, except there’s no explicit e expression to bind to the tuple. Instead, you should generate a temporary argument name, and treat the argument bindings as being bound to those. The example of add_pairs above gives the intuition: it wraps the existing body of the function in these new let-bindings, which reduces the problem to solving nested let-bindings.

Sequences: You should implement a desugar phase of the compiler, which runs somewhere early in the pipeline and which makes subsequent phases easier, by implementing the translations described in this section.

1.3 Semantics and Representation of Tuples

1.3.1 Tuple Heap Layout

Tuples expressions should evaluate their sub-expressions in order from left to right, and store the resulting values on the heap. We discussed several possible representations in class for laying out tuples on the heap; the one you should use for this assignment is:

image

That is, one word is used to store the count of the number of elements in the tuple, and the subsequent words are used to store the values themselves. Note that the count is an actual integer; it is not an encoded Egg-eater integer value.

A tuple value is stored in variables and registers as the address of the first word in the tuple’s memory, but with an additional 1 added to the value to act as a tag. So, for example, if the start address of the above memory were 0x0adadad0, the tuple value would be 0x0adadad1. With this change, we extend the set of tag bits to the following:

Visualized differently, the value layout is:

Bit pattern

     

Value type

0xWWWWWWW[www0]

     

Number

0xFFFFFFF[1111]

     

True

0x7FFFFFF[1111]

     

False

0xWWWWWWW[w001]

     

Tuple

Where W is a “wildcard” nibble and w is a “wildcard” bit.

1.3.2 Accessing Tuple Contents

In a tuple access expression, like

(6, 7, 8, 9)[1 of 2]

The behavior should be:

  1. Evaluate the expression in tuple position (before the brackets), then the index expression (the one inside the brackets).

  2. Check that the tuple position’s value is actually a tuple, and signal an error containing "expected tuple" if not.

  3. Check that the actual tuple length matches the expected length, and signal an error containing "incorrect length" if not.

  4. Check that the index number is a valid index for the tuple value—that is, it is between 0 and the stored number of elements in the tuple minus one. Signal an error containing "index too small" or "index too large" as appropriate.

  5. Evaluate to the tuple element at the specified index.

You can do this with just EAX, but it causes some minor pain. The register ECX has been added to the registers in instruction.ml feel free to generate code that uses both EAX and ECX in this case (for example saving the index in ECX and using EAX to store the address of the tuple). This can save a number of instructions. Note that we will generate code that doesn’t need to use ECX or EAX beyond the extent of this one expression, so there is no need to worry about saving or restoring the old value from ECX.

You also may want to use an extended syntax for mov in order to combine these values for lookup. For example, this kind of arithmetic is allowed inside mov instructions:

mov eax, [eax + ecx * 4]

This would access the memory at the location of eax, offset by the value of ecx * 4. So if the value in ecx were, say 2, this may be part of a scheme for accessing the first element of a tuple (there are other details you should think through here; this is not a complete solution) Feel free to add additional arg types in instruction.ml to support a broader range of mov instructions, if it helps.

Neither ECX nor anything beyond the typical RegOffset is required to make this work, but you may find it interesting to try different shapes of generated instructions.

1.3.3 General Heap Layout

The register ESI has been designated as the heap pointer. The provided main.c does a large malloc call, and passes in the resulting address as an argument to our_code_starts_here. The support code provided fetches this value (as a traditional argument), and stores it in ESI. It also does a bit of arithmetic to make sure that ESI starts at an 8-byte boundary — that is, the last three bits of ESI are 000. It is up to your code to ensure that:

The first point above means that for tuples that take up an odd amount of 4-byte words, ESI needs to leave some “dead space” in order to align with an 8-byte boundary. For example, assume before an allocation ESI is pointing at address 0x0000ada0:

image

And then we need to allocate the tuple (4, true). Since we need one word for the size (2) and one word each for the two values 4 and true, there are 3 words required to store the tuple. If we left the heap in this state:

image

Then we couldn’t use three tag bits when using 0x0000adac (the next allocation we’d need to perform), because the c part uses the third-least significant bit: 0xc == 0b1100, and this would conflict with our tagging strategy. Note that in this assignment, we might be able to get away with 4-byte word boundaries, but in future assignments we will use more tags, and need all three bits. So instead of the above resulting picture, ESI should be moved one word further:

image

The padding is unused space to make the heap allocation strategy with tagging work cleanly — this is certainly a place where you can think about some interesting tradeoffs (what are some of them?)

1.3.4 Interaction with Existing Features

Any time we add a new feature to a language, we need to consider its interactions with all the existing features. In the case of Egg-eater, that means considering:

We’ll take them one at a time.

2 Approaching Reality

With the addition of tuples, Egg-eater is dangerously close to a useful language. Of course, it still puts no control on memory limits, doesn’t have a module system, and has other major holes. However, since we have structured data, we can now, for instance, implement a linked list. We need to pick a value to represent empty either false or 0 will do in a pinch. Then we can write link, which creates a pair of the first with the next link:

def link(first, rest):
  (first, rest)

let mylist = link(1, link(2, link(3, false))) in
  mylist[0]

Now we can write some list functions:

def length(l):
  if l == false: 0
  else:
    1 + length(l[1 of 2])

Try building on this idea, and writing up a basic list library. Write at least sum, to add up a numeric list, append, which concatenates two lists, and reverse, which reverses a list. Hand them in in a file in the input/ directory called lists.egg. Remember that make output/lists.run will build the executable for this file.

Write more functions if you want, as well, and test them out.

3 Recommended TODO List

  1. Implement the ETuple, EGetItem and ESetItem cases in ANF. These should be relatively similar in structure to the other arbitrary-arity expression form, EApp...

  2. Get tuple creation and access working for tuples containing two elements, testing as you go. This is very similar to the pairs code from lecture.

  3. Modify the binary and unary operators to handle tuples appropriately (it may be useful to skip print at first). Test as you go.

  4. Make tuple creation and access work for tuples of any size. Test as you go.

  5. Tackle print for tuples if you haven’t already. Test as you go. Reimplement print as a global function provided from C, rather than as a built-in prim1.

  6. Implement input, a C function that prompts the user for simple input — i.e., numbers or booleans; no need to write your own tuple parser in C! — and returns it to the running program. For example, print(input()) should echo back whatever value the user entered.

  7. Write some list library functions (at least the three above) to really stress your tuple implementation. Rejoice in your implementation of the core features needed for nontrivial computation. (Well, aside from the pesky issue of running out of memory. More on that in lecture soon.)

  8. Implement content-equality rather than address-equality for tuples. And/or, try implementing something more ambitious than lists, like a binary search tree, in Egg-eater. This last point is ungraded, but quite rewarding!

A note on support code — a lot is provided, but you can feel free to overwrite it with your own implementation, if you prefer.

4 List of Deliverables

Again, please ensure the makefile builds your code properly. The black-box tests will give you an automatic 0 if they cannot compile your code!

DO NOT SUBMIT YOUR .git DIRECTORY! For that matter, don’t submit your output or _build directories.

5 Grading Standards

For this assignment, you will be graded on

6 Submission

Wait! Please read the assignment again and verify that you have not forgotten anything!

Please submit your homework to https://handins.ccs.neu.edu/ by the above deadline.

1We say that of all the tuple types, there is no most general unifier that could be used instead.