Assignment 5: Diamondback: Defining functions

7.2

Assignment 5: Diamondback: Defining functions

Due: Thu 02/14 at 8:59pm

git clone

In this assignment, you’ll implement a compiler for a small language with functions declarations and function calls, conforming to the C stack layout. You’ll also add some static well-formedness checks to the compiler.

As for the project acronym, so far our greatest minds have come up with

Double-word
Intel Architecture
MOstly dynamic
Nested-expression
(Diamondback supports recursion)
Boolean-tagged
ANF-transformed
Compiler.
...K?

1 The Diamondback Language

As usual, we have concrete and abstract syntaxes, along with a specification of semantics.

1.1 Concrete Syntax

The major addition to Diamondback are function declarations. Our programs are now a sequence of zero or more function declarations, followed by a single main expression.

‹program›: | ‹decls› ‹expr› | ‹expr› ‹binop-expr›: | IDENTIFIER | NUMBER | true | false | ‹prim1› ( ‹expr› ) | ‹expr› ‹prim2› ‹expr› | IDENTIFIER ( ‹exprs› ) | IDENTIFIER ( ) | true | false | ( ‹expr› ) ‹prim1›: | add1 | sub1 | ! | print | printStack | isbool | isnum ‹prim2›: | + | - | * | < | > | <= | >= | == | && | || ‹decls›: | ‹decl› | ‹decl› ‹decls› ‹decl›: | def IDENTIFIER ( ‹ids› ) : ‹expr› | def IDENTIFIER ( ) : ‹expr› ‹ids›: | IDENTIFIER | IDENTIFIER , ‹ids› ‹expr›: | let ‹bindings› in ‹expr› | if ‹expr› : ‹expr› else: ‹expr› | ‹binop-expr› ‹exprs›: | ‹expr› | ‹expr› , ‹exprs› ‹bindings›: | IDENTIFIER = ‹expr› | IDENTIFIER = ‹expr› , ‹bindings›

Diamondback also supports line comments, beginning with a # and continuing to the end of the line.

The other addition is function applications, which are written IDENTIFIER(‹exprs›), for example f(1, 2, 3). This is the syntax for a call to a function. Note that primops and function calls are whitespace-sensitive: there can be no space before the left parenthesis. (This is a workaround for a grammatical ambiguity in the language. You can see a remnant of that ambiguity by the "1 shift/reduce conflict" warning produced by ocamlyacc in the provided code.)

1.2 Abstract Syntax

In this assignment, we distinguish between the surface syntax that the parser produces (expr) from the ANF expression forms (aexpr). All these types are given in exprs.ml; below we show a simplified, undecorated form:

type prim1 =
  | Add1
  | Sub1
  | Not
  | Print
  | IsNum
  | IsBool

type prim2 =
  | Plus
  | Minus
  | Times
  | And
  | Or
  | Greater
  | GreaterEq
  | Less
  | LessEq
  | Eq

(* Surface expressions *)
type bind = (string * expr)

type expr =
  | ELet of bind list * expr
  | EPrim1 of prim1 * expr
  | EPrim2 of prim2 * expr * expr
  | EApp of string * expr list
  | EIf of expr * expr * expr
  | ENumber of int
  | EBool of bool
  | EId of string

type decl =
  | DFun of string * string list * expr

type program =
  | Program of decl list * expr


(* Internal ANF expressions *)
type immexpr =
  | ImmNumber of int
  | ImmBool of bool
  | ImmId of string

and cexpr =
  | CPrim1 of prim1 * immexpr
  | CPrim2 of prim2 * immexpr * immexpr
  | CApp of string * immexpr list
  | CIf of immexpr * aexpr * aexpr
  | CImmExpr of immexpr

and aexpr =
  | ALet of string * cexpr * aexpr
  | ACExpr of cexpr

and adecl =
  | ADFun of string * string list * aexpr

and aprogram =
  | AProgram of adecl list * aexpr

The EApp and CApp constructors correspond to function applications, and the decl and adecl types respresent function declarations. A program is a list of decls (surface) or adecls (ANF), with a main expression.

1.3 Semantics

There are several distinguishing features of Diamondback. The first is function applications. A function application should give the answer we’d get if we followed the rules for substituting argument values for parameter names. So, for example:

def f(x, y):
  x + y

f(1, 2)

should produce 3.

Your compiler should use the rules for C stacks discussed in class and at this assembly guide to implement this behavior.

There are a number of new errors that can occur now that we have function declarations and calls. Your implementation should catch all of these cases statically; that is, before the program runs:

A function application with the wrong number of arguments should signal an Arity error
A function application of a non-existent function should signal an UnboundFun error
An identifier without a corresponding binding location should report an UnboundId error
A let binding with duplicate names should report a DuplicateId error
A let binding that shadows an existing name should report a ShadowId error
A function declaration with duplicate names in the argument list should report a DuplicateId error
If there are multiple function definitions with the same name, report a DuplicateFun error
If a numeric constant is too large (as discussed in class), report an Overflow error

Again, these errors should stop the program from compiling, not happen at runtime. Moreover, you should report all the errors in this program, not just the first one encountered. See the notes on is_well_formed below for implementation details.

These errors, in addition to ParseErrors, are all defined in errors.ml. Additionally, there are two more errors given to you: NotYetImplemented errors, which you should obviously remove and implement; and InternalCompilerErrors, which indicate intended-to-be-impossible scenarios in your compiler. Ideally, your final compiler should not need these (either by coming up with useful semantics in all scenarios, or by using the ML type system to your advantage to avoid any such unplanned scenarios, but in practice you might still need them.

1.4 Implementation

The file assembly.ml gives you all the assembly instructions you should need, though you’re free to add your own new instructions if you want.

There are a few new pieces to the implementation:

A new structure for ANF. You should study and understand it (see Lecture 8: Revisiting ANF: Encoding A-Normal Form with Types) so you can implement the EApp case. This is the first thing you should add.
A new is_well_formed function, which is called prior to performing ANF and are used to report the errors above. This function either returns the program unchanged, or returns an exn list, which represents all the errors that are found in a program or expression. So a program like
```
def f(x, x):
  y
f(1)
```
would report three errors, one for y being unbound, one for duplicated x parameters, and one for the arity mismatch on the call to f.
These errors can be reported in any order; in general (and in grading), it’s easy to test for one at a time. But reporting multiple errors makes using the main of the compiler much more pleasant, and is a nice view into the kinds of compiler ergonomics we should expect from a modern compiler.
The AProgram structure, which contains both function declarations and the expression for our_code_starts_here. You will probably want to generate an assembly structure that looks like:
```
  ;; extern and global stuff
section .text
...
fun_decl1:
  ;; code for fun_decl1, including stack management
fun_decl2:
  ;; code for fun_decl2, including stack management
...
our_code_starts_here:
  ;; main entrypoint, as before, with stack management

  ;; errors, as before
internal_error_non_number:
...
```
(Hint: while the AProgram construct is not quite uniform, since the body is treated separately from the function definitions, a relatively simple observation makes generating the output assembly of the entire program very uniform...)

1.5 Tail calls

Refine your compilation of function applications to support tail-calls. There are three levels of complexity here:

6410: required; 4410: optional. Support self-recursive tail-calls.
6410: required; 4410: optional. Support arbitrary tail-calls, when the number of arguments is the same between caller and callee.
Everyone: Optional. Support arbitrary tail-calls, when the number of arguments is different between caller and callee.

To support the last one, you’ll need to be creative with the calling convention. The only truly essential feature of a stack frame is knowing where the saved ESP, EBP and return addresses must be, and all other variables’ locations can be calculated from knowing those. In particular, you’ll need to change the ordering of values in your stack frame, where EBP points between parameters and locals. Instead, both parameters and locals will need to be above EBP: change your compilation of function calls — and anything else that is affected! — to support this new convention. This change will prevent you from using the call instruction (why?), so you’ll need to come up with something clever to save the desired return address. Come see me if you try this, for a hint.

To detect that you’ve implemented tail calls correctly, you may want to attempt the following extra credit.

1.6 Extra credit: Printing stack traces

Implement a new prim1 operator, printStack, that takes a single argument and returns it, just like print does. What makes printStack special is its output: it will print out a stack trace of all the call frames from wherever it’s called, down to wherever our_code_starts_here began. For example, given the following program:

def fact(acc, n):
  if n < 1: printStack(acc)
  else: fact(acc * n, n - 1) + 0
  # NOTE: This + 0 is to prevent tail-recursion in this example

fact(1, 3)

Here is a potential stack trace:

ESP: 0xffa4ee40	==>  -2984148
EBP: 0xffa4ee58	==>  -2984132
(difference: -6)
Requested return val: 0xc ==> 6
Num args: 2
    0xffa4ee40: 0xffa4ee58	==>  old ebp
    0xffa4ee44: 0x00000002	==>  1
    0xffa4ee48: 0000000000	==>  0
    0xffa4ee4c: 0000000000	==>  0
    0xffa4ee50: 0000000000	==>  0
    0xffa4ee54: 0xffffffff	==>  true
EBP 0xffa4ee58: 0xffa4ee78	==>  old ebp
    0xffa4ee5c: 0x08048c44	==>  saved ret
    0xffa4ee60: 0x0000000c	==>  6
    0xffa4ee64: 0000000000	==>  0
    0xffa4ee68: 0000000000	==>  0
    0xffa4ee6c: 0000000000	==>  0
    0xffa4ee70: 0x0000000c	==>  6
    0xffa4ee74: 0x7fffffff	==>  false
EBP 0xffa4ee78: 0xffa4ee98	==>  old ebp
    0xffa4ee7c: 0x08048c44	==>  saved ret
    0xffa4ee80: 0x0000000c	==>  6
    0xffa4ee84: 0x00000002	==>  1
    0xffa4ee88: 0000000000	==>  0
    0xffa4ee8c: 0x00000002	==>  1
    0xffa4ee90: 0x0000000c	==>  6
    0xffa4ee94: 0x7fffffff	==>  false
EBP 0xffa4ee98: 0xffa4eeb8	==>  old ebp
    0xffa4ee9c: 0x08048c44	==>  saved ret
    0xffa4eea0: 0x00000006	==>  3
    0xffa4eea4: 0x00000004	==>  2
    0xffa4eea8: 0000000000	==>  0
    0xffa4eeac: 0x00000004	==>  2
    0xffa4eeb0: 0x00000006	==>  3
    0xffa4eeb4: 0x7fffffff	==>  false
EBP 0xffa4eeb8: 0xffa4eec8	==>  old ebp
    0xffa4eebc: 0x08048c7b	==>  saved ret
    0xffa4eec0: 0x00000002	==>  1
    0xffa4eec4: 0x00000006	==>  3
BOT 0xffa4eec8: 0xffa4eef8	==>  old ebp
    0xffa4eecc: 0x08048b74	==>  saved ret
    0xffa4eed0: 0x1	==>  heap
6

The output shows:

The current values of ESP and EBP at the moment of the the printStack call. The difference between them should be the size of the current stack frame.
The requested return value is the argument given to printStack.
Num args shows the number of arguments passed to the current function – i.e., the two arguments given to fact.
The next bunch of segments show individual stack frames. The three columns show the memory address, the value at that address, and then the result of printing that address as a Diamondback value, or identifying it as the saved EBP and return address. Each line labelled EBP indicates the beginning of a new stack frame: the value at each of those lines contains the address of the next EBP line.
Looking at these stack frames, from bottom to top, you can see: the boolean result of n < 1, local values acc * n and n - 1, a slot for the return value of the fact(...) call (which is still zero because the call hasn’t finished yet), then the arguments n and acc for the next call.
The final segment, marked BOT, is the stack frame of main itself.
The last line of output is the final output of the entire program, and comes from the printf in main().

Hint: To identify where to stop printing the stack (after all, main isn’t the very last thing in the stack; there are operating-system frames as well...), I used a trick: declare a global, non-const variable in C, int* STACK_BOTTOM, and mark it as extern. At the beginning of our_code_starts_here, initialize that value with the current value of EBP (you’ll need a new assembly arg for this), and then you’re guaranteed to know where your stack ends, dynamically.

Note: if you’ve properly implemented tail calls, then eliminating the + 0 from the program above, and rerunning it, should produce a much shorter stack trace...

Note: If you implement this extra credit, provide a brief, written explanation (in a comment in main.c, near whatever new functions you define) of why this must be implemented as a EPrim1 and not as a EApp expression.

1.7 Testing

There is one new testing function provided, tvg. This works just like t, except it runs valgrind on the executable it creates, and checks whether or not it reports errors. If valgrind does report memory errors, the test fails and the errors are reported. The test succeeds if there are no memory errors and the answer is correct.

You can test the well-formedness errors using the error tester as usual.

2 Running main

Running your own programs is the same as with Cobra, except you’ll give them the .diamond file extension.

3 List of Deliverables

your compile.ml
any additional modules you saw fit to write
any starter modules that you modifed
tests in an OUnit test module (test.ml)
any test input programs (input/*.diamond files)
any additional files that are needed to build your tests (e.g. main.c, etc)

Again, please ensure the makefile builds your code properly. The black-box tests will give you an automatic 0 if they cannot compile your code!

DO NOT SUBMIT YOUR .git DIRECTORY! For that matter, don’t submit your output or _build directories.

4 Grading Standards

For this assignment, you will be graded on

Whether your code implements the specification (functional correctness),
the clarity and cleanliness of your code, and
the comprehensiveness of your test coverage

5 Submission

Wait! Please read the assignment again and verify that you have not forgotten anything!

Please submit your homework to https://handins.ccs.neu.edu/ by the above deadline.

1	The Diamondback Language
2	Running main
3	List of Deliverables
4	Grading Standards
5	Submission