Assignment 5: Diamondback: Defining functions
Due: Thu 02/14 at 8:59pm
git clone
In this assignment, you’ll implement a compiler for a small language with functions declarations and function calls, conforming to the C stack layout. You’ll also add some static well-formedness checks to the compiler.
As for the project acronym, so far our greatest minds have come up with
Double-word
Intel Architecture
MOstly dynamic
Nested-expression
(Diamondback supports recursion)
Boolean-tagged
ANF-transformed
Compiler.
...K?
1 The Diamondback Language
As usual, we have concrete and abstract syntaxes, along with a specification of semantics.
1.1 Concrete Syntax
The major addition to Diamondback are function declarations. Our programs are now a sequence of zero or more function declarations, followed by a single main expression.
‹program› ‹decls› ‹expr› ‹expr› ‹binop-expr› IDENTIFIER NUMBER true false ‹prim1› ( ‹expr› ) ‹expr› ‹prim2› ‹expr› IDENTIFIER ( ‹exprs› ) IDENTIFIER ( ) true false ( ‹expr› ) ‹prim1› add1sub1 ! printprintStack isboolisnum ‹prim2› +-* <><=>= == &&|| ‹decls› ‹decl› ‹decl› ‹decls› ‹decl› def IDENTIFIER ( ‹ids› ) : ‹expr› def IDENTIFIER ( ) : ‹expr› ‹ids› IDENTIFIER IDENTIFIER , ‹ids› ‹expr› let ‹bindings› in ‹expr› if ‹expr› : ‹expr› else: ‹expr› ‹binop-expr› ‹exprs› ‹expr› ‹expr› , ‹exprs› ‹bindings› IDENTIFIER = ‹expr› IDENTIFIER = ‹expr› , ‹bindings›
Diamondback also supports line comments, beginning with a #
and
continuing to the end of the line.
The other addition is function applications, which are written
IDENTIFIER(‹exprs›)
, for
example f(1, 2, 3)
. This is the syntax for a call to a function.
Note that primops and function calls are whitespace-sensitive: there can be no
space before the left parenthesis. (This is a workaround for a grammatical
ambiguity in the language. You can see a remnant of that ambiguity by the "1
shift/reduce conflict" warning produced by ocamlyacc
in the provided
code.)
1.2 Abstract Syntax
In this assignment, we distinguish between the surface syntax that the parser
produces (expr
) from the ANF expression forms (aexpr
). All these
types are given in exprs.ml
; below we show a simplified, undecorated form:
type prim1 =
| Add1
| Sub1
| Not
| Print
| IsNum
| IsBool
type prim2 =
| Plus
| Minus
| Times
| And
| Or
| Greater
| GreaterEq
| Less
| LessEq
| Eq
(* Surface expressions *)
type bind = (string * expr)
type expr =
| ELet of bind list * expr
| EPrim1 of prim1 * expr
| EPrim2 of prim2 * expr * expr
| EApp of string * expr list
| EIf of expr * expr * expr
| ENumber of int
| EBool of bool
| EId of string
type decl =
| DFun of string * string list * expr
type program =
| Program of decl list * expr
(* Internal ANF expressions *)
type immexpr =
| ImmNumber of int
| ImmBool of bool
| ImmId of string
and cexpr =
| CPrim1 of prim1 * immexpr
| CPrim2 of prim2 * immexpr * immexpr
| CApp of string * immexpr list
| CIf of immexpr * aexpr * aexpr
| CImmExpr of immexpr
and aexpr =
| ALet of string * cexpr * aexpr
| ACExpr of cexpr
and adecl =
| ADFun of string * string list * aexpr
and aprogram =
| AProgram of adecl list * aexpr
The EApp
and CApp
constructors correspond to function applications, and
the decl
and adecl
types respresent function declarations. A program is a
list of decls
(surface) or adecls
(ANF), with a main expression.
1.3 Semantics
There are several distinguishing features of Diamondback. The first is function applications. A function application should give the answer we’d get if we followed the rules for substituting argument values for parameter names. So, for example:
def f(x, y):
x + y
f(1, 2)
should produce 3.
Your compiler should use the rules for C stacks discussed in class and at this assembly guide to implement this behavior.
There are a number of new errors that can occur now that we have function declarations and calls. Your implementation should catch all of these cases statically; that is, before the program runs:
A function application with the wrong number of arguments should signal an
Arity
errorA function application of a non-existent function should signal an
UnboundFun
errorAn identifier without a corresponding binding location should report an
UnboundId
errorA let binding with duplicate names should report a
DuplicateId
errorA let binding that shadows an existing name should report a
ShadowId
errorA function declaration with duplicate names in the argument list should report a
DuplicateId
errorIf there are multiple function definitions with the same name, report a
DuplicateFun
errorIf a numeric constant is too large (as discussed in class), report an
Overflow
error
Again, these errors should stop the program from compiling, not happen
at runtime. Moreover, you should report all the errors in this
program, not just the first one encountered. See the notes on is_well_formed
below for implementation details.
These errors, in addition to ParseError
s, are all defined in
errors.ml
. Additionally, there are two more errors given to you:
NotYetImplemented
errors, which you should obviously remove and implement;
and InternalCompilerError
s, which indicate intended-to-be-impossible
scenarios in your compiler. Ideally, your final compiler should not need these
(either by coming up with useful semantics in all scenarios, or by using the
ML type system to your advantage to avoid any such unplanned scenarios, but in
practice you might still need them.
1.4 Implementation
The file assembly.ml
gives you all the assembly instructions you should
need, though you’re free to add your own new instructions if you want.
There are a few new pieces to the implementation:
A new structure for ANF. You should study and understand it (see Lecture 8: Revisiting ANF: Encoding A-Normal Form with Types) so you can implement the
EApp
case. This is the first thing you should add.A new
is_well_formed
function, which is called prior to performing ANF and are used to report the errors above. This function either returns the program unchanged, or returns anexn list
, which represents all the errors that are found in a program or expression. So a program likedef f(x, x): y f(1)
would report three errors, one for
y
being unbound, one for duplicatedx
parameters, and one for the arity mismatch on the call tof
.These errors can be reported in any order; in general (and in grading), it’s easy to test for one at a time. But reporting multiple errors makes using the
main
of the compiler much more pleasant, and is a nice view into the kinds of compiler ergonomics we should expect from a modern compiler.The
AProgram
structure, which contains both function declarations and the expression forour_code_starts_here
. You will probably want to generate an assembly structure that looks like:;; extern and global stuff section .text ... fun_decl1: ;; code for fun_decl1, including stack management fun_decl2: ;; code for fun_decl2, including stack management ... our_code_starts_here: ;; main entrypoint, as before, with stack management ;; errors, as before internal_error_non_number: ...
(Hint: while the
AProgram
construct is not quite uniform, since the body is treated separately from the function definitions, a relatively simple observation makes generating the output assembly of the entire program very uniform...)
1.5 Tail calls
Refine your compilation of function applications to support tail-calls. There are three levels of complexity here:
6410: required; 4410: optional. Support self-recursive tail-calls.
6410: required; 4410: optional. Support arbitrary tail-calls, when the number of arguments is the same between caller and callee.
Everyone: Optional. Support arbitrary tail-calls, when the number of arguments is different between caller and callee.
To support the last one, you’ll need to be creative with the calling
convention. The only truly essential feature of a stack frame is knowing where
the saved ESP
, EBP
and return addresses must be, and all other
variables’ locations can be calculated from knowing those. In particular,
you’ll need to change the ordering of values in your stack frame, where
EBP
points between parameters and locals. Instead, both parameters and
locals will need to be above EBP
: change your compilation
of function calls —call
instruction (why?), so you’ll need to come up with something clever to save
the desired return address. Come see me if you try this, for a hint.
To detect that you’ve implemented tail calls correctly, you may want to attempt the following extra credit.
1.6 Extra credit: Printing stack traces
Implement a new prim1
operator, printStack
, that takes a
single argument and returns it, just like print
does. What makes
printStack
special is its output: it will print out a stack trace
of all the call frames from wherever it’s called, down to wherever
our_code_starts_here
began. For example, given the following program:
def fact(acc, n):
if n < 1: printStack(acc)
else: fact(acc * n, n - 1) + 0
# NOTE: This + 0 is to prevent tail-recursion in this example
fact(1, 3)
ESP: 0xffa4ee40 ==> -2984148
EBP: 0xffa4ee58 ==> -2984132
(difference: -6)
Requested return val: 0xc ==> 6
Num args: 2
0xffa4ee40: 0xffa4ee58 ==> old ebp
0xffa4ee44: 0x00000002 ==> 1
0xffa4ee48: 0000000000 ==> 0
0xffa4ee4c: 0000000000 ==> 0
0xffa4ee50: 0000000000 ==> 0
0xffa4ee54: 0xffffffff ==> true
EBP 0xffa4ee58: 0xffa4ee78 ==> old ebp
0xffa4ee5c: 0x08048c44 ==> saved ret
0xffa4ee60: 0x0000000c ==> 6
0xffa4ee64: 0000000000 ==> 0
0xffa4ee68: 0000000000 ==> 0
0xffa4ee6c: 0000000000 ==> 0
0xffa4ee70: 0x0000000c ==> 6
0xffa4ee74: 0x7fffffff ==> false
EBP 0xffa4ee78: 0xffa4ee98 ==> old ebp
0xffa4ee7c: 0x08048c44 ==> saved ret
0xffa4ee80: 0x0000000c ==> 6
0xffa4ee84: 0x00000002 ==> 1
0xffa4ee88: 0000000000 ==> 0
0xffa4ee8c: 0x00000002 ==> 1
0xffa4ee90: 0x0000000c ==> 6
0xffa4ee94: 0x7fffffff ==> false
EBP 0xffa4ee98: 0xffa4eeb8 ==> old ebp
0xffa4ee9c: 0x08048c44 ==> saved ret
0xffa4eea0: 0x00000006 ==> 3
0xffa4eea4: 0x00000004 ==> 2
0xffa4eea8: 0000000000 ==> 0
0xffa4eeac: 0x00000004 ==> 2
0xffa4eeb0: 0x00000006 ==> 3
0xffa4eeb4: 0x7fffffff ==> false
EBP 0xffa4eeb8: 0xffa4eec8 ==> old ebp
0xffa4eebc: 0x08048c7b ==> saved ret
0xffa4eec0: 0x00000002 ==> 1
0xffa4eec4: 0x00000006 ==> 3
BOT 0xffa4eec8: 0xffa4eef8 ==> old ebp
0xffa4eecc: 0x08048b74 ==> saved ret
0xffa4eed0: 0x1 ==> heap
6
The output shows:
The current values of
ESP
andEBP
at the moment of the theprintStack
call. The difference between them should be the size of the current stack frame.The requested return value is the argument given to
printStack
.Num args
shows the number of arguments passed to the current function – i.e., the two arguments given tofact
.The next bunch of segments show individual stack frames. The three columns show the memory address, the value at that address, and then the result of
print
ing that address as a Diamondback value, or identifying it as the savedEBP
and return address. Each line labelledEBP
indicates the beginning of a new stack frame: the value at each of those lines contains the address of the nextEBP
line.Looking at these stack frames, from bottom to top, you can see: the boolean result of
n < 1
, local valuesacc * n
andn - 1
, a slot for the return value of thefact(...)
call (which is still zero because the call hasn’t finished yet), then the argumentsn
andacc
for the next call.The final segment, marked
BOT
, is the stack frame ofmain
itself.The last line of output is the final output of the entire program, and comes from the
printf
inmain()
.
Hint: To identify where to stop printing the stack (after all, main
isn’t the very last thing in the stack; there are operating-system frames as
well...), I used a trick: declare a global, non-const
variable in C,
int* STACK_BOTTOM
, and mark it as extern
. At the beginning of
our_code_starts_here
, initialize that value with the current value of
EBP
(you’ll need a new assembly arg
for this), and then you’re
guaranteed to know where your stack ends, dynamically.
Note: if you’ve properly implemented tail calls, then eliminating the
+ 0
from the program above, and rerunning it, should produce a
much shorter stack trace...
Note: If you implement this extra credit, provide a brief, written
explanation (in a comment in main.c
, near whatever new functions you
define) of why this must be implemented as a EPrim1
and not as a
EApp
expression.
1.7 Testing
There is one new testing function provided, tvg
. This works just like t
,
except it runs valgrind
on the executable it creates, and checks whether or
not it reports errors. If valgrind
does report memory errors, the test
fails and the errors are reported. The test succeeds if there are no memory
errors and the answer is correct.
You can test the well-formedness errors using the error tester as usual.
2 Running main
Running your own programs is the same as with Cobra, except you’ll give them
the .diamond
file extension.
3 List of Deliverables
your
compile.ml
any additional modules you saw fit to write
any starter modules that you modifed
tests in an OUnit test module (
test.ml
)any test input programs (
input/*.diamond
files)any additional files that are needed to build your tests (e.g.
main.c
, etc)
Again, please ensure the makefile builds your code properly. The black-box tests will give you an automatic 0 if they cannot compile your code!
DO NOT SUBMIT YOUR .git
DIRECTORY! For that matter, don’t submit
your output
or _build
directories.
4 Grading Standards
For this assignment, you will be graded on
Whether your code implements the specification (functional correctness),
the clarity and cleanliness of your code, and
the comprehensiveness of your test coverage
5 Submission
Wait! Please read the assignment again and verify that you have not forgotten anything!
Please submit your homework to https://handins.ccs.neu.edu/ by the above deadline.