Assignment 4: Cobra: Multiple types of values
Due: Friday 02/07 at 9:00pm
git clone
In this compiler, you’ll deal with COded Binary RepresntAtions of values.
1 The Cobra Language
1.1 Concrete Syntax
The concrete syntax of Cobra is very similar to Boa, with a few new additions.
‹prim1› ... ! print isbool isnum ‹expr› ... true false ‹expr› < ‹expr› ‹expr› > ‹expr› ‹expr› <= ‹expr› ‹expr› >= ‹expr› ‹expr› == ‹expr› ‹expr› && ‹expr› ‹expr› || ‹expr›
1.2 Abstract Syntax
The abstract syntax is very similar to Boa, also:
type prim1 = ...
| Print
| IsBool
| IsNum
| Not
type prim2 = ...
| And
| Or
| Greater
| GreaterEq
| Less
| LessEq
| Eq
type 'a expr = ...
| EBool of bool * 'a
1.3 Semantics
The semantics of booleans are straightforward. The semantics of EIf
changes slightly: its condition must evaluate to a boolean, and it branches on
the truth or falsehood of that value, rather than whether it’s nonzero.
With the addition of two types to the language, there are two main changes that ripple through the implementation:
The representation of values
The possibility of errors
There is one other major addition, which is the print
primitive, discussed
more below.
The representation of values requires a definition. We’ll use the following representations for the Cobra runtime:
true
will be represented as the constant0xFFFFFFFFFFFFFFFF
. (NOTE: we used this value in class, but the lecture notes only specified the outermost bits; this is mandating that the middle bits all be set, too.)false
will be represented as the constant0x7FFFFFFFFFFFFFFF
. (NOTE: we used this value in class, but the lecture notes only specified the outermost bits; this is mandating that the middle bits all be set, too.)numbers will be represented with a zero in the rightmost bit, by shifting them left 1 bit. So, for example, 2 is represented as
0x0000000000000004
.
You should augment the provided print
function in main.c
to print
these values correctly: true
and false
should print as those words,
and numbers should print out as the underlying number being represented.
You should raise errors in the following cases:
-
,+
,*
,add1
andsub1
should raise an error (by printing it out) with the substring"arithmetic expected a number"
if the operation’s argument(s) are not numbers. You can print more information than this if you like, but the message must have at least this substring.<
,<=
,>
and>=
should raise an error with the substring"comparison expected a number"
if the arguments are not both numbers.+
,-
,*
,add1
andsub1
should raise an error with the substring"overflow"
if the result overflows, and falls outside the range representable in 63 bits. Thejo
instruction will be useful: it jumps if the last instruction overflowed.if
should raise an error with the substring"if expected a boolean"
if the conditional value is not a boolean.and
andor
should should short-circuit if their left argument determines the answer fully, without evaluating or tag-checking the right argument. Otherwise, they should raise an error with the substring"logic expected a boolean"
if the arguments are not both booleans.not
should raise an error with the substring"logic expected a boolean"
if its argument is not a boolean.==
should work polymorphically, regardless of the types of its inputs.
These error messages should be printed on standard error. The other operators never fail, and so never produce errors.
We add two new primitive operations, isbool
and isnum
. These two
operations have an effective type of Any -> Bool
, and will return
true
if the argument they receive is indeed a boolean or number,
respectively.
The last required primitive operation is print
, which prints its single
argument to the command line, and then returns it. The print
function in
main.c
explicitly returns its argument; you will need to retrieve and
use that value in your compiled output.
2 Examples
The expression
let x = 1 in let y = print(x + 1) in print(y + 2)
will output
2 4 4
The first 2 comes from the first
print
expression. The first 4 comes from the secondprint
expression. The final line prints the answer of the program as usual, so there’s an “extra” 4.The expression
if 54: true else: false
prints (on standard error) something like:
Error: if expected a boolean, got 54
3 Implementation strategies
3.1 Rendering errors
To display error messages on standard error, you’ll need to use a call something like:
fprintf(stderr, "Error: arithmetic expected a number");
I recommend that you design a function void error(int errCode)
in
main.c
, that handles all errors in uniform manner. Then, you should add
a suffix to your generated assembly (i.e., change what goes in
compile_anf_to_string
) that looks something like:
err_arith_not_num:
mov RDI, <whatever your error code for arithmetic-op-didn't-get-numbers is>
call error
You are welcome to make this signature more elaborate, to pass the mistaken value into the error handler so that it can be printed as part of the error message. We sketched this possibility out in class and in the lecture notes.
3.2 Memory Layout and Calling C Functions
In order to set up the stack properly to call C functions, like print
and your
error functions, it’s necessary to make a few changes to what we had in Boa.
Allocating stack space ahead of time: At the start of our generated code (which we now recognize is a function body participating in a C runtime stack), we need to make sure we make enough stack space for all variables we create, and reserve that space ahead of time. To do this, we move
RSP
to point to a location that is N words away (so N * 8 bytes for us), where N is the greatest number of variables we need at once. This is actually tricky to compute to be fully optimal (teaser for later in the semester: by “tricky” I mean NP-hard), but it’s easy to get an OK heuristic —we can compute the maximum depth of nested definitions. To do this, we need the
count_vars
function. This is a straightforward function and not very interesting right now, so I’ve provided it. You need to add instructions to compiled output in order to make sure the correct space is allocated on the stack by subtracting the right amount fromRSP
. You may choose to implement this via a simple increment ofRSP
, or you may choose to explicitlypush
an appropriate number of zeros onto the stack, to clear out any uninitialized values. (The former is more efficient; the latter might be helpful for debugging.)Using the Base Pointer: In addition, this means that all variable references need to happen from
RBP
rather than fromRSP
. This is becauseRSP
can change while we are pushing arguments onto the stack for other function calls, so RBP is the place we should trust for consistent offsets.Participating in the C stack: As a C function callee (from
main
) and caller (oferror
functions andprint
), our code has some responsibilities. First, we need to store the old base pointer upon entry, and update the base pointer to hold the current top of the stack (which includes the return pointer into main, for example). This is why the typical top two lines of most C functions are:push RBP mov RBP, RSP
Similarly, when we’re done with the function, we need to restore the stack pointer to its old location, and put the old base pointer value back. This is why the last lines before a
ret
in a C function are often:mov RSP, RBP pop RBP
Other Responsibilities: If we were using registers beyond
RAX
,RBP
, andRSP
, we’d be responsible for storing some of them as callee, and some as caller. But we’re not going to go into those details for this assignment. Since we aren’t using those registers, it has no effect on our code’s behavior.
If you write any functions in main.c
that you need to be available in
your assembly, you need to declare them in the assembly via:
;; In your generated assembly
extern <exported name of function>
They also need to be declared in main.c
such that the compiler won’t
mangle their names, via
// In main.c
extern <your function signature here> asm("exported name of function");
If you forget either of these, your code will not link correctly.
3.3 New Assembly Constructs
Sized
: You may run into errors that report that the size of an operation is ambiguous. This could happen if you write, for example:cmp [RBP-8], 0
because the assembler doesn’t know if the program should compare a four-byte zero, a one-byte zero, or something in between into memory starting at
[RBP-8]
. To solve this, you can supply a size:cmp [RBP-8], QWORD 0
The
QWORD
keyword tells the assembler to use the “quad word” size for 0, which corresponds to 64 bits. AWORD
corresponds to 16 bits, aBYTE
corresponds to 8 bits, and aDWORD
corresponds to 32 bits. To get a sized argument, you can use theSized
constructor fromarg
.HexConst
: Sometimes it’s nice to read things in hex notation as opposed to decimal constants. I’ve provided a newHexConst
arg that’s useful for this case, and augmentedarg_to_asm
for you to support it. It has the same meaning asConst
, and just prints out differently.IPush
,IPop
: These two instructions manage values on the stack.push
adds a value at the current location ofRSP
, and decrementsRSP
to point past the added value (remember, the stack grows toward numerically-smaller addresses, even though we draw it as growing upward in diagrams).pop
incrementsRSP
and moves the value at the locationRSP
was pointing to into the provided arg.ICall
: Acall
does two things:Pushes the next code location onto the stack (just like a
push
), which becomes the return pointerPerforms an unconditional jump to the provided label
Note that
call
does not affectRBP
, which the program must maintain on its own.IShr
,IShl
,ISar
: Bit shifting operations.sar
(shift-arithmetic-right) preserves the sign bit of the value being shifted, whileshr
(shift-right) simply fills in a zero bit.IAnd
,IOr
,IXor
: Bitwise logical operations.ITest
: Performs a bitwise-and of the two arguments, and records whether the result was zero, signed, or overflowed. This can be used in tandem with jumps likejo
orjz
.IJo
,IJno
: Jump to the provided label if the last arithmetic operation did/did not overflowILineComment
andIInstrComment
: these allow you to add a comment on a line by itself, or in-line after an instruction. These might be helpful for readability of your generated output, while you’re debugging.
3.4 Some software engineering considerations
Our compiler is getting sufficiently sophisticated that the pipeline is several
stages long already. Bugs can occur in almost any stage of the pipeline. As a
result, one very useful debugging trick is to engineer the compiler to
save each stage of the pipeline, and print them all out if requested.
Until now, our compiler has produced a string, or thrown an exception. We have
already seen the use of the ('a, 'b) result
type to return either an
Ok
value or an Error
message. In phases.ml
, I’ve combined
these ideas for you.
First, the file contains several constructors for naming each phase of the
compiler, and it defines an 'a pipeline
to be a result
containing
either
An
'a
answer, paired with a list of phases describing how we got here; orA list of exceptions, paired with a list of phases describing how far we’d gotten in compilation
The 'a
part of the answer is whatever the most recent compiler phase
produced as its result.
Next, I’ve defined a few helper functions to add a new phase onto a growing
pipeline. Look carefully at the type for add_phase
: it takes a function
that transforms an 'a
into a 'b
, and an 'a pipeline
, and
produces a 'b pipeline
as a result...and turns the 'b
answer into a
phase and adds it to the growing list. add_phase
should be used for
phases of the compiler that you don’t expect to ever fail: any exceptions that
arise are internal compiler errors. By contrast add_err_phase
takes a
function that takes the current 'a
result and produces an
('b, exn list) result
, that is, it might reasonably produce multiple error messages.
To use these functions, look at the end of compile.ml
:
let compile_to_string (prog : sourcespan program pipeline) : string pipeline =
prog
|> (add_phase well_formed check_scope)
|> (add_phase tagged tag)
|> (add_phase renamed rename)
|> (add_phase anfed (fun p -> tag (anf p)))
|> (add_phase result compile_prog)
;;
(Note: As written, all of these phases either produce an answer or
raise
a single exception. You may want to modify one or more of these
phases to be used by add_err_phase
, depending on how you report errors.)
The |>
operator is reverse function application: x |> f
is the exact
same thing as (f x)
, but it allows us to write pipelines in a neatly
chained form.
You are welcome to add phases to your compiler, should you wish to.
Finally, look at main.ml
. If you run ./main -t yourFile.cobra
,
the compiler will print out the trace of the entire pipeline. If you leave out
the -t
option, it will print the output assembly just as before.
3.5 Testing Functions
3.5.1 Unit testing parts of the compiler
These are the same as they were for Boa. ANF is provided, and hasn’t changed
aside from the addition of new primitives and EBool
. So your tests should
focus on te
and t
tests.
If your program exits with -10 as its exit code, it probably has segfaulted,
meaning it tried to access memory that was not allocated. If you’re familiar
with tools like valgrind, you can run valgrind output/some_test.run
in
order to get a little more feedback. This can sometimes tip you off quite well
as to how memory is off, since sometimes you’ll see code trying to jump to a
constant that’s in your code, or other obvious tells that there’s something off
in the stack. Also, if you’ve done all your stack management correctly,
valgrind will report a clean run for your program!1You may need to
install the libc6-dbg
package.
3.5.2 Integration testing the compiler
In addition to the unit-testing functions, I have added some structure to the
input/
directory, and added support for this in your test.ml
file. There are now four subdirectories:
input/do_pass
: should contain programs that do indeed compile through your compiler and run correctly.input/do_err
: should contain programs that do indeed produce compile or runtime errorsinput/dont_pass
: should contain programs that you think should work properly, but that your compiler doesn’t currently compile correctlyinput/dont_err
: should contain programs that you think should produce compile or runtime errors, but that your compiler compiles for some reason
To specify the intended output or error messages of these programs, read the
README
files in each directory that explains what files you should
create.
Finally, the input_file_test_suite ()
in test.ml
will run all the
programs in your input/
directory as part of your oUnit test suite. I
suspect this will be a much easier way to produce larger-scale test cases than
writing everything in test.ml
directly.
4 Recommended TODO List
Here’s an order in which you could consider tackling the implementation:
Fix the
print
function inmain.c
so that it prints out the right output. It will need to check for the tag using C bitwise operators, and useprintf
or one of its variants to print the right value.Take a first shot at figuring out how to increase the stack appropriately by using
count_vars
.Fill in the
EPrim1
case for everything butprint
, and figure out how to check for errors, and call the “non-number” error reporting function. Test as you go. Be aware that if the function call segfaults, it may be because you need to refine step 2.Implement compiling
print
to assembly by pushing appropriate arguments, thencall
ingprint
. Be aware that if the call doesn’t work, it may be because of step 2 again. Test as you go; be aware that you should test interesting sequences of print expressions and let-bindings to make sure your stack integrity is good before and after calls.Fill in all of the
EPrim2
cases, using the error-reporting from the last step. Test as you go.Complete the if case and test as you go.
5 Running main
Running your own programs is the same as with Boa, except you’ll give them
the .cobra
file extension.
You can also use the -t
command-line flag (described above) to print
out a trace of the compilation process.
6 List of Deliverables
your
compile.ml
any additional modules you saw fit to write
tests in an OUnit test module (
test.ml
)any test input programs (
input/*.cobra
files orinput/*/*
subdirectories)any additional files that are needed to build your tests (e.g.
main.c
, etc)
Again, please ensure the makefile builds your code properly. The black-box tests will give you an automatic 0 if they cannot compile your code!
DO NOT SUBMIT YOUR .git
DIRECTORY! For that matter, don’t submit
your output
or _build
directories.
7 Grading Standards
For this assignment, you will be graded on
Whether your code implements the specification (functional correctness),
the clarity and cleanliness of your code, and
the comprehensiveness of your test coverage
8 Submission
Wait! Please read the assignment again and verify that you have not forgotten anything!
Please submit your homework to https://handins.khoury.northeastern.edu/ by the above deadline.