Assignment 4: Cobra: Multiple types of values

8.14

Assignment 4: Cobra: Multiple types of values🔗

Due: Friday 02/07 at 9:00pm

git clone

In this compiler, you’ll deal with COded Binary RepresntAtions of values.

1 The Cobra Language🔗

1.1 Concrete Syntax🔗

The concrete syntax of Cobra is very similar to Boa, with a few new additions.

‹prim1›: ... | ! | print | isbool | isnum ‹expr›: ... | true | false | ‹expr› < ‹expr› | ‹expr› > ‹expr› | ‹expr› <= ‹expr› | ‹expr› >= ‹expr› | ‹expr› == ‹expr› | ‹expr› && ‹expr› | ‹expr› || ‹expr›

1.2 Abstract Syntax🔗

The abstract syntax is very similar to Boa, also:

type prim1 = ...
  | Print
  | IsBool
  | IsNum
  | Not

type prim2 = ...
  | And
  | Or
  | Greater
  | GreaterEq
  | Less
  | LessEq
  | Eq

type 'a expr = ...
  | EBool of bool * 'a

1.3 Semantics🔗

The semantics of booleans are straightforward. The semantics of EIf changes slightly: its condition must evaluate to a boolean, and it branches on the truth or falsehood of that value, rather than whether it’s nonzero.

With the addition of two types to the language, there are two main changes that ripple through the implementation:

The representation of values
The possibility of errors

There is one other major addition, which is the print primitive, discussed more below.

The representation of values requires a definition. We’ll use the following representations for the Cobra runtime:

true will be represented as the constant 0xFFFFFFFFFFFFFFFF. (NOTE: we used this value in class, but the lecture notes only specified the outermost bits; this is mandating that the middle bits all be set, too.)
false will be represented as the constant 0x7FFFFFFFFFFFFFFF. (NOTE: we used this value in class, but the lecture notes only specified the outermost bits; this is mandating that the middle bits all be set, too.)
numbers will be represented with a zero in the rightmost bit, by shifting them left 1 bit. So, for example, 2 is represented as 0x0000000000000004.

You should augment the provided print function in main.c to print these values correctly: true and false should print as those words, and numbers should print out as the underlying number being represented.

You should raise errors in the following cases:

-, +, *, add1 and sub1 should raise an error (by printing it out) with the substring "arithmetic expected a number" if the operation’s argument(s) are not numbers. You can print more information than this if you like, but the message must have at least this substring.
<, <=, > and >= should raise an error with the substring "comparison expected a number" if the arguments are not both numbers.
+, -, *, add1 and sub1 should raise an error with the substring "overflow" if the result overflows, and falls outside the range representable in 63 bits. The jo instruction will be useful: it jumps if the last instruction overflowed.
if should raise an error with the substring "if expected a boolean" if the conditional value is not a boolean.
and and or should should short-circuit if their left argument determines the answer fully, without evaluating or tag-checking the right argument. Otherwise, they should raise an error with the substring "logic expected a boolean" if the arguments are not both booleans.
not should raise an error with the substring "logic expected a boolean" if its argument is not a boolean.
== should work polymorphically, regardless of the types of its inputs.

These error messages should be printed on standard error. The other operators never fail, and so never produce errors.

We add two new primitive operations, isbool and isnum. These two operations have an effective type of Any -> Bool, and will return true if the argument they receive is indeed a boolean or number, respectively.

The last required primitive operation is print, which prints its single argument to the command line, and then returns it. The print function in main.c explicitly returns its argument; you will need to retrieve and use that value in your compiled output.

2 Examples🔗

The expression
```
let x = 1 in
let y = print(x + 1) in
print(y + 2)
```
will output
```
2
4
4
```
The first 2 comes from the first print expression. The first 4 comes from the second print expression. The final line prints the answer of the program as usual, so there’s an “extra” 4.

The expression

if 54: true else: false

prints (on standard error) something like:

Error: if expected a boolean, got 54

3 Implementation strategies🔗

3.1 Rendering errors🔗

To display error messages on standard error, you’ll need to use a call something like:

fprintf(stderr, "Error: arithmetic expected a number");

I recommend that you design a function void error(int errCode) in main.c, that handles all errors in uniform manner. Then, you should add a suffix to your generated assembly (i.e., change what goes in compile_anf_to_string) that looks something like:

err_arith_not_num:
  mov RDI, <whatever your error code for arithmetic-op-didn't-get-numbers is>
  call error

You are welcome to make this signature more elaborate, to pass the mistaken value into the error handler so that it can be printed as part of the error message. We sketched this possibility out in class and in the lecture notes.

3.2 Memory Layout and Calling C Functions🔗

In order to set up the stack properly to call C functions, like print and your error functions, it’s necessary to make a few changes to what we had in Boa.

Allocating stack space ahead of time: At the start of our generated code (which we now recognize is a function body participating in a C runtime stack), we need to make sure we make enough stack space for all variables we create, and reserve that space ahead of time. To do this, we move RSP to point to a location that is N words away (so N * 8 bytes for us), where N is the greatest number of variables we need at once. This is actually tricky to compute to be fully optimal (teaser for later in the semester: by “tricky” I mean NP-hard), but it’s easy to get an OK heuristic — we can compute the maximum depth of nested definitions.
To do this, we need the count_vars function. This is a straightforward function and not very interesting right now, so I’ve provided it. You need to add instructions to compiled output in order to make sure the correct space is allocated on the stack by subtracting the right amount from RSP. You may choose to implement this via a simple increment of RSP, or you may choose to explicitly push an appropriate number of zeros onto the stack, to clear out any uninitialized values. (The former is more efficient; the latter might be helpful for debugging.)
Using the Base Pointer: In addition, this means that all variable references need to happen from RBP rather than from RSP. This is because RSP can change while we are pushing arguments onto the stack for other function calls, so RBP is the place we should trust for consistent offsets.
Participating in the C stack: As a C function callee (from main) and caller (of error functions and print), our code has some responsibilities. First, we need to store the old base pointer upon entry, and update the base pointer to hold the current top of the stack (which includes the return pointer into main, for example). This is why the typical top two lines of most C functions are:
```
push RBP
mov RBP, RSP
```
Similarly, when we’re done with the function, we need to restore the stack pointer to its old location, and put the old base pointer value back. This is why the last lines before a ret in a C function are often:
```
mov RSP, RBP
pop RBP
```
Other Responsibilities: If we were using registers beyond RAX, RBP, and RSP, we’d be responsible for storing some of them as callee, and some as caller. But we’re not going to go into those details for this assignment. Since we aren’t using those registers, it has no effect on our code’s behavior.

If you write any functions in main.c that you need to be available in your assembly, you need to declare them in the assembly via:

;; In your generated assembly
extern <exported name of function>

They also need to be declared in main.c such that the compiler won’t mangle their names, via

// In main.c
extern <your function signature here> asm("exported name of function");

If you forget either of these, your code will not link correctly.

3.3 New Assembly Constructs🔗

Sized: You may run into errors that report that the size of an operation is ambiguous. This could happen if you write, for example:
```
cmp [RBP-8], 0
```
because the assembler doesn’t know if the program should compare a four-byte zero, a one-byte zero, or something in between into memory starting at [RBP-8]. To solve this, you can supply a size:
```
cmp [RBP-8], QWORD 0
```
The QWORD keyword tells the assembler to use the “quad word” size for 0, which corresponds to 64 bits. A WORD corresponds to 16 bits, a BYTE corresponds to 8 bits, and a DWORD corresponds to 32 bits. To get a sized argument, you can use the Sized constructor from arg.
HexConst: Sometimes it’s nice to read things in hex notation as opposed to decimal constants. I’ve provided a new HexConst arg that’s useful for this case, and augmented arg_to_asm for you to support it. It has the same meaning as Const, and just prints out differently.
IPush, IPop: These two instructions manage values on the stack. push adds a value at the current location of RSP, and decrements RSP to point past the added value (remember, the stack grows toward numerically-smaller addresses, even though we draw it as growing upward in diagrams). pop increments RSP and moves the value at the location RSP was pointing to into the provided arg.
ICall: A call does two things:
- Pushes the next code location onto the stack (just like a push), which becomes the return pointer
- Performs an unconditional jump to the provided label
Note that call does not affect RBP, which the program must maintain on its own.
IShr, IShl, ISar: Bit shifting operations. sar (shift-arithmetic-right) preserves the sign bit of the value being shifted, while shr (shift-right) simply fills in a zero bit.
IAnd, IOr, IXor: Bitwise logical operations.
ITest: Performs a bitwise-and of the two arguments, and records whether the result was zero, signed, or overflowed. This can be used in tandem with jumps like jo or jz.
IJo, IJno: Jump to the provided label if the last arithmetic operation did/did not overflow
ILineComment and IInstrComment: these allow you to add a comment on a line by itself, or in-line after an instruction. These might be helpful for readability of your generated output, while you’re debugging.

3.4 Some software engineering considerations🔗

Our compiler is getting sufficiently sophisticated that the pipeline is several stages long already. Bugs can occur in almost any stage of the pipeline. As a result, one very useful debugging trick is to engineer the compiler to save each stage of the pipeline, and print them all out if requested. Until now, our compiler has produced a string, or thrown an exception. We have already seen the use of the ('a, 'b) result type to return either an Ok value or an Error message. In phases.ml, I’ve combined these ideas for you.

First, the file contains several constructors for naming each phase of the compiler, and it defines an 'a pipeline to be a result containing either

An 'a answer, paired with a list of phases describing how we got here; or
A list of exceptions, paired with a list of phases describing how far we’d gotten in compilation

The 'a part of the answer is whatever the most recent compiler phase produced as its result.

Next, I’ve defined a few helper functions to add a new phase onto a growing pipeline. Look carefully at the type for add_phase: it takes a function that transforms an 'a into a 'b, and an 'a pipeline, and produces a 'b pipeline as a result...and turns the 'b answer into a phase and adds it to the growing list. add_phase should be used for phases of the compiler that you don’t expect to ever fail: any exceptions that arise are internal compiler errors. By contrast add_err_phase takes a function that takes the current 'a result and produces an ('b, exn list) result, that is, it might reasonably produce multiple error messages.

To use these functions, look at the end of compile.ml:

let compile_to_string (prog : sourcespan program pipeline) : string pipeline =
  prog
  |> (add_phase well_formed check_scope)
  |> (add_phase tagged tag)
  |> (add_phase renamed rename)
  |> (add_phase anfed (fun p -> tag (anf p)))
  |> (add_phase result compile_prog)
;;

(Note: As written, all of these phases either produce an answer or raise a single exception. You may want to modify one or more of these phases to be used by add_err_phase, depending on how you report errors.)

The |> operator is reverse function application: x |> f is the exact same thing as (f x), but it allows us to write pipelines in a neatly chained form.

You are welcome to add phases to your compiler, should you wish to.

Finally, look at main.ml. If you run ./main -t yourFile.cobra, the compiler will print out the trace of the entire pipeline. If you leave out the -t option, it will print the output assembly just as before.

3.5 Testing Functions🔗

3.5.1 Unit testing parts of the compiler🔗

These are the same as they were for Boa. ANF is provided, and hasn’t changed aside from the addition of new primitives and EBool. So your tests should focus on te and t tests.

If your program exits with -10 as its exit code, it probably has segfaulted, meaning it tried to access memory that was not allocated. If you’re familiar with tools like valgrind, you can run valgrind output/some_test.run in order to get a little more feedback. This can sometimes tip you off quite well as to how memory is off, since sometimes you’ll see code trying to jump to a constant that’s in your code, or other obvious tells that there’s something off in the stack. Also, if you’ve done all your stack management correctly, valgrind will report a clean run for your program!1You may need to install the libc6-dbg package.

3.5.2 Integration testing the compiler🔗

In addition to the unit-testing functions, I have added some structure to the input/ directory, and added support for this in your test.ml file. There are now four subdirectories:

input/do_pass: should contain programs that do indeed compile through your compiler and run correctly.
input/do_err: should contain programs that do indeed produce compile or runtime errors
input/dont_pass: should contain programs that you think should work properly, but that your compiler doesn’t currently compile correctly
input/dont_err: should contain programs that you think should produce compile or runtime errors, but that your compiler compiles for some reason

To specify the intended output or error messages of these programs, read the README files in each directory that explains what files you should create.

Finally, the input_file_test_suite () in test.ml will run all the programs in your input/ directory as part of your oUnit test suite. I suspect this will be a much easier way to produce larger-scale test cases than writing everything in test.ml directly.

4 Recommended TODO List🔗

Here’s an order in which you could consider tackling the implementation:

Fix the print function in main.c so that it prints out the right output. It will need to check for the tag using C bitwise operators, and use printf or one of its variants to print the right value.
Take a first shot at figuring out how to increase the stack appropriately by using count_vars.
Fill in the EPrim1 case for everything but print, and figure out how to check for errors, and call the “non-number” error reporting function. Test as you go. Be aware that if the function call segfaults, it may be because you need to refine step 2.
Implement compiling print to assembly by pushing appropriate arguments, then calling print. Be aware that if the call doesn’t work, it may be because of step 2 again. Test as you go; be aware that you should test interesting sequences of print expressions and let-bindings to make sure your stack integrity is good before and after calls.
Fill in all of the EPrim2 cases, using the error-reporting from the last step. Test as you go.
Complete the if case and test as you go.

5 Running main🔗

Running your own programs is the same as with Boa, except you’ll give them the .cobra file extension.

You can also use the -t command-line flag (described above) to print out a trace of the compilation process.

6 List of Deliverables🔗

your compile.ml
any additional modules you saw fit to write
tests in an OUnit test module (test.ml)
any test input programs (input/*.cobra files or input/*/* subdirectories)
any additional files that are needed to build your tests (e.g. main.c, etc)

Again, please ensure the makefile builds your code properly. The black-box tests will give you an automatic 0 if they cannot compile your code!

DO NOT SUBMIT YOUR .git DIRECTORY! For that matter, don’t submit your output or _build directories.

7 Grading Standards🔗

For this assignment, you will be graded on

Whether your code implements the specification (functional correctness),
the clarity and cleanliness of your code, and
the comprehensiveness of your test coverage

8 Submission🔗

Wait! Please read the assignment again and verify that you have not forgotten anything!

Please submit your homework to https://handins.khoury.northeastern.edu/ by the above deadline.

1You may need to install the libc6-dbg package.

1	The Cobra Language
2	Examples
3	Implementation strategies
4	Recommended TODO List
5	Running main
6	List of Deliverables
7	Grading Standards
8	Submission