Lecture 7: Defining functions
Last time we built up the infrastructure for calling functions in a manner that’s compatible with the C calling convention, so that we could interact with functions defined in our main.c runtime. This prompts the obvious generalization: can we expand our source language to include function calls too? Could we expand it further to define our own functions?
1 Warmup: a new primitive
Exercise
Enhance your compiler with a new unary primitive,
This is not particularly general-purpose, but it allows us to focus on providing a new runtime function and using it from within our language, rather than by needing to create additional syntax just yet.
We’ll need to define a function in main.c, certainly:
int print(int val) {
// DUMMY CODE
printf("%010x\n", val);
return val;
}
We should enhance this code to properly print all of our runtime values, rather than just show us the raw bytes.
Do Now!
Refactor main.c so that the existing code for printing the final value (as used in
main()
) is used here too.
We’ll need to expose this code to our generated assembly: add a line to our program’s prologue that declares
extern print
Finally, we need to generate code for the new Print
prim1
operator.
This should reuse the same function-call mechanisms that we developed last time
for reporting errors.
2 A general-purpose call expression
Of course, we can’t add new hard-coded primitive names to our language every
single time we want to define a new function. Moreover, if we want to define
functions of more than one argument, we can’t just use prim2
(whose syntax
is all infix operators) or prim3
(which doesn’t exist)...
Exercise
Extend the source language with a new expression form for function calls
Give examples of functions we’d like to provide, and examples of their use
Extend the abstract syntax and its semantics
Extend our transformations
Test the new expressions
2.1 Concrete syntax
We’ll start with concrete syntax: a function call starts with a function name
and takes zero or more comma-separated expressions as arguments. There is no
whitespace permitted between the function name and the left-parenthesis.1There’s
a fiddly syntactic nuisance to deal with here. Our language already accepts
identifiers as expressions by themselves. We might also want to add parenthesized
expressions for clarity’s sake. Once we do that, though, it becomes difficult
to tell whether f(1 + 2)
is a function call, or an identifier followed by
a parenthesized expression. While it might seem obvious to us, by taking a
holistic look at the language we currently have, that we can never have two
consecutive expressions, it’s not at all obvious to the parser. In practice
there are a number of workarounds for this: the one we choose for now is to say
that whitespace is prohibited, and that when no whitespace is present we try to
parse the code as a function call. When we talk about parsing techniques later
in the course, we’ll talk about other ways we might resolve this ambiguity.
‹expr› ... IDENTIFIER ( ‹exprs› ) IDENTIFIER ( ) ‹exprs› ‹expr›‹expr› , ‹exprs›
2.2 Examples
For our examples, let’s design max
, which takes two numeric arguments and
returns the larger of the two.
Do Now!
Define the
max
function in main.c.
It’s pretty easy to define the max
function, assuming the arguments are
indeed numbers! But we can’t make that assumption in our runtime: it would be
a shame for max(5, false)
to return false
. Instead, we’ll need to
include logic in our runtime implementation of max
that mimics the
tag-checking emitted by our compiler.
We’d expect the following examples to work:
Source |
| Output |
|
|
|
|
|
|
|
|
|
2.3 Abstract syntax and Semantics
Do Now!
What should the semantics of
f(e1, e2, ..., en)
be? How should we represent this in our'a expr
data definition? What knock-on effects does it have for the transformation passes of our compiler?
The first thing we might notice is that attempting to call an unknown function
should be prohibited —
Do Now!
What other programmer mistakes should we try to catch with well-formedness checking? What new mistakes are possible with function calls?
What should happen if a programmer tries to call max(1)
or max(1, 2,
3)
? Certainly nothing good can happen at runtime if we allowed this to
occur. Fortunately, we can track enough information to prevent this at
well-formedness time, too. Our function environment should keep track of known
function names and their arities. Then we can check every function call
expression and see whether it contains the correct number of actual arguments.
We need more examples:
Source |
| Output |
|
|
|
|
|
|
|
|
|
To represent call expressions in our AST, we just need to keep track of the function name, the argument list, and any tag information:
type 'a expr = ...
| ECall of string * 'a expr list * 'a
Do Now!
Extend
tag
anduntag
to support this new construction.
We need to consider how our expression evaluates, which in turn means considering how it should normalize under ANF.
Do Now!
What are the design choices here?
Since ECall
expressions are compound, containing multiple subexpressions,
they probably should normalize similar to how we normalize prim2
expressions: the arguments should all be made immediate.
let rec is_anf (e : 'a expr) : bool =
match e with
...
| ECall(f, args, _) -> List.for_all is_imm args
We have at least two possible designs here, for how to normalize these expressions: we can choose a left-to-right or right-to-left evaluation order for the arguments. For consistency with infix operators, we’ll choose a left-to-right ordering.
Do Now!
What tiny example program, using only the expressions we have so far, would demonstrate the difference between these two orderings?
Do Now!
Define the ANF transformation for
ECall
.
2.4 Making the call
Once we’ve confirmed our program is well-formed, and subsequently ANFed it,
what information do we need to retain in our compiler in order to finish the
compilation? Do we still need the function environment? Not really! Assuming
that the function name is the same as label name that we call
, we don’t
need anything else but that name and the immediate arguments of the call.
After that, we output the same calling code as when implementing Print
above.
Exercise
Redefine
prim1
operator.
3 Defining our own functions
Being able to call other functions is just one side of the story; being able to define our own is the other half. We’ve already discussed the responsibilities of a function body in order to be a responsible participant in the C call stack. Now we merely need to “do the same thing” for our own functions.
Exercise
Extend the source language with syntax for function definitions
Give examples of functions we’d like to provide, and examples of their use
Extend the abstract syntax and its semantics
Extend our transformations
Test the new expressions
This time, we need to really change our syntactic structure. Our programs can’t just be expressions any longer: we need some notion of function definitions, too.
‹program› ‹decls› ‹expr› ‹expr› ‹decls› ‹decl› ‹decl› ‹decls› ‹decl› def IDENTIFIER ( ‹ids› ) : ‹expr› def IDENTIFIER ( ) : ‹expr› ‹ids› IDENTIFIER IDENTIFIER , ‹ids› ‹expr› ...
A ‹program› is now a list of zero or more function declarations, followed by a single expression that is the main result of the program.
As an example:
def sum3(x, y, z):
x + y + x
sum3(4, 5, 6)
This program should evaluate to 15.
Our AST representation now grows to match:
type 'a decl =
(* function name, argument names, body, tag *)
| DFun of string * string list * 'a expr * 'a
type 'a program =
| Program of 'a decl list * 'a expr
Do Now!
What new semantic concerns do we have with providing our own definitions?
As soon as we introduce a new form of definition into our language, we need to consider scoping concerns. One possibility is to declare that earlier definitions can be used by later ones, but not vice versa. This possibility is relatively easy to implement, but restrictive: it prevents us from having mutually-recursive functions. Fortunately, because all our functions are statically defined, supporting mutual recursion is not all that difficult; the only complication is getting the well-formedness checks to work out correctly.
Exercise
Do so.
Additionally, the bodies of function definitions need to consider scope as well.
def sum3(x, y, z):
a + b + c
x + 5
This program refers to names that are not in scope: a
, b
and c
are not in scope within sum3
, and x
is not in scope outside of it.
def f(x): x
def f(y): y
f(3)
Repeatedly defining functions of the same name should be problematic: which function is intended to be called?
def f(x, x): x
f(3, 4)
Having multiple arguments of the same name should be problematic: which argument should be returned here?
Do Now!
What should we do with the following program, assuming our work above on builtin functions is completed?
def max(x, y): if x < y: y else: x true
We need to enhance our notion of function declarations, to include the builtin ones:
type 'a decl =
| DFun of string * string list * 'a expr * 'a
(* function name, arity *)
| DBuiltin of string * int
Exercise
Complete the remaining stages of the pipeline: enhance ANF to work over programs; generate valid assembly output for each of our functions using the calling convention we discussed last time; and write test programs to confirm that the scoping of functions works properly. Can you support recursion? Mutual recursion?
4 Testing
Now that we have functions and builtins, especially ones that can produce output, we’re gaining the ability to write non-trivial test programs. At this point, it starts becoming more useful to write integration tests, namely entire programs in our language that we run through the entire compiler pipeline and execute. Unit tests still have their place: it’s very easy to make a tiny mistake somewhere in the compiler and produce bizarre or inexplicable output. Narrowing down the cause of the error is tricky, and requires careful attention to each stage of our pipeline.
Additionally, now that we’re manipulating the stack in earnest, we should be particularly careful that we conform to the calling convention. Valgrind is a tool that’s designed to help check such issues. Once you’ve compiled output/foo.run to produce an executable, executing valgrind output/foo.run will run your program within a sandboxed environment that can check for common mistakes in the calling convention. A clean valgrind run will report no errors. Interpreting valgrind errors can be tricky, but a useful strategy (as always) is to minimize the input program until there’s hardly anything left, and removing anything else seems to make the problem disappear. At that point, start diving into the compiler phases that influence that output, and write unit tests for them.
1There’s
a fiddly syntactic nuisance to deal with here. Our language already accepts
identifiers as expressions by themselves. We might also want to add parenthesized
expressions for clarity’s sake. Once we do that, though, it becomes difficult
to tell whether f(1 + 2)
is a function call, or an identifier followed by
a parenthesized expression. While it might seem obvious to us, by taking a
holistic look at the language we currently have, that we can never have two
consecutive expressions, it’s not at all obvious to the parser. In practice
there are a number of workarounds for this: the one we choose for now is to say
that whitespace is prohibited, and that when no whitespace is present we try to
parse the code as a function call. When we talk about parsing techniques later
in the course, we’ll talk about other ways we might resolve this ambiguity.