Writing tests is an essential part of design and implementation. The most important skill in writing tests is to determine what to test, and then determine how to test.
Testing a compiler is particularly nuanced: if you find what looks to be a bug, how can you localize it? The problem could be any of the following:
You have misunderstood the semantics of the language you’re trying to compile, and so have generated incorrect code.
You have misunderstood the semantics of the language in which you’re writing the compiler, and so it isn’t doing what you expect.
You have misunderstood the semantics of the target language to which you’re compiling, and so the generated code isn’t doing what you expect.
You have correctly understood all the languages involved, but just made a small mistake.
The program being compiled is itself buggy.
One of the phases of the compiler is buggy.
Several phases of the compiler are buggy, and conspire to usually work correctly anyway.
You’ve forgotten some invariant about some aspect of your codebase, and therefore violated it.
You’re compiling multiple files, and they weren’t all compiled with the same version of your compiler.
Programming under this level of uncertainty is like fighting quicksand: The more you struggle and the more things you change, the less likely it is that you’ll figure out the underlying problem and get unstuck. So what to do?
The one thing you can absolutely rely upon is that if your OCaml code compiles, then it is type-correct: you will never misuse a value of one type as if it were of some other type. This means that if you can encode important invariants about your program in the types, then the OCaml compiler itself will enforce that they are upheld. So before you dive into hacking away, consider the signatures of your functions very carefully.
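As a minimal sketch of encoding an invariant in the types, suppose (hypothetically) that a compiler manipulates both stack-slot indices and immediate constants, which are both plain integers underneath. The type names `stack_slot` and `immediate`, the two functions, and the 8-byte word size are illustrative assumptions, not part of any particular codebase.

```ocaml
(* Hypothetical wrappers: both carry an int, but OCaml treats them as
   unrelated types, so one cannot be passed where the other is expected. *)
type stack_slot = Slot of int
type immediate = Imm of int

(* Byte offset of a stack slot, assuming 8-byte words (an assumption). *)
let slot_offset (Slot n) : int = -8 * n

(* Adding two constants yields another constant. *)
let add_imm (Imm a) (Imm b) : immediate = Imm (a + b)

(* The following definition does not type-check, so the mistake is
   caught before the program ever runs:
     let bad = slot_offset (Imm 3)                                  *)
```

With bare `int`s in both places, the same mistake would compile silently and only show up, if at all, as wrong generated code.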
Imagine you have some type signature Typ1 -> Typ2 -> Typ3, and you need to provide a function of that type. Additionally, you have some specification of what that function is supposed to compute: you must check that your concrete implementation works as specified. However, thinking about tests only after you have completed the implementation is not ideal. Since you have already written your implementation, you will likely come up with tests that you already know will pass, rather than tests that should pass. Here are some recommendations on how to come up with effective tests:
Follow this workflow: write signatures, then write an empty implementation, then write test cases. Writing an empty implementation (all the functions are present, but essentially empty) ensures that referring to the implementation in your test cases does not produce compile errors. Fill in the implementation after writing your test cases.
Convince yourself that the code to be tested cannot be trusted, and it’s up to you to find any mistakes. Often a role reversal helps: imagine the instructor was writing code for the homework, and you get credit for finding mistakes in the instructor’s code! Be creative: where might the gotchas be in the design, and how might someone else misunderstand or mis-implement the design?
Look at each function in isolation. Think about what behavior you expect when all inputs are correct and as expected (if you wrote the interface be sure to document its intended behavior when you are writing it!). Remember that a test passes if the expected behavior is the same as the actual behavior.
Look at each function in isolation. Think about every possibility of passing correct and incorrect parameters, and figure out how to check for whether the function behaves as expected in each situation.
Look at each function in isolation. Carefully read about its objectives, including exceptional cases. Think about how you will verify that an implementation of that function actually fulfills these objectives. How will you reproduce exceptional cases so that you can test them?
Now think about the sequences in which various functions might be called. Is there a “correct” sequence of calling them? What happens when they are called “out of sequence,” and what should happen?
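The recommendations above can be sketched as follows, using only the standard library. Everything named here is hypothetical: the `ENV` signature, its `lookup` function, the `Stub` and `Real` modules, and the tiny test runner are illustrative stand-ins rather than any prescribed framework. The tests are written against the signature, so they compile against the empty implementation and can be re-run unchanged once the real one exists.

```ocaml
(* Step 1: write the signature, documenting expected and exceptional behavior. *)
module type ENV = sig
  (* [lookup name env] returns the most recent binding of [name],
     and raises Not_found if [name] is unbound. *)
  val lookup : string -> (string * int) list -> int
end

(* Step 2: an empty implementation, so that the tests compile. *)
module Stub : ENV = struct
  let lookup _ _ = failwith "lookup: not yet implemented"
end

(* Step 3: tests written against the signature only. *)
module Tests (M : ENV) = struct
  let cases = [
    ("bound name", fun () -> M.lookup "x" [("x", 1)] = 1);
    ("shadowing", fun () -> M.lookup "x" [("x", 2); ("x", 1)] = 2);
    ("unbound name raises Not_found", fun () ->
      (match M.lookup "y" [("x", 1)] with
       | _ -> false
       | exception Not_found -> true));
  ]

  (* Names of the tests that fail; an unexpected exception is a failure. *)
  let failures () =
    List.filter_map
      (fun (name, check) ->
        match check () with
        | true -> None
        | false -> Some name
        | exception _ -> Some name)
      cases
end

(* Step 4: fill in the implementation afterwards. *)
module Real : ENV = struct
  let lookup name env = List.assoc name env
end

module StubTests = Tests (Stub)
module RealTests = Tests (Real)
```

Under these assumptions, `StubTests.failures ()` lists all three test names, while `RealTests.failures ()` is empty.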
Remember that the best option is to catch incorrect uses of your code at compile time, so that client code that uses your code incorrectly cannot even be compiled. The next best option is to flag incorrect uses at run time (e.g. using exceptions), so that client code produces an error when it runs because it used your code incorrectly. The unacceptable option is to hope that everybody will read your documentation (or worse, your code), understand it, and use it accordingly, and therefore to convince yourself there’s no need to actually put checks in your code. Remember that if the documentation is vague, somebody will misinterpret it; if the functionality does not account for some specific scenario, somebody will produce it. Better that your tests find the problem than your users.
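Here is a small sketch of the first two options side by side. The `Pipeline` module, its `checked` type, and the `Unbound_variable` exception are hypothetical names chosen for illustration. Because `checked` is abstract, the only way to obtain a value to pass to `emit` is to call `check` first, so calling them out of sequence is a compile-time error. The scope check itself depends on run-time data, so it is flagged with an exception.

```ocaml
module Pipeline : sig
  type checked  (* abstract: only [check] can build one *)
  exception Unbound_variable of string

  (* [check scope var] validates that [var] is in [scope]. *)
  val check : string list -> string -> checked

  (* [emit v] generates (toy) code for an already-checked variable. *)
  val emit : checked -> string
end = struct
  type checked = string
  exception Unbound_variable of string

  (* Run-time enforcement: reject a variable that is not in scope. *)
  let check scope var =
    if List.mem var scope then var else raise (Unbound_variable var)

  let emit var = "load " ^ var
end

(* Correct sequence: check, then emit. *)
let ok = Pipeline.emit (Pipeline.check ["x"] "x")

(* Skipping the check does not type-check, because a string is not a
   Pipeline.checked:
     let bad = Pipeline.emit "x"                                     *)
```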
Writing tests before writing your implementation will give you insight into what your implementation ought to do. Moreover, it will help you work through the types that you have versus the ones you may want, and often just understanding that structure is a big help in understanding the problem.
Unfortunately, you often can’t write a complete set of tests for your code before you’ve started writing it, as the process of implementing a design can easily bring issues to the forefront that you didn’t notice or anticipate. Proper testing is an iterative process: starting from initial examples you create an initial implementation, which might suggest additional tests, which might cause test failures, which you need to fix, which might suggest additional tests, etc. A successful set of test cases is one that tests whether your implementation adheres to your design, whether your design leaves loopholes and ambiguities that allow its incorrect usage, and whether the behavior of your implementation can be predicted in all situations. This set of test cases should compile, and upon running, should pass.
NOTE: It is far better to include tests that you know to fail, rather than comment them out or delete them. Leave a (* FIXME *) comment next to the failing tests, explaining what you intended the test to check for, and why you think it’s currently failing. At some point you clearly had a reason for writing the test case, and it would be a shame to lose that insight by deleting the test! Equally bad is commenting the test out, since it gives the misleading impression that everything is fine and all tests pass, when there are known problems remaining...
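As an illustration of the note above, here is a sketch of a small test list that keeps a known-failing test in place. The `average` function and the test names are hypothetical; the point is only the shape of the FIXME comment.

```ocaml
(* Hypothetical function under test: integer average of two ints. *)
let average a b = (a + b) / 2

let tests = [
  ("average of 2 and 4", fun () -> average 2 4 = 3);
  ("average rounds toward zero", fun () -> average 1 2 = 1);
  (* FIXME: intended to check that very large inputs are handled.
     Currently fails, probably because a + b wraps around before
     the division happens. *)
  ("average of max_int and max_int",
   fun () -> average max_int max_int = max_int);
]

(* Names of the tests that do not pass. *)
let failing =
  List.filter_map
    (fun (name, check) -> if check () then None else Some name)
    tests
```

The failing test stays visible in `failing` every time the suite runs, which is exactly the reminder that a commented-out test would hide.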
There are many kinds of tests you may wish to write:
Unit tests are the style of tests we’ve written all along: they test the smallest components of your program (individual functions, classes, or interfaces, for example) and confirm that they work as expected. Unit tests are useful for confirming that edge cases are properly handled, that algorithms seem to work as expected on their inputs, etc.
Regression tests are the kinds of tests you always regret not having written sooner. They are written as soon as you notice a bug in your code and fix it: their purpose is to ensure that the bug can never creep back into your program inadvertently. Regression tests are especially useful for compilers, since there are often so many interacting parts that it is easy to reintroduce bugs that might have been fixed before. Write regression tests even for the simplest of bugs: if you were inattentive enough to make that mistake once, you could make it again, and so could your colleagues. Let them, and your future self, benefit from noticing the bug now!
Integration tests test larger units of functionality, or even entire libraries at a time. They are trickier to write, because their inputs are usually larger and more structured: for instance, testing that a sequence of user inputs produces the correct sequence of outputs. For a compiler, such a test might send an entire program through the compiler and check that its final behavior is as expected.
Randomized or “fuzz” tests are designed to rapidly explore a wider space of potential inputs than can easily be written manually. Typically these tests require writing the code to be tested (obviously!), the code to randomly generate inputs, and either a secondary implementation of the code being tested or a predicate that can confirm the proper operation of that code. These latter two are known as oracles, because they never make mistakes, but you have to interpret their results carefully. Fuzz testing is fantastic for checking (for example) the robustness of the error handling of your program, to see whether it holds up without crashing even under truly odd inputs. Likewise, it’s particularly good at generating wacky-but-syntactically-valid input programs to test your compiler. (Fuzz testing with malicious intent is one of the tools hackers use to exploit weaknesses in systems.)
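To make the three ingredients of a randomized test concrete, here is a minimal sketch over a toy expression language. All of it is assumed for illustration: the `expr` and `instr` types, the stack-machine `compile` and `run` (the code under test), the direct interpreter `eval` (the oracle, playing the role of the secondary implementation), and the generator `gen_expr`. A fixed seed is used so that any failure can be reproduced.

```ocaml
type expr = Num of int | Add of expr * expr | Mul of expr * expr
type instr = Push of int | IAdd | IMul

(* Oracle: a direct interpreter for expressions. *)
let rec eval = function
  | Num n -> n
  | Add (a, b) -> eval a + eval b
  | Mul (a, b) -> eval a * eval b

(* Code under test: compile to a stack machine, then run it. *)
let rec compile = function
  | Num n -> [Push n]
  | Add (a, b) -> compile a @ compile b @ [IAdd]
  | Mul (a, b) -> compile a @ compile b @ [IMul]

let run instrs =
  let step stack instr =
    match instr, stack with
    | Push n, _ -> n :: stack
    | IAdd, b :: a :: rest -> (a + b) :: rest
    | IMul, b :: a :: rest -> (a * b) :: rest
    | _ -> failwith "stack underflow"
  in
  match List.fold_left step [] instrs with
  | [result] -> result
  | _ -> failwith "malformed program"

(* Random input generator, bounded by depth so that it terminates. *)
let rec gen_expr st depth =
  if depth = 0 || Random.State.int st 3 = 0 then
    Num (Random.State.int st 100 - 50)
  else
    let a = gen_expr st (depth - 1) in
    let b = gen_expr st (depth - 1) in
    if Random.State.bool st then Add (a, b) else Mul (a, b)

(* Returns the first counterexample found, or None if every trial agrees. *)
let fuzz ~seed ~trials =
  let st = Random.State.make [| seed |] in
  let rec loop i =
    if i >= trials then None
    else
      let e = gen_expr st 4 in
      if run (compile e) = eval e then loop (i + 1) else Some e
  in
  loop 0
```

Returning the offending expression, rather than just `false`, gives a concrete input that can be turned into a regression test once the bug is fixed.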
At the moment, we aren’t exploiting OCaml’s module system too heavily: every file is a module, and there are no subdirectories or Java-style packages to worry about. You can even place all your tests within a single file. Still, it’s worth breaking the tests out into separately named test suites, so that their purpose and organization are more readily apparent.
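One possible way to keep separately named suites within a single file is sketched below, using only the standard library. The `sum` function, the suite names, and the `run_suite` helper are all hypothetical, and the regression test refers to an imagined earlier bug.

```ocaml
(* Hypothetical function under test. *)
let rec sum = function
  | [] -> 0
  | x :: xs -> x + sum xs

type test = string * (unit -> bool)

(* Unit tests: basic expected behavior and edge cases. *)
let unit_suite : test list = [
  ("sum of empty list", fun () -> sum [] = 0);
  ("sum of several numbers", fun () -> sum [1; 2; 3] = 6);
]

(* Regression tests: one per bug that was found and fixed
   (this one guards against an imagined earlier mistake). *)
let regression_suite : test list = [
  ("negative numbers are summed, not skipped", fun () -> sum [-1; 1] = 0);
]

(* Run one named suite, prefixing each failure with the suite name. *)
let run_suite (suite_name, tests) =
  List.filter_map
    (fun (name, check) ->
      let ok = try check () with _ -> false in
      if ok then None else Some (suite_name ^ ": " ^ name))
    tests

let all_failures =
  List.concat_map run_suite
    [("unit", unit_suite); ("regression", regression_suite)]
```

Prefixing each failure with its suite name makes it immediately clear whether a break is in basic behavior or is an old bug creeping back.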