
Examplar

Motivation

In a typical homework scenario, students would write their implementation, and write test cases to demonstrate that it works. The implementations would be checked against instructor-written tests — which can cause frustration when the student’s code matches the student’s own tests, but does not match the instructor’s tests. Why might this occur?

Any design effort, no matter what discipline, requires both understanding the requirements of the problem domain and then subsequently implementing those requirements correctly. Instructor-written tests only check whether the student implementation is correct, but do not provide an automated way to check whether the student’s understanding of the requirements is correct.

Remember: even if both of them are encoded as checkExpect calls inside testSomething methods, examples build intuition and understanding about a problem; test cases build confidence that an implementation matches its requirements.

Examplar is a system, initially designed by researchers at Brown University, that enables students to write examples that demonstrate common behaviors and common misunderstandings of the problem domain. Concretely, the professors will write many implementations of the problem: some of them are correct, which we will call wheats, while some of them are buggy, which we will call chaffs. Your goal is to write a set of examples that separate the wheats from the chaffs, as follows: Each example must be correct, in that the example must pass on all the wheat implementations. And collectively, all the examples must be thorough, such that every chaff causes at least one test to fail.

Each chaff in this course will contain exactly one plausible bug; none will contain multiple mistakes. The chaffs are designed to highlight common misconceptions about the designs you are working on; they’re not designed to be cryptic. (There are no chaffs that can only be triggered by guessing some intricate Konami cheat code!)
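To make the wheat/chaff distinction concrete, here is a minimal sketch. Everything in it is hypothetical and invented for illustration: the IClamp interface, both implementations, and the plain boolean checks that stand in for checkExpect calls. It is not taken from any actual assignment.

```java
// Hypothetical interface, invented for illustration only.
interface IClamp {
  // Restricts value to the range [lo, hi].
  int clamp(int value, int lo, int hi);
}

// A wheat: a correct implementation.
class ClampWheat implements IClamp {
  public int clamp(int value, int lo, int hi) {
    return Math.max(lo, Math.min(hi, value));
  }
}

// A chaff: exactly one plausible bug (it forgets the upper bound).
class ClampChaff implements IClamp {
  public int clamp(int value, int lo, int hi) {
    return Math.max(lo, value);
  }
}

class ClampExamples {
  // Passes on both implementations: correct, but it does not catch this chaff.
  static boolean exampleInRange(IClamp c) {
    return c.clamp(5, 0, 10) == 5;
  }

  // Passes on the wheat and fails on the chaff, so it separates the two.
  static boolean exampleAboveRange(IClamp c) {
    return c.clamp(15, 0, 10) == 10;
  }

  public static void main(String[] args) {
    IClamp wheat = new ClampWheat();
    IClamp chaff = new ClampChaff();
    System.out.println("in range,    wheat passes: " + exampleInRange(wheat));
    System.out.println("in range,    chaff passes: " + exampleInRange(chaff));
    System.out.println("above range, wheat passes: " + exampleAboveRange(wheat));
    System.out.println("above range, chaff passes: " + exampleAboveRange(chaff));
  }
}
```

Both examples are correct, since the wheat passes each of them. Only the second one contributes to thoroughness here, because it is the only one the chaff fails.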

Once you’ve developed a suite of examples, it should then take minimal effort for you to reuse those examples to guide your own implementation as you develop it. As we explain below, you’ll be able to use your Examplar work completely unchanged. (Note that your implementation still deserves many more tests than the examples you submit for Examplar! Nevertheless the examples will give you a solid start.)

Mechanics

We will provide you with an interface for you to code against, and a stub implementation of that interface. The stub will simply throw exceptions in all of its methods; it does not have any interesting behavior of its own. It is only there to ensure that your code compiles successfully.

You will write your examples as follows:

// This interface and class name will be given to you in the starter code
import the.provided.package.IFoo;
import the.provided.package.FooImplementation;
 
class ExamplarTestsClassName { // class name will be specified by the assignment
  boolean testYourScenarioHere(Tester t) {
    IFoo foo = new FooImplementation(...);
    return t.checkExpect(...);
  }
  ...
}

When Examplar runs, it will replace the stub implementation with each of the wheats and chaffs in turn, and run your test methods against each such implementation.
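The loop described above can be pictured roughly as follows. This is only a sketch of the idea, not how Examplar is actually implemented: the "absolute value" implementations, the test names, and the use of lambdas in place of real classes and the tester library are all assumptions made to keep the example self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;
import java.util.function.Predicate;

class ExamplarRunSketch {
  // Hypothetical implementations of "absolute value": one wheat, two chaffs.
  static Map<String, IntUnaryOperator> implementations() {
    Map<String, IntUnaryOperator> impls = new LinkedHashMap<>();
    impls.put("Wheat #0", x -> Math.abs(x));
    impls.put("Chaff #0", x -> x);                        // bug: ignores the sign
    impls.put("Chaff #1", x -> x == 0 ? 1 : Math.abs(x)); // bug: wrong at zero
    return impls;
  }

  // Hypothetical test methods; each returns true when the example passes.
  static Map<String, Predicate<IntUnaryOperator>> tests() {
    Map<String, Predicate<IntUnaryOperator>> tests = new LinkedHashMap<>();
    tests.put("testPositive", f -> f.applyAsInt(3) == 3);
    tests.put("testNegative", f -> f.applyAsInt(-3) == 3);
    tests.put("testZero", f -> f.applyAsInt(0) == 0);
    return tests;
  }

  // Runs every test method against every implementation in turn.
  // The entry for (implementation, test) is true when that test FAILED.
  static Map<String, Map<String, Boolean>> run() {
    Map<String, Map<String, Boolean>> failed = new LinkedHashMap<>();
    for (Map.Entry<String, IntUnaryOperator> impl : implementations().entrySet()) {
      Map<String, Boolean> row = new LinkedHashMap<>();
      for (Map.Entry<String, Predicate<IntUnaryOperator>> test : tests().entrySet()) {
        row.put(test.getKey(), !test.getValue().test(impl.getValue()));
      }
      failed.put(impl.getKey(), row);
    }
    return failed;
  }

  public static void main(String[] args) {
    run().forEach((impl, row) -> System.out.println(impl + " -> " + row));
  }
}
```

In this sketch the wheat fails no test, Chaff #0 is caught only by testNegative, and Chaff #1 is caught only by testZero. The key point is that your test methods never change; only the implementation behind the interface does.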

When you submit your work, submit only your ExamplarTestsClassName.java file — do not include the starter files we gave you.

Once you’ve finished an Examplar assignment, your examples can be used completely unchanged to then test your own implementation: you’ll simply remove the import lines at the top, so that you’re testing your own code instead of the stubs. This helps ensure that before you’ve spent any time trying to test incorrect code, you already have a good baseline of correct examples that can definitely help catch incorrect code.

Grading

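An Examplar submission will be evaluated on four criteria:

Correctness: how many wheats do not cause any test methods to fail?
Thoroughness: how many chaffs cause at least one of the test methods to fail?
Precision: have you written enough test methods to distinguish among all the chaffs?
Usefulness: have you done so with as few test methods as possible?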

The phrasing of the options above might seem a bit weird. Why say “how many wheats do not cause any test methods to fail,” rather than “how many wheats pass all the examples”? Likewise, why do we say that chaffs should each “cause at least one of the test methods to fail”? Put simply: because not every test method tests every behavior simultaneously. In an object-oriented setting, when you are testing an interface with multiple methods, it’s entirely possible that a chaff might have a bug in one method, while some test might happen to only call a different method. In that scenario, it’s not accurate to say that your test “passed” or “failed” – it’s just not relevant to that chaff.

(In the extreme case, consider the following two lousy tests:

boolean testThatAcceptsEverything(Tester t) {
  return t.checkExpect(true, true);
}
 
boolean testThatRejectsEverything(Tester t) {
  return t.checkExpect(true, false);
}

It doesn’t seem particularly fair to say that the testThatAcceptsEverything method “passed” on a wheat, since it didn’t test anything about that wheat. And the testThatRejectsEverything method failed both wheats and chaffs, so it’s simply not a correct example at all.)
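The two phrasings quoted above ("do not cause any test methods to fail" and "cause at least one of the test methods to fail") translate directly into counting over a table of failures. The sketch below shows only that counting; the class name, method names, and sample data are hypothetical, and it makes no claim about how the grader converts these counts into scores.

```java
class CriteriaSketch {
  // In both tables, failed[i][j] is true when test method j
  // failed on implementation i.

  // Counts the wheats that do not cause any test method to fail.
  static int wheatsWithNoFailures(boolean[][] failedOnWheats) {
    int count = 0;
    for (boolean[] row : failedOnWheats) {
      if (!anyTrue(row)) {
        count++;
      }
    }
    return count;
  }

  // Counts the chaffs that cause at least one test method to fail.
  static int chaffsWithSomeFailure(boolean[][] failedOnChaffs) {
    int count = 0;
    for (boolean[] row : failedOnChaffs) {
      if (anyTrue(row)) {
        count++;
      }
    }
    return count;
  }

  // Returns true when at least one entry in the row is true.
  static boolean anyTrue(boolean[] row) {
    for (boolean b : row) {
      if (b) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Hypothetical data: 2 wheats, 3 chaffs, 2 test methods.
    boolean[][] wheats = { { false, false }, { false, true } };
    boolean[][] chaffs = { { true, false }, { false, false }, { true, true } };
    System.out.println(wheatsWithNoFailures(wheats) + " of " + wheats.length
        + " wheats cause no failures");
    System.out.println(chaffsWithSomeFailure(chaffs) + " of " + chaffs.length
        + " chaffs cause at least one failure");
  }
}
```

Seen this way, a test that accepts everything is a column of all false entries, which never helps catch a chaff, while a test that rejects everything is a column of all true entries, which makes every wheat count as rejected.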

These criteria are definitely in tension with one another: usefulness wants you to write as few test methods as possible; precision wants you to write enough test methods to distinguish among all the chaffs. Correctness is trivial if you write no tests; thoroughness is trivial if you reject everything. Balancing these four requirements takes some practice!

Correctness and thoroughness are the most important attributes we will focus on this semester. Accordingly, they will each be worth 40% of the grade for an Examplar submission. Precision and usefulness are trickier to get right, so they will each be worth 10% of the grade for an Examplar submission. (On every assignment we give you, it will be possible to get 100% on all four attributes, but you may decide that it is not worth your time to hunt down the last few bits of usefulness or precision.)
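As a worked example of the weighting, the sketch below combines four per-criterion percentages using the 40/40/10/10 split. It assumes each criterion is scored from 0 to 100 and that the weights are combined linearly; the class and method names are hypothetical.

```java
class ExamplarGradeSketch {
  // Each argument is the percentage (0 to 100) earned on that criterion.
  // Assumes a simple weighted sum: 40% + 40% + 10% + 10%.
  static double overallPercent(int correctness, int thoroughness,
                               int precision, int usefulness) {
    return (40 * correctness + 40 * thoroughness
        + 10 * precision + 10 * usefulness) / 100.0;
  }

  public static void main(String[] args) {
    // Full correctness, 80% thoroughness, half credit on the other two.
    System.out.println(overallPercent(100, 80, 50, 50));
    // Full correctness and thoroughness, nothing on the other two.
    System.out.println(overallPercent(100, 100, 0, 0));
  }
}
```

Under these assumptions the first submission earns 40 + 32 + 5 + 5 = 82 percent, and the second earns 80 percent, which shows why correctness and thoroughness are worth getting right first.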

What Examplar grading looks like

Our implementation of Examplar is integrated into Handins, and there will be dedicated homeworks for Examplar submissions. When you look at the grading feedback on Examplar submissions, each of the wheats and chaffs will be anonymized: Wheat #0, Chaff #12, etc. The numbers are stable: e.g., if you submit multiple times, then Chaff #12 will be the same chaff in every submission.

The feedback you see on Handins will consist of five boxes. The first four boxes correspond to the grading criteria above. They are color-coded: red means no credit on that criterion; yellow means partial credit; and green means full credit. Within each box, you’ll see a “progress bar” indicating how much partial credit you earned.

The grading box for Correctness will show you the (anonymized) names of the wheats, and which of your test methods (if any) rejected them. Your goal is to reject none of them.

The Thoroughness, Precision and Usefulness boxes will not show you any names; they’ll just show you the score you earned. All three of them rely on the same underlying data, which will be shown below.

After the four grading boxes, you’ll see a test matrix box in blue. The rows of this box correspond to each of our chaffs, and the columns correspond to each of your test methods. A marked cell indicates that a particular test method failed on a particular chaff.

When there are lots of chaffs and lots of test methods, it can be tricky to immediately see which ones are unique or not. To help with this, the table is interactive: you can click on a row (or column) name to highlight all rows (or columns) with the same markings as that one.
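If it helps to see what "the same markings" means, the sketch below groups test methods whose columns in the matrix are identical. It is an illustration only, not the code behind the Handins table; the class name, the test names, and the sample matrix are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MatrixGroupingSketch {
  // marked[r][c] is true when test method c failed on chaff r.
  // Groups test names by their full column of markings, written as a
  // string with 'X' for a marked cell and '.' for an unmarked one.
  static Map<String, List<String>> groupColumns(boolean[][] marked, String[] testNames) {
    Map<String, List<String>> groups = new LinkedHashMap<>();
    for (int c = 0; c < testNames.length; c++) {
      StringBuilder pattern = new StringBuilder();
      for (boolean[] row : marked) {
        pattern.append(row[c] ? 'X' : '.');
      }
      groups.computeIfAbsent(pattern.toString(), k -> new ArrayList<>())
          .add(testNames[c]);
    }
    return groups;
  }

  public static void main(String[] args) {
    String[] tests = { "testA", "testB", "testC" };
    boolean[][] marked = {
      { true,  true,  false }, // Chaff #0
      { false, false, true  }, // Chaff #1
      { false, false, false }, // Chaff #2
    };
    System.out.println(groupColumns(marked, tests));
  }
}
```

In this sample, testA and testB land in the same group because they fail on exactly the same chaffs, so one of them adds nothing the other does not. The all-unmarked row for Chaff #2 shows a chaff that no test method catches yet.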

Hints for writing good Examplar examples