Lecture 3.2: Working with Specifications

Last class, we discussed what specifications are; now, we will discuss what we can do with them. Let's take a look at the spec for sort : int list -> int list, which has two parts:

(P1) The result should be sorted: if \(\mathsf{sort}(xs) = ys\), then \(\forall 0 \leq i < j < |ys|, ys[i] \leq ys[j]\).
(P2) The result should have the same number of each element: if \(\mathsf{sort}(xs) = ys\), then \(\forall i, \mathsf{count}(xs, i) = \mathsf{count}(ys, i)\), where \(\mathsf{count}(xs, i)\) counts the number of occurences of \(i\) in \(xs\).

Formalizing a spec

To do anything interesting with this specification, we will first formalize it by encoding it into a programming language.

let rec all (p : int -> bool) (xs : int list) =
    match xs with
    | [] -> true
    | x :: xs' -> p x && all p xs'


let rec sorted (xs : int list) = 
    match xs with 
    | [] -> true
    | _ :: [] -> true
    | i :: j :: xs' -> i <= j && sorted (j :: xs')


let rec count (xs : int list) (y : int) =
    match xs with
    | [] -> 0
    | x::xs' -> (if x = y then 1 else 0) + count xs' y

let equal_elements (xs : int list) (ys : int list) = 
    all (fun (x : int) -> 
        count xs x = count ys x
    ) xs
    &&
    all (fun (x : int) -> 
        count xs x = count ys x
    ) ys

let sort_spec (xs : int list) (sort : int list -> int list) = 
    let ys = sort xs in 
    sorted ys && equal_elements xs ys

We made some specific coding choices above that made our two logical into a computable check that can actually be run. Indeed, sort_spec xs sort returns true if and only if sort returns the correct answer on xs. There are some tricky points here.

Why is our sorted function the same as our predicate for the list being sorted?
What about equal_elements?

What to do with a formalized spec?

What we have done above by defining sort_spec is show that, for a given list xs, we can decide whether sort behaves correctly on that list xs. Correctness of an algorithm on a specific input is usually easy to directly check. What's harder is showing that sort behaves correctly on all inputs:

\[ \forall xs \in \texttt{int list}, \texttt{sort_spec}(xs, sort). \]

In general, questions like this one above is undecidable: no computer program, given the code for sort, can automatically compute with absolute certainty whether the above formula is true. This is mostly due to the fact that int list is infinite (modulo restrictions in OCaml about the sizes of integers), so we can't exhaustively check every case. However, even though this problem is undecidable, there are still a lot of things you can do.

Decidability

We can generalize the above discussion to talk about decidability in general. Given a set \(S\) (encoded via logical specs), the decision problem for \(S\) is the task of asking whether a given function \(f\) is inside of \(S\) or not. The decision problem for \(S\) is decidable if there exists a program decide that, given \(f\), can always output a correct yes/no answer for whether \(f \in S\). A problem is undecidable if there does not exist such a program. Thus, this set is decidable:

\[ S = \{(xs, f) \mid \texttt{sort_spec}(xs, f) \}\]

while this one isn't:

\[ S = \{f \mid \forall xs, \texttt{sort_spec}(xs, f) \}.\]

Property-Based Testing

Given a specification \(S\) given via logical specs, we can use property-based testing, or PBT, to test whether a given function \(f\) is in \(S\). The essence of PBT is to reinterpret all universal quantifiers as probabilistic checks, where we sample from a distribution (often many times). While we can't decide if \(f \in S\) with absolute certainty, PBT still allows us to automatically find counter-examples with minimal effort.

Since our logical spec for sort is \(\forall xs, \texttt{sort_spec}(xs, \texttt{sort})\), we see we have one quantifier that we need to sample for. We sample this quantifier using a generator:

let gen_uniform_list (len_bound : int) : unit -> int list = fun () ->  
    let len = Random.int len_bound in 
    (* Standard library function to create lists using a function *)
    List.init len (fun _ -> Random.int 100)

A note about Random

In OCaml, Random.int generates random values based on a pseudorandom number generator (PRNG), which creates random values given a seed. If you use Random.int (or any other function in Random) and you want it to actually be random, you need to call Random.self_init () first. Alternatively, you can fix the seed by calling Random.init 32.

In general, a generator for t will have type unit -> t; thus, we can think of gen_uniform_list as being a function from integers (for the length bound) to generators for int list. We now create a sampler for the quantifier, which we will suggestively call forall:

let rec forall_list (gen : unit -> int list) (prop : int list -> bool) (trials : int) : (int list) option =
    if trials <= 0 then     
        None (* No counter-example found *)
    else 
        let xs = gen () in 
        if prop xs then 
            forall_list gen prop (trials - 1)
        else Some xs

Above, we are using the OCaml standard library functions for List. We will use them more going forward. If all tests succeed, we return None; else, we return Some xs, where xs is the counter-example of correctness. Additionally, we are abstracting over the generator we will use for the input.

Testing a sort function

Now that we have set things up, we can test whether a particular function is a good sorting algorithm. Let's use insertion sort:

let rec insert (x : int) (xs : int list) : int list =
  match xs with
  | [] -> [x]
  | y :: ys -> if x <= y then x :: y :: ys else y :: insert x ys

let rec isort (xs : int list) =
  match xs with
  | [] -> []
  | y :: ys -> insert y (isort ys)

Now, to test it:

utop# forall_list (gen_uniform_list 10) (fun xs -> 
    sort_spec xs isort
) 32;;
- : int list option = None

utop# let bozo_sort (xs : int list) = xs ;; 
val bozo_sort : int list -> int list

utop # forall_list (gen_uniform_list 5) (fun xs -> sort_spec xs bozo_sort) 32;;
- : int list option = Some [62; 10; 65; 96]

We can see if we introduce a bug into isort, PBT can catch it:

let rec insert' (x : int) (xs : int list) : int list =
  match xs with
  | [] -> [x]
  | y :: ys -> if x <= y then y :: x :: ys else y :: insert x ys

let rec isort' (xs : int list) =
  match xs with
  | [] -> []
  | y :: ys -> insert' y (isort' ys)

Unsoundness of PBT

While PBT is easy to use, its limitation is that it's unsound: given a function \(f\), PBT may say that \(f \in S\) (since all tests pass), but it might not actually be true that \(f \in S\). Let's explore unsoundness with an example:

let sort : int list -> int list = List.sort Int.compare ;;

let bad_sort (xs : int list) = 
    if List.length xs >= 300 && sorted xs then 
        []
    else 
        sort xs

What will PBT say about bad_sort, if we use the gen_list generator? In all likelihood, it will not catch any bug because it's extremely unlikely to generate a list of length 300 that is sorted.

utop # forall (gen_uniform_list 500) (fun xs -> sort_spec xs bad_sort) 1000;;
- : int list option = None

What this shows us is that PBT is as only as good as the generators. Let's create a new generator that will (hopefully) catch the bug:

let gen_sorted (len : int) : unit -> int list = fun () -> 
    let rec aux (k : int) : int list * int = 
        if k <= 0 then ([], 0) else 
            let (xs, last) = aux (k - 1) in 
            let x = last + Random.int 10 in 
            (xs @ [x], x)
    in 
    fst (aux (Random.int len))

Here, gen_sorted is a random generator that only outputs sorted lists. We can then catch the bug by using gen_sorted instead of gen_list.

utop # forall_list (gen_sorted 500) (fun xs -> sort_spec xs bad_sort) 10;;
- : int list option = Some [3; 6; ...]

If we wanted, we could even fold both generators into one, by flipping a coin and choosing one to run:

let one_of (f : unit -> int list) (g : unit -> int list) (len : int) : unit -> int list = 
    if Random.int 10 <= 7 then 
        gen_uniform_list len
    else 
        gen_sorted len

Which generator to use?

Above, we only caught the bug in bad_sort because we knew it was testing for a sorted input (and thus we created a generator for sorted lists). This shows us two things:

Uniform random inputs do not catch all corner cases --- especially corner cases that assume a lot of "structure" on the inputs (such as the input being sorted).
To deploy PBT, sometimes it is useful to inspect the implementation of the code you're analyzing to learn about which generators to use. If you see that the code is testing the input for property X, it might be good to generate inputs that satisfy property X.

Aside: Program Verification

Another route to checking sort works on all inputs is to mathematically prove it, by using proof assistants, such as Rocq, Lean, ACL2, and Dafny, among many others. These tools are incredibly powerful, as they let you actually prove that sort is correct for all inputs --- not just probabilistically check it is correct for some. The tradeoff, though, is that these tools are quite a bit more involved to use. Indeed, the user has to often write many proof annotations in addition to their functions and specifications. Indeed, very often the user needs to write more "proof annotations" than actual code by a 5:1 ratio (or even higher!).

While we won't discuss program verification more in this class, the art of choosing a specification, and encoding it into a programming language, is necessary for both PBT and verification (and is quite useful even if you aren't testing).

Another example for unsoundness

Is it true that \(x \leq x + 1\) if \(x\) is an OCaml int?

Let's see if it is:

utop # 
let test (gen_int : unit -> int) = 
    let i = gen_int () in 
    i <= i + 1 ;;
- : (unit -> int) -> bool

utop # test (fun _ -> Random.int 500);;
- : bool = true (* This will always return true *)

But if we change the distribution :

utop # let gen_large () = 
    4611686018427387903 - Random.int 5;;
- : bool = false (* Uh-oh! *)

This is because OCaml ints have a maximum value, and wrap around if we exceed that maximum value:

utop # 4_611_686_018_427_387_903 + 1;;
- : int = -4611686018427387904

Whether this matters will depend heavily on your use case. Are you using int to store quantities in an inventory system? You likely won't hit this number (around 4000 trillion). But if you are sampling unique integers from the space of all integers, then you need to be careful!