Homework 8: Parsing CSVs
Due: Tuesday, November 4 11:59pm
Start with the template code here.
- Replace each
failwith "TODO"with your solution. - Be sure to test your solutions in
utopor in tests alongside your submission. To load your file intoutop, use#use "hw8.ml";;. - Submit your version of
hw8.mlto the HW8 assignment on Gradescope. Only submithw8.ml; NOT any other files or folders. - At the top of the template file, there is a comment for you to write in approximately how many hours you spent on the assignment. Filling this in is optional, but helpful feedback for the instructors.
A common format for storing and transmitting data is CSV, or comma-separated values. Suppose we want to manage a database consisting of names, ages, and eye colors. Then, we would do so in the following format:
CSV files are used ubiquitously to store data due to their simple format. In this homework, you will be writing a parser for CSVs, which takes in a database written in CSV format as input --- given as plain text --- and outputs a list of parsed rows (represented as OCaml lists) which later programs can then consume.
Moreover, our CSV parser will be streaming: meaning that the parser can handle CSV data coming in over time, and output well-parsed rows while still allowing more plain text to flow in. Because of this, your parser does not need to hold the entire CSV in memory, but instead only needs to remember single rows at a time. This is extremely important for certain usage scenarios, where real-time data coming from a sensor might easily generate hundreds of gigabytes of data over the course of a few hours.
Below, we will first give you a formal definition of the CSV format. Then, you will write your parser for the format in the form of a state machine. We will then work to specify the parser in terms of a serializer, which takes parsed CSV data as input and returns back raw text.
The CSV Format
Here, we give a formal description of the CSV format we will be working with. (This description is a bit simplified from the "real" CSV format, specified here).
Our CSV format will assume we have a fixed number, n, of columns. (In the above example, n = 3.) Our CSV format is comprised of rows, values, and tokens.
Rows: Each CSV consists of a number of rows. Each row must be separated by a newline (given by the OCaml character written '\n'.) Additionally, the last row of a CSV must end with a newline.
Values: Each row consists of exactly n values, stored as a list. Values are separated by a comma (,).
Tokens: Each value consists of a list of tokens, which can either be a single character (e.g., 'c') or an escaped string.
If the token is a single character, it CANNOT be a comma (,), newline \n or the character '^', which we use for escaping.
An escaped string is one that starts and ends with '^', and can contain any character EXCEPT '^'. Escaped strings CAN contain newlines and commas. An example of an escaped string is below:
Note that in OCaml (like other systems) we make reference to some characters by a backslash -- such as '\"' for double quotes. Additionally, whitespace, such as '\n' (for newline) and ' ' (for space) are valid OCaml characters. The only special characters in our CSV format are: '\n' (used for separating rows); ',' (used for separating values); and '^' (used for beginning/ending escaped strings).
In particular, the characters ' ', '\t', and '\"' are treated just like any other character.
Below are the relevant OCaml types:
type token = | TokChar of char (* Any character except `,`, `\n`, and `^` *)
| TokEscaped of string (* string cannot contain character ^ *)
type value = token list (* Can be arbitrarily long *)
type row = value list (* must have length n, where n is number of columns *)
type csv = row list (* Only used if we are talking about the ENTIRE CSV (not when we make the state machine) *)
Below is an example of a valid CSV, where n = 3:
csv:
[
(* Row 1 *)
[ (* Value 1 *)
[TokChar 'x'; TokChar 'y'; TokChar 'z'];
(* Value 2 *)
[TokChar '4'; TokChar '2'];
(* Value 3 *)
[TokChar 'h'; TokChar 'e'; TokChar 'l'; TokChar 'l'; TokChar 'o']
];
(* Row 2 *)
[
(* Value 1 *)
[TokChar 'f'; TokChar 'o'; TokChar 'o'; TokEscaped "\n\n\"hi\", there\n"];
(* Value 2 *)
[TokChar '3'; TokChar '7'];
(* Value 3 *)
[TokChar 'g'; TokChar 'o'; TokChar 'o'; TokChar 'd'; TokChar 'b'; TokChar 'y'; TokChar 'e'];
];
(* Row 3 *)
[
(* Value 1 *)
[TokChar 'a'; TokChar 'b'; TokChar 'c'];
(* Value 2 *)
[TokChar ' '; TokChar 'f', TokChar 'g'];
(* Value 3 *)
[TokChar ' '; TokChar ' '; TokChar ' '; TokExcaped ",,,"; TokChar '9'; TokChar '9']
]
]
Note that a value can be empty, so the following is a valid CSV:
which parses as[
(* Row 1 *)
[
(* Value 1 *)
[TokChar 'a'];
(* Value 2 *)
[TokChar 'b'];
(* Value 3 *)
[TokChar 'c']
];
[
(* Value 1 *)
[];
(* Value 2 *)
[];
(* Value 3 *)
[]
];
[
(* Value 1 *)
[TokChar 'd'];
(* Value 2 *)
[TokChar 'e'];
(* Value 3 *)
[TokChar 'f'];
]
]
Below are a few non-valid CSVs, with n = 3:
, after three values)
(not valid since we saw a newline (\n) before three values)
Q1: Checking well-formedness for CSVs
Given the above data types for CSVs, write a function
That checks whether the CSV structure c is valid. Here, "valid" means that:
- Each row has exactly
nvalues; and - Each value has only well-formed tokens.
A token is well-formed when:
- For
TokChar c, thatcis not'\n','^', or','; and - For
TokEscaped s, thatsdoes not contain'^'.
To handle TokEscaped s (for this problem and others), you will find it useful to convert between strings and lists of characters:
let string_of_char_list (cs : char list) : string =
String.of_seq (List.to_seq cs)
let char_list_of_string (s : string) : char list =
List.of_seq (String.to_seq s)
Q1: Grading Criteria
The correctness of well_formed_csv.
Q2: Parsing CSVs
Included in hw8.ml is our type for a state machine:
In this problem, you will design a state machine to parse CSVs. It will work as follows. The input type to the state machine will be a single character:
while the output will either be:Ok (representing that it correctly consumed an input char); Error (representing an error in parsing); or RowFinished r, where r : row:
Note that your machine only needs to output a row as soon as it's parsed (i.e., when we see a newline \n). Thus, the machine does NOT need to keep track of the entire CSV; just the current row that's being processed.
If any character is input that causes the CSV to become invalid, you must output Error and continue to output Error on all further inputs. If a character is input that does not cause an error or finish a valid row, you must output Ok.
Implement your state machine:
type csv_state = unit (* Implement with your own state type! *)
let csv_step (st : csv_state) (i : csv_input) = failwith "TODO"
let csv_machine (n : int) : (csv_state, csv_input, csv_output) machine = {
init = ();
step = csv_step;
}
The csv_machine function takes an integer n as input, which is the number of columns for each row. You may assume n is at least two.
Q2: Grading Criteria
Whether your state machine correctly parses and emits rows for CSVs.
Hint
Your CSV parser must keep track of multiple kinds of state. In particular, at least:
- Whether an error has happened;
- The static value
nfor the number of columns; - Whether we are currently processing an escaped string, and the accumulated characters for it;
- The value we are currently accumulating; and
- The row we are currently accumulating, including how many values it already contains.
To test your machine, the starter code includes a function
which will let you interactively input CSV data, and observe the output of your machine.Q3: Testing the Parser
We can now move on to specifying and testing the parser. First, we need a way to run the parser on an entire CSV input, rather than character by character. We can do this by iteratively running the machine, and collecting the rows that are output:
let make_parsed_csv (m : ('s, csv_input, csv_output) machine) (cs : char list) : csv option =
let rec go (curr : 's) (cs : char list) : csv option =
match cs with
| [] -> Some []
| c :: cs' ->
let (next, o) = m.step curr c in
match o with
| Ok -> go next cs'
| Error -> None
| RowFinished r ->
match go next cs' with
| None -> None
| Some c -> Some (r :: c)
in
go m.init cs
make_parsed_csv returns a csv option, because the state machine might fail and return Error at any point.
Also note that additional data after the last finished row is ignored; thus, we will only run make_parsed_csv on char lists that end in a newline.
Now, we can specifify our parser. We will do so by evaluating it against the "inverse" of parsing --- serializing --- which takes a parsed value of type csv and converts it back into a list of characters.
let string_of_csv (c : csv) =
String.concat "\n" (List.map string_of_row c)
let serialize_csv (c : csv) =
char_list_of_string (string_of_csv c) @ ['\n']
csvs ends with a newline, to match the behavior of our state machine.
Finally, given our parser and serializer, we can evaluate the parser by proving two round-trip properties, which state that parsing and serialization are inverses. First, if we have a particular input sequence that parses correctly --- and if it ends in a newline --- then when we serialize it back to an input sequence, we get where we started:
let last (xs : 'a list) : 'a option =
if List.is_empty xs then None else Some (List.nth xs (List.length xs - 1))
let parse_serialize_prop (m : ('s, csv_input, csv_output) machine) (cs : char list) =
if last cs = Some '\n' then
match make_parsed_csv m (cs @ ['\n']) with
| None -> true
| Some parsed ->
let cs' = serialize_csv parsed in
cs = cs'
else
true
Next, we can go the other way: if we serialize, and then parse, we should get back to where we started --- given that the csv value is well-formed.
let serialize_parse_prop (m : ('s, csv_input, csv_output) machine) (n : int) (c : csv) =
if well_formed_csv n c then
make_parsed_csv m (serialize_csv c)
=
Some c
else true
Note for both parse_serialize_prop and serialize_parse_prop, you must pass in csv_machine n for the first argument, where n is the number of columns.
Intuitively, parse_serialize_prop and serialize_parse_prop together test that your parser is correct relative to the serializer. Namely, that
- Every well-formed
csvvalue can be parsed after serialization; and - No two distinct lists of inputs to your state machine can result in the same parsed
csv.
Testing your Parser with PBT
In this question, your job is to robustly use property-based testing --- using forall, which is in hw8.ml ---
to test whether it is true that parse_serialize_prop holds for all inputs:
serialize_parse_prop holds for all inputs:
To test these properties, you must:
- Design a suite of generators for inputs for both properties (so, both of type
list charandcsv); - Check whether both of the above properties are true when tested with your generators, and write your findings in a comment; and
- Argue, in the form of comments provided next to your tests, that your generators are sufficiently testing the validity of the above theorems. In particular, giving a detailed description of what kinds of inputs cause the properties to be true, and what kinds of inputs cause the property to be false.
For parse_serialize_prop, you must randomly generate appropriate lists of characters. Since parse_serialize_prop requires that the last character is a newline, your generators must enforce this property in order for the property to not be trivial. Moreover, your generators should be able to generate both inputs that will parse correctly under your parser; and those which will cause your parser to output Error.
For serialize_parse_prop, your generated generator(s) for csv must return well-formed csvs, and similarly exercise various parts of the system, including: multiple rows; empty values; tokens containing escaped strings; and so on.
For both properties, you should also randomly generate the number of columns, n, and design your generators for char list and csv to take this n into account. You can do this by generating pairs, where the second element depends on the first:
let gen_prop2_pair : unit -> (int * csv) =
fun _ ->
let n = random_num_of_columns () in
let c = generate_csv n in
(n, c)
Note that we are not telling you whether Property 1 and 2 are both true or not for all inputs; this is something you should investigate!
Q3: Grading Criteria
Whether your suite of tests:
- uses
forallto test the validity of the above two universally quantified propositions; - includes a robust test suite of randomly generated inputs; and
- includes a report, written in a comment, that describes the results of your testing, why your random generators were good choices, and for what kinds of inputs you found caused the properties to be true or false.
Note: in addition to interactively testing your machine using run_csv_machine and randomly testing it using generators, you can also test your code by writing CSV values inline using OCaml's support for multiline strings, which begin with {| and end with |}:
d, e, f) up to where a, b, c starts, this would count as extra whitespace.