Lecture 9.1: Abstract Syntax for a calculator expression language

In the latter half of this class, we are exploring various ways of reasoning about computation. First, we looked at using modules and abstract types to enforce program invariants; then, we looked at state machines, which allow us to randomly test and specify traces over interactive systems. Our final subject will be how we can look at programming languages themselves as objects of study --- and how we can enforce correctness by thinking about how the language is constructed.

Overview

Understand the difference between syntax and semantics (with an emphasis on syntax)
Introduce a small calculator language (Calc)
Parse Calc into abstract syntax trees

Syntax and semantics

Let's take a look at an OCaml program:

foo.ml

let foo (x : int) = x + 1

let _ = print_endline (string_of_int (foo 32))

When we input this program into utop, we can see that it evalues to "33":

> utop foo.ml
33

How did this happen? Very roughly speaking, there are three stages:

First, utop took the foo.ml file, which is just a flat string at this point, and parsed it into a tree. This process is defined by the syntax of the language: what well-formed programs look like.
Then, this tree was type checked: this ensures, for example, that we don't add a string to an integer.
Finally, the tree was executed, which caused the program to actually run. This process is defined by the semantics: a method of turning the tree into an actual execution.

Today, we are going to delve into the world of syntax. We will cover semantics later this week, and then cover type checking (which not all languages have) next week. We can get a glimpse into this tree by running utop with the -dparsetree flag:

utop # let foo (x : int) = x + 1;;

Ptop_def
  [
    structure_item (//toplevel//[1,0+0]..[1,0+25])
      Pstr_value Nonrec
      [
        <def>
          pattern (//toplevel//[1,0+4]..[1,0+7])
            Ppat_var "foo" (//toplevel//[1,0+4]..[1,0+7])
          expression (//toplevel//[1,0+8]..[1,0+25]) ghost
            Pexp_function
            [
              Pparam_val (//toplevel//[1,0+8]..[1,0+17])
                Nolabel
                None
                pattern (//toplevel//[1,0+8]..[1,0+17])
                  Ppat_constraint
                  pattern (//toplevel//[1,0+9]..[1,0+10])
                    Ppat_var "x" (//toplevel//[1,0+9]..[1,0+10])
                  core_type (//toplevel//[1,0+13]..[1,0+16])
                    Ptyp_constr "int" (//toplevel//[1,0+13]..[1,0+16])
                    []
            ]
            None
            Pfunction_body
              expression (//toplevel//[1,0+20]..[1,0+25])
                Pexp_apply
                expression (//toplevel//[1,0+22]..[1,0+23])
                  Pexp_ident "+" (//toplevel//[1,0+22]..[1,0+23])
                [
                  <arg>
                  Nolabel
                    expression (//toplevel//[1,0+20]..[1,0+21])
                      Pexp_ident "x" (//toplevel//[1,0+20]..[1,0+21])
                  <arg>
                  Nolabel
                    expression (//toplevel//[1,0+24]..[1,0+25])
                      Pexp_constant
                      constant (//toplevel//[1,0+24]..[1,0+25])
                        PConst_int (1,None)
                ]
      ]
  ]

val foo : int -> int = <fun>

We call this tree the AST (for Abstract Syntax Tree). Here, we use the term "abstract syntax" to distinguish it from "concrete syntax", which is the actual sequence of characters that make up the string "let foo (x : int) = x + 1". While more complicated, abstract and concrete syntax carry the same amount of information; indeed, the parser is meant to convert from concrete to abstract syntax.

The AST is very complicated, but in a sense, it is just a tree, in the same way that

type 'a tree = Leaf of 'a | Branch of 'a tree * 'a tree

is a tree. We can see the actual type for OCaml ASTs here. You can see that the OCaml AST is defined in OCaml itself; thus, the developers of OCaml are using OCaml to implement itself. (This is common for programming languages, and is called bootstrapping). This type for ASTs is very complicated, as OCaml is a fully-featured language. However, if you look, it does not use any language features we have not seen in class. The only exception is mutually-recursive types:

type even_int = Zero | Plus1 of odd_int
and odd_int = One | Plus1 of even_int

To some extent, every language behaves in this way. For example, in Python, we can look at the AST from within Python itself:

>>> import ast
>>> print(ast.dump(ast.parse('def id(x): return x'), indent=4))
Module(
    body=[
        FunctionDef(
            name='id',
            args=arguments(
                args=[
                    arg(arg='x')]),
            body=[
                Return(
                    value=Name(id='x', ctx=Load()))])])

You can explore how ASTs are used in various languages via this website, including OCaml.

Specifying the Syntax for a Calculator Language

Let's now explore what it means to define our own language. As we saw above, programming language syntax is defined using trees --- thus, to define a language, we first need to define the type of the AST. While "real" languages have very complicated trees, there is nothing stopping us from starting small.

We will define the tree for a language, Calc, for calculator expressions. Calc will support natural numbers, operators for + and *, and parentheses in order to group computations together. Some examples of well-formed programs for Calc includes (using OCaml-style comments just for illustration):

1       (* numeric constants *)
1 + 2   (* addition of subexpressions *)
(3 + 4) * 5 (* We use parens to make sure we do the multiplication first *)
3 + (4 * 5) (* This should result in a different tree than above *)
16 * ((2 + 2)) (* Multiple levels of parens is OK *)

Here are some ill-formed ones:

-1         (* No negation supported *)
2 2        (* Can't put expressions side-by-side like this *)
1 + ()    (* Parentheses must hold an expression inside *)

Here, we are using infix notation for + and *, but it's worth noting that we could have made other choices as well. Consider 1 + 2:

We could use generic function-like syntax: plus(1, 2)
Like Racket, we could use parentheses where the operator comes first: (+ 1 2)
Some stack-based languages do the opposite: 1 2 +

What are some pros / cons of these various choices?

Parsing and Abstract Syntax

Above, we informally specified the intended concrete syntax of our Calc language. Now, we can get formal, by defining a tree of expressions, which defines the abstract syntax:

type expr =
  | Nat of int
  | Add of expr * expr
  | Mul of expr * expr

Let's now see how we can informally map concrete syntax to abstract syntax:

(3 + 4) * 5

--->

Mul (Add(3, 4), 5)

Which can be drawn like so:

     '*'
     / \
   '+'  5
   / \
  3   4

Now, we can see how parentheses can matter. What about this one?

3 + (4 * 5)

-->

Add (3, Mul(4, 5))

As we can see, the way parentheses can used can dramatically alter the structure of the tree. If we were to not use parentheses, and instead follow PEMDAS:

1 + 2 * 3 + 4

-->

Add (1, Add (Mul (2, 3), 4))

One important note: even though the Nat constructor takes an int, that does NOT mean that "-1" should be a valid piece of concrete syntax. Instead, we are using int to hold natural numbers.

Creating a Parser

Now that we have our type of ASTs, expr, we can construct a function to parse. How to correctly and efficiently parse programming languages has a surprisingly deep literature! However, for us, parsing will be simple. A parser is just a function from strings to expr (that is allowed to throw an exception if the input cannot be parsed):

let parse (s : string) : expr = ...

Here, we will outline the basic ideas building a parser, which are illustrated in the companion .ml file. In a nutshell, parsing works like this: first, we conver the sequence of characters in the string to a list of tokens (like we see in HW8). Each token is a "logical unit" that should not be broken down further. For Calc, this means two things:

When we see a number with multiple digits (e.g., "1234"), we should think of this as a single token holding the integer 1234; and
excess whitespace should be ignored. Thus, 1 + 2 should produce the same tokens as 1 + 2.

For Calc and 1 + 2, the tokens will look something like:

type token = 
    | NAT of int
    | PLUS
    | TIMES
    | LPARENS
    | RPARENS
    | EOF (* End-of-file marker *)

Thus, the string "1 + (2 * (3))" will "tokenize" as [NAT 1; PLUS; LPARENS; NAT 2; TIMES; LPARENS; NAT 3; RPARENS; RPARENS].

Next, after converting the raw string into a list of tokens, we use a recursive algorithm to convert the list into a corresponding tree:

[NAT 1; PLUS; LPARENS; NAT 2; TIMES; LPARENS; NAT 3; RPARENS; RPARENS] 

-->

Add(1, Mul (2, 3)).

As you can see, the extra parentheses in the concrete syntax "1 + (2 * (3))" go away: indeed, they are there just to help the parser construct the intended tree. The approach in the sample parser we are using maintains two stacks: one of partial expressions we are building, and the other of pending operations to evaluate[1]. Iterating over our tokens, we enqueue operands to our operand stack or apply operations and evaluate expression to build up partial AST nodes in our output stack. At the end of this process, we end up with an empty operand stack and a single AST expression on our output stack.

Loading this all into utop:

utop # #use "9.1.ml";;

utop # parse "3";;
- : expr = Nat 3

utop # parse "(4 + 5) + 3";;
- : expr = Add (Add (Nat 4, Nat 5), Nat 3)

utop # parse "((4 + 5) + ((3) * 7))";;
- : expr = Add (Add (Nat 4, Nat 5), Mul (Nat 3, Nat 7))

The companion file also contains a function to pretty-print exprs:

val string_of_expr : expr -> string = <fun>

We can copy/paste the expressions produced from parsing above, to pretty-print the resulting AST nodes:

utop # string_of_expr (Nat 3);;
- : string = "3"

utop # string_of_expr (Add (Add (Nat 4, Nat 5), Nat 3));;
- : string = "(4 + 5) + 3"

utop # string_of_expr (Add (Add (Nat 4, Nat 5), Mul (Nat 3, Nat 7)));;
- : string = "(4 + 5) + (3 * 7)"

Together, there is a roundtrip property that holds where we can state

forall (e : expr), parse (string_of_expr e) = e

Exercise

What are some reasons that the statement, string_of_expr (parse s) = s, should not hold?

Bonus: Context-Free Grammars and BNF Grammars

(This part is only for the interested reader.)

Another way to specify the type of an AST for a programming language is not by an implementation in a particular language, like OCaml, but instead by specifying it on paper. We do this by writing something called a context-free grammar, or CFG. CFGs are used for more than just programming languages: indeed, CFGs (and various extensions of them) can also be used to analyze natural languages, and were invented for this purpose. For example, here is an AST for the sentence "I ride a bicycle":

      S
     / \
   NP   VP
   |   /  \
   I  V   NP
      |    / \
    ride  Det N
            | |
           a bicycle

Here, S stands for "sentence", while NP and VP stand for "noun phrase" and "verb phrase"; V is "verb", and N is "noun", while Det is "determiner" (e.g., words like "a" and "the").

To describe context-free grammars, like those that describe ASTs for programming languages, we use a particular form called a BNF grammar (standing for Backus-Naur, the two co-creators):

Natural n ∈ ℕ

Expression e
   := n
    | e₁ '+' e₂
    | e₁ '*' e₂
    | '(' e ')'

This grammar states the symbol n represents a "Natural" and is a valid element of the mathematical naturals ℕ. It also states that expressions e are valid if: they are naturals n, or if you have two expressions e₁ and e₂ with a symbol of + or * between them (which is infix notation). You can also surround any expression e with a left-paren ( before and a right-paren ) after the e.

Footnotes

[1] - Specifically, our implementation is based on the Shunting-yard algorithm.