Lecture 9.1: Abstract Syntax for a calculator expression language
In the latter half of this class, we are exploring various ways of reasoning about computation. First, we looked at using modules and abstract types to enforce program invariants; then, we looked at state machines, which allow us to randomly test and specify traces over interactive systems. Our final subject will be how we can look at programming languages themselves as objects of study --- and how we can enforce correctness by thinking about how the language is constructed.
Overview
- Understand the difference between syntax and semantics (with an emphasis on syntax)
- Introduce a small calculator language (Calc)
- Parse Calc into abstract syntax trees
Syntax and semantics
Let's take a look at an OCaml program:
When we input this program into utop, we can see that it evalues to "33":
How did this happen? Very roughly speaking, there are three stages:
- First,
utoptook thefoo.mlfile, which is just a flat string at this point, and parsed it into a tree. This process is defined by the syntax of the language: what well-formed programs look like. - Then, this tree was type checked: this ensures, for example, that we don't add a string to an integer.
- Finally, the tree was executed, which caused the program to actually run. This process is defined by the semantics: a method of turning the tree into an actual execution.
Today, we are going to delve into the world of syntax. We will cover semantics later this week, and then cover type checking (which not all languages have) next week.
We can get a glimpse into this tree by running utop with the -dparsetree flag:
utop # let foo (x : int) = x + 1;;
Ptop_def
[
structure_item (//toplevel//[1,0+0]..[1,0+25])
Pstr_value Nonrec
[
<def>
pattern (//toplevel//[1,0+4]..[1,0+7])
Ppat_var "foo" (//toplevel//[1,0+4]..[1,0+7])
expression (//toplevel//[1,0+8]..[1,0+25]) ghost
Pexp_function
[
Pparam_val (//toplevel//[1,0+8]..[1,0+17])
Nolabel
None
pattern (//toplevel//[1,0+8]..[1,0+17])
Ppat_constraint
pattern (//toplevel//[1,0+9]..[1,0+10])
Ppat_var "x" (//toplevel//[1,0+9]..[1,0+10])
core_type (//toplevel//[1,0+13]..[1,0+16])
Ptyp_constr "int" (//toplevel//[1,0+13]..[1,0+16])
[]
]
None
Pfunction_body
expression (//toplevel//[1,0+20]..[1,0+25])
Pexp_apply
expression (//toplevel//[1,0+22]..[1,0+23])
Pexp_ident "+" (//toplevel//[1,0+22]..[1,0+23])
[
<arg>
Nolabel
expression (//toplevel//[1,0+20]..[1,0+21])
Pexp_ident "x" (//toplevel//[1,0+20]..[1,0+21])
<arg>
Nolabel
expression (//toplevel//[1,0+24]..[1,0+25])
Pexp_constant
constant (//toplevel//[1,0+24]..[1,0+25])
PConst_int (1,None)
]
]
]
val foo : int -> int = <fun>
We call this tree the AST (for Abstract Syntax Tree). Here, we use the term "abstract syntax" to distinguish it from "concrete syntax", which is the actual sequence of characters that make up the string "let foo (x : int) = x + 1".
While more complicated, abstract and concrete syntax carry the same amount of information; indeed, the parser is meant to convert from concrete to abstract syntax.
The AST is very complicated, but in a sense, it is just a tree, in the same way that
is a tree. We can see the actual type for OCaml ASTs here. You can see that the OCaml AST is defined in OCaml itself; thus, the developers of OCaml are using OCaml to implement itself. (This is common for programming languages, and is called bootstrapping). This type for ASTs is very complicated, as OCaml is a fully-featured language. However, if you look, it does not use any language features we have not seen in class. The only exception is mutually-recursive types:To some extent, every language behaves in this way. For example, in Python, we can look at the AST from within Python itself:
>>> import ast
>>> print(ast.dump(ast.parse('def id(x): return x'), indent=4))
Module(
body=[
FunctionDef(
name='id',
args=arguments(
args=[
arg(arg='x')]),
body=[
Return(
value=Name(id='x', ctx=Load()))])])
Specifying the Syntax for a Calculator Language
Let's now explore what it means to define our own language. As we saw above, programming language syntax is defined using trees --- thus, to define a language, we first need to define the type of the AST. While "real" languages have very complicated trees, there is nothing stopping us from starting small.
We will define the tree for a language, Calc, for calculator expressions. Calc will support natural numbers, operators for + and *, and parentheses in order to group computations together. Some examples of well-formed programs for Calc includes (using OCaml-style comments just for illustration):
1 (* numeric constants *)
1 + 2 (* addition of subexpressions *)
(3 + 4) * 5 (* We use parens to make sure we do the multiplication first *)
3 + (4 * 5) (* This should result in a different tree than above *)
16 * ((2 + 2)) (* Multiple levels of parens is OK *)
-1 (* No negation supported *)
2 2 (* Can't put expressions side-by-side like this *)
1 + () (* Parentheses must hold an expression inside *)
Here, we are using infix notation for + and *, but it's worth noting that we could have made other choices as well. Consider 1 + 2:
- We could use generic function-like syntax:
plus(1, 2) - Like Racket, we could use parentheses where the operator comes first:
(+ 1 2) - Some stack-based languages do the opposite:
1 2 +
What are some pros / cons of these various choices?
Parsing and Abstract Syntax
Above, we informally specified the intended concrete syntax of our Calc language. Now, we can get formal, by defining a tree of expressions, which defines the abstract syntax:
Let's now see how we can informally map concrete syntax to abstract syntax:
Which can be drawn like so:Now, we can see how parentheses can matter. What about this one?
As we can see, the way parentheses can used can dramatically alter the structure of the tree. If we were to not use parentheses, and instead follow PEMDAS:
One important note: even though the Nat constructor takes an int, that does NOT mean that "-1" should be a valid piece of concrete syntax. Instead, we are using int to hold natural numbers.
Creating a Parser
Now that we have our type of ASTs, expr, we can construct a function to parse. How to correctly and efficiently parse programming languages has a surprisingly deep literature! However, for us, parsing will be simple. A parser is just a function from strings to expr (that is allowed to throw an exception if the input cannot be parsed):
Here, we will outline the basic ideas building a parser, which are illustrated in the companion .ml file.
In a nutshell, parsing works like this:
first, we conver the sequence of characters in the string to a list of tokens (like we see in HW8). Each token is a "logical unit" that should not be broken down further. For Calc, this means two things:
- When we see a number with multiple digits (e.g.,
"1234"), we should think of this as a single token holding the integer1234; and - excess whitespace should be ignored. Thus,
1 + 2should produce the same tokens as1 + 2.
For Calc and 1 + 2, the tokens will look something like:
Thus, the string "1 + (2 * (3))" will "tokenize" as [NAT 1; PLUS; LPARENS; NAT 2; TIMES; LPARENS; NAT 3; RPARENS; RPARENS].
Next, after converting the raw string into a list of tokens, we use a recursive algorithm to convert the list into a corresponding tree:
As you can see, the extra parentheses in the concrete syntax "1 + (2 * (3))" go away: indeed, they are there just to help the parser construct the intended tree.
The approach in the sample parser we are using maintains two stacks: one of partial expressions we are building, and the other of pending operations to evaluate[1]. Iterating over our tokens, we enqueue operands to our operand stack or apply operations and evaluate expression to build up partial AST nodes in our output stack. At the end of this process, we end up with an empty operand stack and a single AST expression on our output stack.
Loading this all into utop:
utop # #use "9.1.ml";;
utop # parse "3";;
- : expr = Nat 3
utop # parse "(4 + 5) + 3";;
- : expr = Add (Add (Nat 4, Nat 5), Nat 3)
utop # parse "((4 + 5) + ((3) * 7))";;
- : expr = Add (Add (Nat 4, Nat 5), Mul (Nat 3, Nat 7))
The companion file also contains a function to pretty-print exprs:
We can copy/paste the expressions produced from parsing above, to pretty-print the resulting AST nodes:
utop # string_of_expr (Nat 3);;
- : string = "3"
utop # string_of_expr (Add (Add (Nat 4, Nat 5), Nat 3));;
- : string = "(4 + 5) + 3"
utop # string_of_expr (Add (Add (Nat 4, Nat 5), Mul (Nat 3, Nat 7)));;
- : string = "(4 + 5) + (3 * 7)"
Together, there is a roundtrip property that holds where we can state
Exercise
What are some reasons that the statement, string_of_expr (parse s) = s, should not hold?
Bonus: Context-Free Grammars and BNF Grammars
(This part is only for the interested reader.)
Another way to specify the type of an AST for a programming language is not by an implementation in a particular language, like OCaml, but instead by specifying it on paper. We do this by writing something called a context-free grammar, or CFG. CFGs are used for more than just programming languages: indeed, CFGs (and various extensions of them) can also be used to analyze natural languages, and were invented for this purpose. For example, here is an AST for the sentence "I ride a bicycle":
Here, S stands for "sentence", while NP and VP stand for "noun phrase" and "verb phrase"; V is "verb", and N is "noun", while Det is "determiner" (e.g., words like "a" and "the").
To describe context-free grammars, like those that describe ASTs for programming languages, we use a particular form called a BNF grammar (standing for Backus-Naur, the two co-creators):
This grammar states the symbol n represents a "Natural" and is a valid element of the mathematical naturals ℕ. It also states that expressions e are valid if: they are naturals n, or if you have two expressions e₁ and e₂ with a symbol of + or * between them (which is infix notation). You can also surround any expression e with a left-paren ( before and a right-paren ) after the e.
Footnotes
[1] - Specifically, our implementation is based on the Shunting-yard algorithm.