Lecture 5: Runtime representations for multiple value types
We ought to be able to handle more kinds of values than just numbers. Let’s try to add at least one more type of value to our language. Doing so will trigger a large number of problems, which will all need to be resolved to make things work again.
1 Growing the language: adding boolean values
When we introduced conditionals and branching control-flow last time, we were a bit dissatisfied with the semantics of “branch if it’s nonzero or not”: we’d much rather branch on genuine boolean values, which our language does not yet have.
Do Now!
What new primitive operations should we add to operate with booleans? What should happen if we give numeric arguments to boolean operations? Or vice versa?
Exercise
We currently represent numbers as themselves. How can we represent booleans?
Exercise
How can we prevent things from going wrong?
The key challenge, from which all the other problems today will arise, is that our current encoding of numbers is too big. We’ve assigned a numeric meaning to every 32-bit pattern, and therefore used them all up: there are no bit patterns left to represent boolean values. Somehow, we need to carve out space for our booleans. As soon as we do that, though:
Not all operations should operate on all bit patterns, which means we have to ensure that operations can distinguish valid operands from invalid ones.
Not all patterns can be considered numbers, which means we have to ensure that our compilation of numbers makes sense in this new extended setup.
It becomes apparent that just supporting booleans is shortsighted: we’d like to be able to broaden these techniques to other datatypes later on, such as pointers, tuples, structures, closures, or others.
A lot of work, for just two values! But that work forms a blueprint to follow when growing the language further.
1.1 Wait, what about static typing?
Surely we can resolve some if not all of these problems by designing a type-checker for our language? After all, if we can determine statically that the program never misuses an operation with the wrong type of operands, then we don’t need to distinguish those types of values at runtime?
Sadly, not quite: while some operations will be simplified if we do this, there are still features we’d like to have in our compiler, such as garbage collection, that will inevitably need to treat some types of values differently than others. Without getting into the details now (we’ll save that for later), learning how to deal with distinguishing values now, while the language is still small, will be good practice when the language gets more complicated soon.
1.2 The new concrete syntax
Today’s language changes themselves are syntactically quite simple: only two new expression forms, and a bunch of new operators. Notice that we have operators that consume and return booleans, and also operators that consume numbers and return booleans – language features interact.
1.3 Examples
Exercise
Develop a suite of test examples for using booleans. Be thorough, and consider as many edge cases as possible.
1.4 Enhancing the abstract syntax
The abstract syntax changes are similarly straightforward to the concrete ones.
type prim1 = ... | Not
type prim2 = ... | And | Or | Less | Greater | LessEq | GreaterEq | Eq | Ne
type 'a expr = ... | EBool of bool * 'a
Like numbers, we will consider EBool _ values to be immediates for the purposes of ANF: no further computation on them is needed.
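For instance, the immediacy check in our ANF transformation presumably just gains one more case. A minimal sketch, assuming the expr type above (the helper name is_imm is hypothetical):

(* Sketch: which expressions are already immediate, for ANF purposes? *)
let is_imm (e : 'a expr) : bool =
  match e with
  | ENumber _ -> true
  | EBool _ -> true   (* booleans, like numbers, need no further evaluation *)
  | _ -> false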
2 Representation choices
If all of our bit-patterns so far are used for numbers, then when we choose some patterns to reinterpret as booleans, we must have some way of distinguishing boolean (or other) patterns from numeric ones.
A note about notation: since we will be discussing actual bit-level representations today, it will be useful to use binary or hexadecimal notation for numbers, rather than the more common decimal notation. For example,
0b00000000_00000000_00000000_00001010 = 0x0000000A = 10
For clarity, if needed, we’ll include underscores to separate the bytes in the binary representation. Further, if there are bits that we don’t care about, we will mark them as an x in the binary representation.
2.1 Attempt 1: Why limit ourselves to 32-bits?
If all the patterns in a 32-bit word are used up...let’s use more words.
Suppose we encode all our values as two words, where the first word is a tag word describing what type of value comes next: say, a tag of 0 meaning “the next word is a number”, and a tag of 1 meaning “the next word is a boolean”.
Do Now!
List at least three consequences of this approach.
On the one hand, this approach is certainly future-proof: we have room to distinguish 4 billion distinct types of data in our programs! But on the other, consider how we would have to implement any of our operations. Because our values are now larger than our word size, they won’t fit into registers any more, which in turn means that our invariant of “the answer always goes in EAX” won’t work any longer. Instead, we will need to always put our answers on the stack, and then put the address of the first word of the answer into EAX: i.e., the actual answer can be found at [EAX] and [EAX - 4]. This in turn means that every single operation we want to perform will require at least two memory reads and two memory writes, a steep price to pay for every operation.
2.2 Attempt 2: We don’t need 32 bits for a tag
If having 32 tag bits is overkill, let’s try the minimum needed: could we get away with a single tag bit? We can pick any bit we like for this, but for reasons that will become apparent soon, let’s choose the least-significant bit (LSB) to denote numbers versus booleans:
LSB = 0 implies the value is a number
LSB = 1 implies the value is a boolean
This may still feel like overkill: we have 2 billion bit patterns reserved just for booleans, and we’ve given up half of our available space for numbers. In practice, we’ll accept the loss of range on numbers for now, and we’ll refine our notion of what it means to “tag a boolean” so that we can use the rest of that reserved space for other types of values, later.
2.2.1 Booleans
We now need to choose representations for true and false themselves. Since it’s very likely that tags are going to get more complex later, we probably want to reserve several low-order bits for future tag use, so let’s distinguish true from false by the most-significant bit (MSB):
MSB = 0 encodes false
MSB = 1 encodes true
In other words, our true and false values will therefore be 0b1xxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxxx1 and 0b0xxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxxx1, respectively. We can choose whatever bits we want in between: two obvious choices would be either all zeros or all ones.
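For concreteness, here is one possible pair of choices, filling the middle bits with ones; these exact values are an assumption, not forced by anything above, and Const is the same constructor used in compile_imm below:

(* One concrete choice: all middle bits set to one.
   The MSB distinguishes true from false; the LSB marks the value as a boolean. *)
let const_true  = Const(0xFFFFFFFF)  (* 0b11111111_11111111_11111111_11111111 *)
let const_false = Const(0x7FFFFFFF)  (* 0b01111111_11111111_11111111_11111111 *)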
2.2.2 Numbers
How should we represent numbers, then? Since we’ve decided that an LSB of zero indicates a number, we effectively have all the bit-patterns-formerly-interpreted-as-even-numbers available to us as numbers. This lends itself to a convenient encoding: we’ll simply represent the number n as the bit-pattern that used to represent 2n:
Number | Representation
1      | 0b00000000_00000000_00000000_00000010
6      | 0b00000000_00000000_00000000_00001100
42     | 0b00000000_00000000_00000000_01010100
This is clearly a one-to-one mapping, so it’s a faithful encoding; let’s see if it’s a useful one, too.
2.3 Compiling constants
Compiling our new constants is fairly straightforward. There is a twist: we need to ensure at compile time that any numeric constants we have in our source program aren’t inherently overflowing:
let compile_imm e env =
  match e with
  | ENumber(n, _) ->
    (* valid range: -(2^30) through 2^30 - 1 *)
    if n > 1073741823 || n < -1073741824 then
      failwith ("Integer overflow: " ^ (string_of_int n))
    else
      Const(n lsl 1)
  | EBool(true, _) -> const_true
  | EBool(false, _) -> const_false
  ...
The lsl operator logically shifts the bits of its first operand left by the number of places given by its second operand; the newly vacated bits are filled in with zeros. As a consequence, (n lsl 1) == 2 * n... but since we’re contemplating these values as bit patterns, it’s a useful observation that multiplying by two simply slides the bits around: perhaps we can rely on that later to unslide them as needed.
Do Now!
It’s worth pausing to confirm that lsl does have the desired semantics in all cases, especially when n is negative. Given that our numbers use two’s complement representation, what can we determine about the most significant bits of any negative number in our valid range of numbers? Given that, what can we deduce about the behavior of lsl?
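One way to experiment with this at the OCaml top level (a sketch only: OCaml’s native int is wider than 32 bits, so this illustrates the idea rather than reproducing exact 32-bit behavior):

(* Every negative number in our valid range has its top two bits set in
   two's complement, so shifting left by one keeps the sign bit set. *)
let () =
  Printf.printf "%d\n" ((-5) lsl 1);           (* prints -10 *)
  Printf.printf "%d\n" ((-1073741824) lsl 1)   (* prints -2147483648 *)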
The const_true and const_false values are whatever bit patterns we chose above; for legibility, we ought to define them as named constants in our compiler, rather than scatter magic numbers throughout our code.
Exercise
Now that we have defined how to compile constants, what output will we get from executing the program 15? What will we get from running the program true?
2.4 Defining arithmetic over our representations
Well clearly, even before arithmetic, the first operation we should redefine is print, or else we cannot trust any output of our program!
Do Now!
How many cases of output should we check for, if we’re coding defensively?
All even ints will be considered numbers. Only two odd numbers have meaning, as true and false; everything else is an error for now:
#include <stdio.h>

const int BOOL_TAG   = 0x00000001;
const int BOOL_TRUE  = ...; // These must be the same values
const int BOOL_FALSE = ...; // as chosen in compile.ml

int print(int val) {
  if ((val & BOOL_TAG) == 0) {  // val is even ==> number
    printf("%d", val >> 1);     // shift bits right to remove tag
  } else if (val == BOOL_TRUE) {
    printf("true");
  } else if (val == BOOL_FALSE) {
    printf("false");
  } else {
    printf("Unknown value: %#010x", val);  // print unknown val in hex
  }
  return val;
}
Now let’s consider arithmetic operations.
Do Now!
What currently is output if we execute add1(0)? Redefine add1 and sub1 to fix this.
Zero is represented as 0x00000000, so add1(0) currently produces 0x00000001, which is a representation of a boolean value! Our implementations of add1, and more generally addition and subtraction, need to take that representation into account.
Operation | Representation | Result      | Interpretation
n1 + n2   | 2n1 + 2n2      | 2(n1 + n2)  | n1 + n2
n1 - n2   | 2n1 - 2n2      | 2(n1 - n2)  | n1 - n2
Interestingly, our notion of addition and subtraction just works with our encoding, because multiplication distributes over addition. This is very convenient: no changes are needed to how we compile addition or subtraction. (Not quite! What important cases are we missing? There’s nothing we can do about them yet, but we should be aware that there are lingering problems.) And it informs us how we should revise add1 and sub1: they should add or subtract not 1, but rather the representation of 1, i.e. 2.
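As a sketch, emitting raw assembly strings for illustration (rather than whatever instruction datatype the compiler actually uses; imm stands for the already-compiled immediate operand):

(* Sketch: add1/sub1 over our encoding. Since n is represented as 2n,
   adding 1 means adding the representation of 1, namely 2. *)
let compile_add1 (imm : string) : string list =
  [ "mov eax, " ^ imm;   (* load the encoded operand *)
    "add eax, 2" ]       (* 2 is the representation of 1 *)

let compile_sub1 (imm : string) : string list =
  [ "mov eax, " ^ imm;
    "sub eax, 2" ]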
Do Now!
What currently is output if we execute 3 * 2? Redefine multiplication to fix this.
Following the same logic as above, 3 * 2 is encoded as 6 * 4, which produces 24 as a result, which is interpreted as 12: twice as big as the desired result. When we multiply, we wind up with an extra factor of two, since both operands have individually been multiplied by two. Our implementation of multiplication must shift the result right by one bit to compensate. The shr val, bits instruction shifts the given val to the right by the given number of bits.
Do Now!
Ok, so multiplication seems fixed. What happens if we execute 2 * (-1)? Why does that result appear?
We obtain 2147483644, or 0x7FFFFFFC. The trouble is that negative numbers are represented by having the MSB set to 1 (the “sign bit”), and shifting right with shr does not preserve that sign bit. Instead, we need an arithmetic shift right rather than a logical shift right: a dedicated instruction sar that behaves like shr, except it keeps the sign bit unchanged.
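Putting the pieces together, a sketch of multiplication in the same string-emitting style as the add1 sketch above:

(* Sketch: (2 * n1) * (2 * n2) = 4 * (n1 * n2), one factor of two too many,
   so divide by two afterwards, using sar to preserve the sign bit. *)
let compile_times (imm1 : string) (imm2 : string) : string list =
  [ "mov eax, " ^ imm1;
    "imul eax, " ^ imm2;   (* eax now holds 4 * (n1 * n2) *)
    "sar eax, 1" ]         (* arithmetic shift right: 2 * (n1 * n2) *)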
2.5 Defining comparisons over our representations
Comparisons are slightly finicky to define well. Here are two potential strategies, focusing just on the less-than operator:
First, we can just use the cmp operation, which compares its two operands arithmetically. (In fact, the cmp operator simply subtracts one argument from the other, just like sub does, and sets the same overflow, zero, and sign flags; in contrast to sub, cmp discards the result of the subtraction, and doesn’t overwrite its first argument.) We then use one of the conditional jump instructions, much like compiling an if-expression, and move the appropriate boolean constant into EAX as a result. This is straightforward, but potentially expensive, since conditional jump instructions can stall the processor if the processor “guesses wrong” about whether the branch will be taken or not.
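A sketch of this first strategy for less-than, again as raw assembly strings; the label names are made up, and a real compiler would need to generate fresh labels for each comparison:

(* Sketch: cmp plus a conditional jump. Comparing the encoded values 2*n1 and
   2*n2 directly is safe, because doubling preserves signed order for numbers
   that fit in 31 bits. Boolean constants are the all-ones choice from above. *)
let compile_less (imm1 : string) (imm2 : string) : string list =
  [ "mov eax, " ^ imm1;
    "cmp eax, " ^ imm2;
    "jl less_true";           (* signed less-than *)
    "mov eax, 0x7FFFFFFF";    (* const_false *)
    "jmp less_done";
    "less_true:";
    "mov eax, 0xFFFFFFFF";    (* const_true *)
    "less_done:" ]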
Second, there are arithmetic and bitwise tricks we can play to get the answer correct in nearly every case, though overflow will require special handling, as we’ll see shortly. For example, to determine if 6 < 15, we can simply compute their difference, or -9. Because this is a negative number, its MSB is 1, which is exactly the defining bit of true. We merely need to set all the remaining bits to BOOL_TRUE’s bits. If we chose a representation with all zeros, then and’ing our answer with BOOL_MASK will force all the remaining bits to zeros. If we chose a representation with all ones, then or’ing our answer with the complement of BOOL_MASK will force all the remaining bits to ones. A similar argument holds for checking if 15 < 6, because that subtraction yields a positive number, whose MSB is zero, which is the defining bit for false.
Unfortunately, this approach fails for large numbers. Checking if 1073741823 < -1073741824 fails with our representations, because we represent the former by 0x7ffffffe, and the latter by 0x80000000. Subtracting the latter from the former yields 0xfffffffe, or -2 (which in turn encodes -1 in our representation), whose MSB is 1, leading to the conclusion that 1073741823 < -1073741824 is true! The problem is that our subtraction overflowed 31 bits on these large numbers, and no longer makes sense under our representation. The solution, then, is to use sar to shift both operands right by one bit before doing the subtraction, so that overflow can’t spill into the sign bit. Using these two extra shifts affects our compilation: even though we’ve compiled both arguments to the comparison into immediates, we actually need to preprocess both of them, which means we need a place to stash them both. Practically speaking, we can use EAX for one and EBX for the other, so long as nothing else in our compilation is using EBX (which, at the moment, indeed nothing is).
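A sketch of this second, branch-free strategy, assuming the all-ones boolean representations from above (so BOOL_MASK is 0x80000000 and its complement is 0x7FFFFFFF):

(* Sketch: shift both operands right first so the subtraction cannot overflow,
   then smear the remaining bits so the sign bit becomes a boolean. *)
let compile_less_branchfree (imm1 : string) (imm2 : string) : string list =
  [ "mov eax, " ^ imm1;
    "sar eax, 1";            (* un-encode n1 *)
    "mov ebx, " ^ imm2;
    "sar ebx, 1";            (* un-encode n2 *)
    "sub eax, ebx";          (* MSB is 1 exactly when n1 < n2 *)
    "or eax, 0x7FFFFFFF" ]   (* forces 0xFFFFFFFF (true) or 0x7FFFFFFF (false) *)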
Do Now!
Complete this analysis for other comparison operations, being careful about overflow in all cases.
2.6 Defining logic operations over our representations
Logical and and or are very easy to define. Because we’ve confined the difference between the representations of true and false to a single bit (the MSB), and in all other bits they are the same, logical and and or are equivalent to bitwise and and or, for which we have dedicated assembly instructions.
Logical not is slightly trickier. We could use the not instruction to flip the MSB, but that would invert all the bits of the value, destroying the tag bit as well. Instead, we want an operation that flips only the one bit we care about:
Input bit | Flip? | Output
1         | 1     | 0
0         | 1     | 1
1         | 0     | 1
0         | 0     | 0
This is the truth table that defines the exclusive-or operation, and the xor instruction applies this operation bitwise. Suppose we designed a mask 0x80000000, whose only active bit is the MSB. Then xor EAX, 0x80000000 will negate only the MSB, and leave the rest unchanged, as desired.
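Sketches of the three logic operations under the same representation choices, in the same string-emitting style:

(* Sketch: bitwise instructions implement the logical operations directly,
   because true and false differ only in their MSB. *)
let compile_and (imm1 : string) (imm2 : string) : string list =
  [ "mov eax, " ^ imm1;
    "and eax, " ^ imm2 ]     (* bitwise and doubles as logical and *)

let compile_not (imm : string) : string list =
  [ "mov eax, " ^ imm;
    "xor eax, 0x80000000" ]  (* flip only the MSB, leaving the tag bit alone *)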
3 Overflow and Bogus Arguments
Unfortunately, our code currently does not prohibit bogus programs, such as 5 + true. Indeed, we don’t even have a mechanism yet for enforcing such things. Additionally, while our compilation of constants prohibits compile-time constants that overflow 31 bits, nothing prohibits us from adding numbers together until they overflow. Checking for such broken behaviors is the focus of next lecture.
4 Testing
As we’ve now added a new type of value to our language, it’s crucial to test thoroughly that we properly process and render each type of value consistently: that is, any operations that produce booleans should always produce valid boolean representations, and similarly for numeric operations. Moreover, operations should process their operands correctly. (We don’t yet have to worry about operations handling operands of the wrong types, as noted above.) Catching inconsistencies at this stage is substantially easier than it will be later, since we haven’t added too many new expression forms to the language.
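For instance, a handful of representative checks, one per value kind and operation family, might look like the following; the t helper (taking a test name, a program, and its expected output) is hypothetical:

(* Sketch: representative end-to-end tests, using a hypothetical helper
   t : string -> string -> string -> test. *)
let suite = [
  t "plain_num"   "15"        "15";
  t "plain_bool"  "true"      "true";
  t "add1_zero"   "add1(0)"   "1";
  t "times_neg"   "2 * (-1)"  "-2";
  t "less_limits" "1073741823 < -1073741824" "false";
]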