Lecture 5: Runtime representations for multiple value types
We ought to be able to handle more kinds of values than just numbers. Let’s try to add at least one more type of value to our language. Doing so will trigger a large number of problems, which will all need to be resolved to make things work again.
1 Growing the language: adding boolean values
When we introduced conditionals and branching control-flow last time, we were a bit dissatisfied with the semantics of “branch if it’s nonzero or not”: we’d much rather branch on genuine boolean values, which our language does not yet have.
Do Now!
What new primitive operations should we add to operate with booleans? What should happen if we give numeric arguments to boolean operations? Or vice versa?
Exercise
We currently represent numbers as themselves. How can we represent booleans?
Exercise
How can we prevent things from going wrong?
The key challenge, from which all the other problems today will arise, is that our current encoding of numbers is too big. We’ve assigned a numeric meaning to every 32-bit pattern, and therefore used them all up: there are no bit patterns left to represent boolean values. Somehow, we need to carve out space for our booleans. As soon as we do that, though:
Not all operations should operate on all bit patterns, which means we have to ensure that operations can distinguish valid operands from invalid ones.
Not all patterns can be considered numbers, which means we have to ensure that our compilation of numbers makes sense in this new extended setup.
It becomes apparent that just supporting booleans is shortsighted: we’d like to be able to broaden these techniques to other datatypes later on, such as pointers, tuples, structures, closures, or others.
A lot of work, for just two values! But that work forms a blueprint to follow when growing the language further.
1.1 Wait, what about static typing?
Surely we can resolve some if not all of these problems by designing a type-checker for our language? After all, if we can determine statically that the program never misuses an operation with the wrong type of operands, then we don’t need to distinguish those types of values at runtime?
Sadly, not quite: while some operations will be simplified if we do this, there are still features we’d like to have in our compiler, such as garbage collection, that will inevitably need to treat some types of values differently than others. Without getting into the details now (we’ll save that for later), learning how to deal with distinguishing values now, while the language is still small, will be good practice when the language gets more complicated soon.
1.2 The new concrete syntax
Today’s language changes themselves are syntactically quite simple: only two new expression forms, and a bunch of new operators. Notice that we have operators that consume and return booleans, and also operators that consume numbers and return booleans – language features interact.
1.3 Examples
Exercise
Develop a suite of test examples for using booleans. Be thorough, and consider as many edge cases as possible.
1.4 Enhancing the abstract syntax
The abstract syntax changes are similarly straightforward to the concrete ones.
type prim1 = ... | Not
type prim2 = ... | And | Or | Less | Greater | LessEq | GreaterEq | Eq | Ne
type 'a expr = ... | EBool of bool * 'a
Like numbers, we will consider EBool _ values to be immediates for the purposes of ANF: no further computation on them is needed.
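For instance, the immediacy check in our ANF transformation presumably just gains one more case. A minimal sketch, assuming the expr type above (the helper name is_imm is hypothetical):

(* Sketch: which expressions are already immediate, for ANF purposes? *)
let is_imm (e : 'a expr) : bool =
  match e with
  | ENumber _ -> true
  | EBool _ -> true   (* booleans, like numbers, need no further evaluation *)
  | _ -> false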
2 Representation choices
If all of our bit-patterns so far are used for numbers, then when we choose some patterns to reinterpret as booleans, we must have some way of distinguishing boolean (or other) patterns from numeric ones.
A note about notation: since we will be discussing actual bit-level representations today, it will be useful to use binary or hexadecimal notation for numbers, rather than the more common decimal notation. For example,
0b00000000_00000000_00000000_00001010 = 0x0000000A = 10
For clarity, if needed, we’ll include underscores to separate the bytes in the binary representation. Further, if there are bits that we don’t care about, we will mark them as an x in the binary representation.
2.1 Attempt 1: Why limit ourselves to 32-bits?
If all the patterns in a 32-bit word are used up...let’s use more words.
Suppose we encode all our values as two words, where the first word is a tag word describing what type of value comes next: say, a tag of 0 meaning “the next word is a number”, and a tag of 1 meaning “the next word is a boolean”.
Do Now!
List at least three consequences of this approach.
On the one hand, this approach is certainly future-proof: we have room to distinguish 4 billion distinct types of data in our programs! But on the other, consider how we would have to implement any of our operations. Because our values are now larger than our word size, they won’t fit into registers any more, which in turn means that our invariant of “the answer always goes in EAX” won’t work any longer. Instead, we will need to always put our answers on the stack, and then put the address of the first word of the answer into EAX: i.e., the actual answer can be found at [EAX] and [EAX - 4]. This in turn means that every single operation we want to perform will require at least two memory reads and two memory writes, a steep price to pay for every operation.
2.2 Attempt 2: We don’t need 32 bits for a tag
If having 32 tag bits is overkill, let’s try the minimum needed: could we get away with a single tag bit? We can pick any bit we like for this, but for reasons that will become apparent soon, let’s choose the least-significant bit (LSB) to denote numbers versus booleans:
LSB = 0 implies the value is a number
LSB = 1 implies the value is a boolean
This may still feel like overkill: we have 2 billion bit patterns reserved just for booleans, and we’ve given up half of our available space for numbers. In practice, we’ll accept the loss of range on numbers for now, and we’ll refine our notion of what it means to “tag a boolean” so that we can use the rest of that reserved space for other types of values, later.
2.2.1 Booleans
We now need to choose representations for true and false themselves. Since it’s very likely that tags are going to get more complex later, we probably want to reserve several low-order bits for future tag use, so let’s distinguish true from false by the most-significant bit (MSB):
MSB = 0 encodes false
MSB = 1 encodes true
In other words, our true and false values will therefore be 0b1xxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxxx1 and 0b0xxxxxxx_xxxxxxxx_xxxxxxxx_xxxxxxx1, respectively. We can choose whatever bits we want in between: two obvious choices would be either all zeros or all ones.
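For concreteness, here is one possible pair of choices, filling the middle bits with ones; these exact values are an assumption, not forced by anything above, and Const is the same constructor used in compile_imm below:

(* One concrete choice: all middle bits set to one.
   The MSB distinguishes true from false; the LSB marks the value as a boolean. *)
let const_true  = Const(0xFFFFFFFF)  (* 0b11111111_11111111_11111111_11111111 *)
let const_false = Const(0x7FFFFFFF)  (* 0b01111111_11111111_11111111_11111111 *)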
2.2.2 Numbers
How should we represent numbers, then? Since we’ve decided that an LSB of zero indicates a number, we effectively have all the bit-patterns-formerly-interpreted-as-even-numbers available to us as numbers. This lends itself to a convenient encoding: we’ll simply represent the number n as the bit-pattern that used to represent 2n:
Number | Representation
1      | 0b00000000_00000000_00000000_00000010
6      | 0b00000000_00000000_00000000_00001100
42     | 0b00000000_00000000_00000000_01010100
This is clearly a one-to-one mapping, so it’s a faithful encoding; let’s see if it’s a useful one, too.
2.3 Compiling constants
Compiling our new constants is fairly straightforward. There is a twist: we need to ensure at compile time that any numeric constants we have in our source program aren’t inherently overflowing:
let compile_imm e env =
  match e with
  | ENumber(n, _) ->
    (* valid range: -(2^30) through 2^30 - 1 *)
    if n > 1073741823 || n < -1073741824 then
      failwith ("Integer overflow: " ^ (string_of_int n))
    else
      Const(n lsl 1)
  | EBool(true, _) -> const_true
  | EBool(false, _) -> const_false
  ...
The lsl operator logically shifts the bits of its first operand left by the number of places given by its second operand; the newly vacated bits are filled in with zeros. As a consequence, (n lsl 1) == 2 * n... but since we’re contemplating these values as bit patterns, it’s a useful observation that multiplying by two simply slides the bits around: perhaps we can rely on that later to unslide them as needed.
Do Now!
It’s worth pausing to confirm that lsl does have the desired semantics in all cases, especially when n is negative. Given that our numbers use two’s complement representation, what can we determine about the most significant bits of any negative number in our valid range of numbers? Given that, what can we deduce about the behavior of lsl?
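One way to experiment with this at the OCaml top level (a sketch only: OCaml’s native int is wider than 32 bits, so this illustrates the idea rather than reproducing exact 32-bit behavior):

(* Every negative number in our valid range has its top two bits set in
   two's complement, so shifting left by one keeps the sign bit set. *)
let () =
  Printf.printf "%d\n" ((-5) lsl 1);           (* prints -10 *)
  Printf.printf "%d\n" ((-1073741824) lsl 1)   (* prints -2147483648 *)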
The const_true and const_false values are whatever bit patterns we chose above; for legibility, we ought to define them as named constants in our compiler, rather than scatter magic numbers throughout our code.
Exercise
Now that we have defined how to compile constants, what output will we get from executing the program 15? What will we get from running the program true?
2.4 Defining arithmetic over our representations
Well clearly, even before arithmetic, the first operation we should redefine is print, or else we cannot trust any output of our program!
Do Now!
How many cases of output should we check for, if we’re coding defensively?
All even ints will be considered numbers. Only two odd numbers have meaning, as true and false; everything else is an error for now:
#include <stdio.h>

const int BOOL_TAG   = 0x00000001;
const int BOOL_TRUE  = ...; // These must be the same values
const int BOOL_FALSE = ...; // as chosen in compile.ml

int print(int val) {
  if ((val & BOOL_TAG) == 0) {  // val is even ==> number
    printf("%d", val >> 1);     // shift bits right to remove tag
  } else if (val == BOOL_TRUE) {
    printf("true");
  } else if (val == BOOL_FALSE) {
    printf("false");
  } else {
    printf("Unknown value: %#010x", val);  // print unknown val in hex
  }
  return val;
}
Now let’s consider arithmetic operations.
Do Now!
What currently is output if we execute add1(0)? Redefine add1 and sub1 to fix this.
Zero is represented as 0x00000000, so add1(0) currently produces 0x00000001, which is a representation of a boolean value! Our implementations of add1, and more generally addition and subtraction, need to take that representation into account.
Operation | Representation | Result      | Interpretation
n1 + n2   | 2n1 + 2n2      | 2(n1 + n2)  | n1 + n2
n1 - n2   | 2n1 - 2n2      | 2(n1 - n2)  | n1 - n2
Interestingly, our notion of addition and subtraction just works with our encoding, because multiplication distributes over addition. This is very convenient: no changes are needed to how we compile addition or subtraction. (Not quite! What important cases are we missing? There’s nothing we can do about them yet, but we should be aware that there are lingering problems.) And it informs us how we should revise add1 and sub1: they should add or subtract not 1, but rather the representation of 1, i.e. 2.
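As a sketch, emitting raw assembly strings for illustration (rather than whatever instruction datatype the compiler actually uses; imm stands for the already-compiled immediate operand):

(* Sketch: add1/sub1 over our encoding. Since n is represented as 2n,
   adding 1 means adding the representation of 1, namely 2. *)
let compile_add1 (imm : string) : string list =
  [ "mov eax, " ^ imm;   (* load the encoded operand *)
    "add eax, 2" ]       (* 2 is the representation of 1 *)

let compile_sub1 (imm : string) : string list =
  [ "mov eax, " ^ imm;
    "sub eax, 2" ]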
Do Now!
What currently is output if we execute 3 * 2? Redefine multiplication to fix this.
Following the same logic as above, 3 * 2 is encoded as 6 * 4, which produces 24 as a result, which is interpreted as 12: twice as big as the desired result. When we multiply, we wind up with an extra factor of two, since both operands have individually been multiplied by two. Our implementation of multiplication must shift the result right by one bit to compensate. The shr val, bits instruction shifts the given val to the right by the given number of bits.
Do Now!
Ok, so multiplication seems fixed. What happens if we execute 2 * (-1)? Why does that result appear?
We obtain 2147483644, or 0x7FFFFFFC. The trouble is that negative numbers are represented by having the MSB set to 1 (the “sign bit”), and shifting right with shr does not preserve that sign bit. Instead, we need an arithmetic shift right rather than a logical shift right: a dedicated instruction sar that behaves like shr, except it keeps the sign bit unchanged.
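Putting the pieces together, a sketch of multiplication in the same string-emitting style as the add1 sketch above:

(* Sketch: (2 * n1) * (2 * n2) = 4 * (n1 * n2), one factor of two too many,
   so divide by two afterwards, using sar to preserve the sign bit. *)
let compile_times (imm1 : string) (imm2 : string) : string list =
  [ "mov eax, " ^ imm1;
    "imul eax, " ^ imm2;   (* eax now holds 4 * (n1 * n2) *)
    "sar eax, 1" ]         (* arithmetic shift right: 2 * (n1 * n2) *)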
2.5 Defining comparisons over our representations
Comparisons are slightly finicky to define well. Here are two potential strategies, focusing just on the less-than operator:
First, we can just use the cmp operation, which compares its two operands arithmetically. (In fact, the cmp operator simply subtracts one argument from the other, just like sub does, and sets the same overflow, zero, and sign flags; in contrast to sub, cmp discards the result of the subtraction, and doesn’t overwrite its first argument.) We then use one of the conditional jump instructions, much like compiling an if-expression, and move the appropriate boolean constant into EAX as a result. This is straightforward, but potentially expensive, since conditional jump instructions can stall the processor if the processor “guesses wrong” about whether the branch will be taken or not.
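A sketch of this first strategy for less-than, again as raw assembly strings; the label names are made up, and a real compiler would need to generate fresh labels for each comparison:

(* Sketch: cmp plus a conditional jump. Comparing the encoded values 2*n1 and
   2*n2 directly is safe, because doubling preserves signed order for numbers
   that fit in 31 bits. Boolean constants are the all-ones choice from above. *)
let compile_less (imm1 : string) (imm2 : string) : string list =
  [ "mov eax, " ^ imm1;
    "cmp eax, " ^ imm2;
    "jl less_true";           (* signed less-than *)
    "mov eax, 0x7FFFFFFF";    (* const_false *)
    "jmp less_done";
    "less_true:";
    "mov eax, 0xFFFFFFFF";    (* const_true *)
    "less_done:" ]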
Second, there are arithmetic and bitwise tricks we can play to get the answer correct in nearly every case, though overflow will require special handling, as we’ll see shortly. For example, to determine if 6 < 15, we can simply compute their difference, or -9. Because this is a negative number, its MSB is 1, which is exactly the defining bit of true. We merely need to set all the remaining bits to BOOL_TRUE’s bits. If we chose a representation with all zeros, then and’ing our answer with BOOL_MASK will force all the remaining bits to zeros. If we chose a representation with all ones, then or’ing our answer with the complement of BOOL_MASK will force all the remaining bits to ones. A similar argument holds for checking if 15 < 6, because that subtraction yields a positive number, whose MSB is zero, which is the defining bit for false.
Unfortunately, this approach fails for large numbers. Checking if 1073741823 < -1073741824 fails with our representations, because we represent the former by 0x7ffffffe, and the latter by 0x80000000. Subtracting the latter from the former yields 0xfffffffe, or -2 (which in turn encodes -1 in our representation), whose MSB is 1, leading to the conclusion that 1073741823 < -1073741824 is true! The problem is that our subtraction overflowed 31 bits on these large numbers, and no longer makes sense under our representation. The solution, then, is to use sar to shift both operands right by one bit before doing the subtraction, so that overflow can’t spill into the sign bit. Using these two extra shifts affects our compilation: even though we’ve compiled both arguments to the comparison into immediates, we actually need to preprocess both of them, which means we need a place to stash them both. Practically speaking, we can use EAX for one and EBX for the other, so long as nothing else in our compilation is using EBX (which, at the moment, indeed nothing is).
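A sketch of this second, branch-free strategy, assuming the all-ones boolean representations from above (so BOOL_MASK is 0x80000000 and its complement is 0x7FFFFFFF):

(* Sketch: shift both operands right first so the subtraction cannot overflow,
   then smear the remaining bits so the sign bit becomes a boolean. *)
let compile_less_branchfree (imm1 : string) (imm2 : string) : string list =
  [ "mov eax, " ^ imm1;
    "sar eax, 1";            (* un-encode n1 *)
    "mov ebx, " ^ imm2;
    "sar ebx, 1";            (* un-encode n2 *)
    "sub eax, ebx";          (* MSB is 1 exactly when n1 < n2 *)
    "or eax, 0x7FFFFFFF" ]   (* forces 0xFFFFFFFF (true) or 0x7FFFFFFF (false) *)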
Do Now!
Complete this analysis for other comparison operations, being careful about overflow in all cases.
2.6 Defining logic operations over our representations
Logical and and or are very easy to define. Because we’ve confined the difference between the representations of true and false to a single bit (the MSB), and in all other bits they are the same, logical and and or are equivalent to bitwise and and or, for which we have dedicated assembly instructions.
Logical not is slightly trickier. We could use the not instruction to flip the MSB, but that would invert all the bits of the value, destroying the tag bit as well. Instead, we want an operation that flips only the one bit we care about:
Input bit | Flip? | Output
1         | 1     | 0
0         | 1     | 1
1         | 0     | 1
0         | 0     | 0
This is the truth table that defines the exclusive-or operation, and the xor instruction applies this operation bitwise. Suppose we designed a mask 0x80000000, whose only active bit is the MSB. Then xor EAX, 0x80000000 will negate only the MSB, and leave the rest unchanged, as desired.
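Sketches of the three logic operations under the same representation choices, in the same string-emitting style:

(* Sketch: bitwise instructions implement the logical operations directly,
   because true and false differ only in their MSB. *)
let compile_and (imm1 : string) (imm2 : string) : string list =
  [ "mov eax, " ^ imm1;
    "and eax, " ^ imm2 ]     (* bitwise and doubles as logical and *)

let compile_not (imm : string) : string list =
  [ "mov eax, " ^ imm;
    "xor eax, 0x80000000" ]  (* flip only the MSB, leaving the tag bit alone *)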
3 Overflow and Bogus Arguments
Unfortunately, our code currently does not prohibit bogus programs, such as 5 + true. Indeed, we don’t even have a mechanism yet for enforcing such things. Additionally, while our compilation of constants prohibits compile-time constants that overflow 31 bits, nothing prohibits us from adding numbers together until they overflow. Checking for such broken behaviors is the focus of next lecture.
4 Testing
As we’ve now added a new type of value to our language, it’s crucial to test thoroughly that we properly process and render each type of value consistently: that is, any operations that produce booleans should always produce valid boolean representations, and similarly for numeric operations. Moreover, operations should process their operands correctly. (We don’t yet have to worry about operations handling operands of the wrong types, as noted above.) Catching inconsistencies at this stage is substantially easier than it will be later, since we haven’t added too many new expression forms to the language.
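For instance, a handful of representative checks, one per value kind and operation family, might look like the following; the t helper (taking a test name, a program, and its expected output) is hypothetical:

(* Sketch: representative end-to-end tests, using a hypothetical helper
   t : string -> string -> string -> test. *)
let suite = [
  t "plain_num"   "15"        "15";
  t "plain_bool"  "true"      "true";
  t "add1_zero"   "add1(0)"   "1";
  t "times_neg"   "2 * (-1)"  "-2";
  t "less_limits" "1073741823 < -1073741824" "false";
]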