Copyright 2000 William D Clinger
Modified March 2002 by William D Clinger

****************************************************************

Outline of lecture.

    pitfalls of benchmarking
    importance of code improvement (optimization)
    examining the code generated by a compiler
        disassembly
        delayed branch instructions
        return addresses
        software timers
        write barriers
        register windows
    debugging is made more difficult by optimization
        reordering of computations
        elimination of variables
        transformation of representations
    classification of optimizations
        respectable code generation
        assembly-level optimization
        local optimization
        trace optimization
        loop optimization
        intraprocedural ("global") optimization
        interprocedural optimization

respectable code generation

    E1 B E2   (where B is a binary operator)
        rewrite E1 != E2 as ! (E1 == E2)
        recognize immediate operands:
            E1 == true, E1 == false.
            E1 == null.
            E1 == 'c'.
            E1 == 17, E1 < 17, et cetera.
            E1 + 17, E1 - 17.
        if neither E1 nor E2 has a side effect,
            then evaluate E2 before E1
    E1 [ E2 ]
        recognize immediate index: E1[3]
        if neither E1 nor E2 has a side effect,
            then evaluate E2 before E1
    E0 (E1, ..., En)
        if E0 is a variable and none of E1, ..., En have side effects,
            then E0 can be evaluated after E1, ..., En
    evaluation for control
    control optimization
    proper tail recursion

assembly-level optimization
    peephole optimization
    filling of branch delay slots
    inline allocation

local optimization
    parallel assignment optimization
    scheduling of load instructions, et cetera

trace optimization
    deletion of locally redundant instructions
    branch tensioning

loop optimization
    hoisting of loop-invariant computations (*)

intraprocedural optimization
    copy propagation
    common subexpression elimination
    dead code elimination
    dead variable elimination
    register allocation

interprocedural optimization
    lambda lifting and closure conversion
    register targeting
    inlining of small procedures
    constant propagation
    constant folding
    representation and subrange inference (*)

(*) these optimizations are not performed by Twobit

****************************************************************

Pitfalls of benchmarking.

Benchmarks are a popular way to compare the performance of two or more systems, whether they be microprocessors, compilers, network protocols, application programs, or whatever.  The idea is simple: devise a standard task, and measure how long it takes each system to accomplish the task.

Interpreting the results is not simple, though.  Most people would conclude that the system that completes the task in the least amount of time is probably the fastest.  That conclusion is usually incorrect.

Usually none of the systems is faster than all of the others.  A system may be faster on one particular benchmark, but slower on another.  The main thing you can conclude from a benchmark is how well a system performs on that particular benchmark.  Even this conclusion may be unwarranted, because the benchmark results may be contaminated by measurement noise.

One reason that people are interested in benchmarks is that they want to predict how a system will perform on programs that are similar to the benchmarks.  To do this well, you need to understand why the system performs as it does.
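A sketch of a simple way to collect such timings appears below: run the task several times and report the median, which is how most of the timings in these notes were obtained.  (The Benchmark interface and the use of wall-clock milliseconds are illustrative only; the timings reported below are CPU times.)

    // A minimal sketch of a benchmark harness that reports the median of
    // several runs.  Wall-clock milliseconds stand in for CPU time here.
    import java.util.Arrays;

    interface Benchmark { void run (); }

    class Harness {
        static double medianSeconds (Benchmark b, int runs) {
            double[] times = new double[runs];
            for (int i = 0; i < runs; i = i + 1) {
                long start = System.currentTimeMillis ();
                b.run ();                              // the standard task
                long stop = System.currentTimeMillis ();
                times[i] = (stop - start) / 1000.0;
            }
            Arrays.sort (times);
            return times[runs / 2];                    // median (runs should be odd)
        }
    }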
****************************************************************

Timings on a Sun Ultra 5, 2000.

    quirk18 -slow   Loop     28.65   ******
    quirk18 -slow   Tail     67.54   ***
    quirk18 -slow   Deep    181.35   *

    quirk18         Loop     15.15   ***********
    quirk18         Tail     45.83   ****
    quirk18         Deep    157.21   *

    quirk18 -opt    Loop     21.75   ********
    quirk18 -opt    Tail     34.25   *****
    quirk18 -opt    Deep    121.12   *

    Larceny         Loop     20.19   ********
    Larceny         Tail     20.95   ********
    Larceny         Deep     36.38   *****

    Larceny -opt    Loop      9.04   *******************
    Larceny -opt    Tail     13.56   ************
    Larceny -opt    Deep     36.38   *****

    Larceny -bad    Loop      4.53   *************************************
    Larceny -bad    Tail      4.54   *************************************
    Larceny -bad    Deep     43.36   ****

    Chez            Loop      6.07   ****************************
    Chez            Tail      6.08   ****************************
    Chez            Deep     36.37   *****

    Java            Loop      3.38   **************************************************
    Java            Tail    220.69   *
    Java            Deep    218.60   *

    (CPU times in seconds; the bars show speed relative to the fastest run)

****************************************************************

Measurement noise.

With multiuser operating systems such as Unix, the time required to run a benchmark depends upon the number of users and what they are doing.  To minimize this effect, people often measure CPU time instead of elapsed time, but even the CPU time can vary in response to other users.

With multitasking operating systems such as Unix, Windows 98, and MacOS, the time required to run a benchmark depends on the other tasks that the operating system is performing.  The CPU time varies less than the elapsed time, but even the CPU time can vary in response to other tasks.

The benchmark timings that I report above are CPU times that were obtained on a Sun Ultra 5 with no other users and no other user jobs, but there were several system processes that I could not easily disable.  In most cases I report the median of three runs that differed by less than one per cent.  For some benchmarks, I report the median of five runs.  The slowest timings represent a single run.

The complexity of modern microprocessors has created several new forms of measurement noise.  This noise is generated by data and instruction caches, dynamic branch prediction, speculative execution, and especially by branch target alignment and by instruction alignment.  The speed of an inner loop depends upon its alignment with respect to the processor's instruction cache and instruction buffer.

Most of this noise is quantized, which makes it easy to overlook.  Consider, for example, the timings obtained in Larceny for the Loops and TailCalls benchmarks.  With the default optimizations, Larceny appears to run these benchmarks in about 20 seconds.  With all optimizations, Larceny appears to run the Loops benchmark in about 9 seconds, and the TailCalls benchmark in about 13.5 seconds.

Most people would conclude from these timings that Twobit's optimizations are making a big difference, and would also conclude that tail calls are about 50% slower than loops in optimized Scheme code.  Both of these conclusions would be incorrect.

The remarkable thing about these timings is that all four are measuring the speed of an inner loop that has been compiled into exactly the same SPARC machine code, except that the inner loop for the Loops benchmark begins at offset 332 and the inner loop for the TailCalls benchmark begins at offset 444.  This loop executes 12 machine instructions on each iteration, and there are 500 million iterations.

By forcing a full garbage collection, I can force Larceny to move this SPARC machine code to a different address.  Depending on the alignment, I have observed the following timings for this code.
(All timings were observed on the same machine, and are rounded to the nearest half second.)

     7.5    9.0   13.5   20.0   20.5   21.0

These timings probably correspond to different alignments of the SPARC machine code with respect to an 8-word instruction buffer.  (The variation between the last three timings might be caused by other forms of measurement noise, so those timings might correspond to a single alignment.)

****************************************************************

SPARC code for the inner loop in question, as generated by Twobit/Larceny in 2000.

    static int natsum (int m, int n) {
        while (m > 0) {
            m = m - 1;
            n = n + 1;
        }
        return n;
    }

L1002
    .proc
    .proc-doc           #(.loop|20|23|26 #f 2 #f #f (m n))
    reg/op2imm/branchf  internal:branchf-fx>/imm,1,0,1012
    reg/op2imm/setreg   internal:fx-/imm,1,1,1
    reg/op2imm/setreg   internal:fx+/imm,2,1,2
    branch              1002,2
L1012
    reg/return          2

    332  andcc   %r1, 3, %g0            ! reg/op2imm/branchf ...
    336  be,a    #364
    340  subcc   %r1, 0, %g0
    344  or      %g0, %r1, %result
    348  or      %g0, 0, %argreg2
    352  or      %g0, 584, %tmp0        ! 0x248
    356  jmpl    %globals + 1384, %o7   ! exception
    360  add     %o7, -32, %o7
    364  ble,a   #460
    368  ld      [ %stkp + 4 ], %o7
    372  tsubcc  %r1, 4, %tmp0          ! reg/op2imm/setreg ...
    376  bvc,a   #404
    380  or      %g0, %tmp0, %r1
    384  or      %g0, %r1, %result
    388  or      %g0, 4, %argreg2
    392  or      %g0, 564, %tmp0        ! 0x234
    396  jmpl    %globals + 1384, %o7   ! exception
    400  add     %o7, -32, %o7
    404  taddcc  %r2, 4, %tmp0          ! reg/op2imm/setreg ...
    408  bvc,a   #436
    412  or      %g0, %tmp0, %r2
    416  or      %g0, %r2, %result
    420  or      %g0, 4, %argreg2
    424  or      %g0, 560, %tmp0        ! 0x230
    428  jmpl    %globals + 1384, %o7   ! exception
    432  add     %o7, -32, %o7
    436  subcc   %timer, 1, %timer      ! branch 1002,2
    440  bne,a   #336
    444  andcc   %r1, 3, %g0
    448  jmpl    %globals + 1376, %o7   ! timer-exception
    452  add     %o7, -124, %o7
    456  ld      [ %stkp + 4 ], %o7     ! reg/return 2
    460  jmpl    %o7 + 8, %g0
    464  or      %g0, %r2, %result

****************************************************************

These effects also explain the apparent superiority of unoptimized quirk18 code over "optimized" quirk18 code on the Loops benchmark.  If you look at the MacScheme machine assembly code, however, the "optimized" code certainly looks better.

****************************************************************

    static int natsum (int m, int n) {
        while (m > 0) {
            m = m - 1;
            n = n + 1;
        }
        return n;
    }

    unoptimized                 optimized

    L1003                       L1003
        reg     1                   reg     1
        setreg  3                   op2imm  fx>,0
        const   0
        setreg  4
        reg     3
        op2     fx>,4
        branchf 1004,31             branchf 1004,31
        reg     1                   reg     1
        setreg  3                   op2imm  fx-,1
        const   1
        setreg  4
        reg     3
        op2     fx-,4
        setreg  1                   setreg  1
        reg     2                   reg     2
        setreg  3                   op2imm  fx+,1
        const   1
        setreg  4
        reg     3
        op2     fx+,4
        setreg  2                   setreg  2
        branch  1003,31             branch  1003,31
    L1004                       L1004
        reg     2                   reg     2
        skip    1002,31             skip    1002,31
    L1002                       L1002
        return                      return

****************************************************************

Runtime checking for overflow.

Larceny's fx+ and fx- primitives normally perform a runtime test for overflow.  This and other runtime checking can be disabled.  Since Java does not test for arithmetic overflow, this provides a better comparison with Java on the loop and recursion benchmarks, but is unfair to Java on the benchmarks that should involve runtime checking of array subscripts and runtime tests for null pointers.  So Larceny without runtime checking is identified as Larceny -bad.

Larceny's fixnum primitives were a recent addition, and hadn't yet appeared in a general release.  The code shown below reveals an obvious performance bug in fx>.  So neither quirk18 nor this development version of Larceny is really generating respectable code.
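To make the cost of that overflow test concrete, here is a rough Java sketch of unchecked versus overflow-checked addition.  The tsubcc/taddcc and bvc instructions in the listing above are Larceny's inline version of the checked case, and the listing that follows shows the same loop with the checks disabled.  (The helper names below are hypothetical.)

    // Java's int arithmetic behaves like uncheckedAdd: it wraps silently
    // mod 2^32.  Larceny's fx+ behaves like checkedAdd, except that the
    // add and the test are a couple of inline SPARC instructions rather
    // than a method call.

    static int uncheckedAdd (int a, int b) {
        return a + b;                          // wraps silently on overflow
    }

    static int checkedAdd (int a, int b) {
        int sum = a + b;
        // Overflow occurred iff a and b have the same sign and the sum's
        // sign differs from theirs.
        if (((a ^ sum) & (b ^ sum)) < 0)
            throw new ArithmeticException ("fixnum overflow");
        return sum;
    }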
****************************************************************

SPARC code for the inner loop in question, as generated by Twobit/Larceny in 2000, with runtime checking disabled.

    static int natsum (int m, int n) {
        if (m > 0)
            return natsum (m - 1, n + 1);
        else
            return n;
    }

L1002
    .proc
    .proc-doc           #(.natsum|4 #f 2 #f #f (m n))
    reg/op2imm/branchf  internal:branchf-fx>/imm,1,0,1014
    reg/op2imm/setreg   internal:fx-/imm,1,1,1
    reg/op2imm/setreg   internal:fx+/imm,2,1,2
    branch              1002,2
L1014
    reg/return          2

    360  subcc   %r1, 0, %g0
    364  ble,a   #412
    368  ld      [ %stkp + 4 ], %o7
    372  ble,a   #412
    376  ld      [ %stkp + 4 ], %o7
    380  sub     %r1, 4, %r1
    384  add     %r2, 4, %r2
    388  subcc   %timer, 1, %timer
    392  bne,a   #364
    396  subcc   %r1, 0, %g0
    400  jmpl    %globals + 1376, %o7   ! timer-exception
    404  add     %o7, -48, %o7
    408  ld      [ %stkp + 4 ], %o7
    412  jmpl    %o7 + 8, %g0
    416  or      %g0, %r2, %result

****************************************************************

Timings on a SunBlade 100, March 2002.
Larceny v0.50, Chez Scheme v6.1, Sun Java HotSpot version 1.3.0.

    Loops
        quirk19      10.40   ******************
        Larceny       7.19   **************************
        Chez         14.30   *************
        Java          3.80   **************************************************

    TailCalls
        quirk19      31.30   *************
        Larceny       8.16   **************************************************
        Chez         14.29   *****************************
        Java        500.60   *

    DeepRecursions
        quirk19     193.27   *******
        Larceny      25.84   **************************************************
        Chez         26.48   *************************************************
        Java        497.90   ***

    TreeRecursions
        quirk19      20.20   ***************
        Larceny       6.24   **************************************************
        Chez          6.69   ***********************************************
        Java          6.73   **********************************************

    Integrate
        quirk19      83.63   ****
        Larceny      85.33   ****
        Chez         47.82   *******
        Java          6.70   **************************************************

    SelectionSort
        quirk19      31.53   *******************
        Larceny      26.55   **********************
        Chez         45.43   *************
        Java         11.80   **************************************************

    Permutations9:1
        quirk19       6.20   *************
        Larceny       1.61   **************************************************
        Chez          2.40   **********************************
        Java          8.83   *********

    Permutations8:50
        quirk19      12.03   ************
        Larceny       2.93   **************************************************
        Chez          3.69   ****************************************
        Java          9.47   ***************

    Sieve
        quirk19       6.33   **
        Larceny       0.47   *******************************
        Chez          0.29   **************************************************
        Java          2.96   *****

    (CPU times in seconds; within each benchmark, the bars show speed relative to the fastest system)

****************************************************************

Interpretation of benchmark results.

Benchmark results don't lie, but they're usually misleading.  In my opinion, benchmarking is worse than useless unless you take the trouble to understand what is going on.  That's why I like micro-benchmarks like the nine above.  They aren't representative, but they are small enough to be understood, and small enough to isolate important differences between compilers and runtime systems.

The Loops, TailCalls, and DeepRecursions benchmarks each consist of a nested pair of loops, each of which counts down to 0.  They differ in how the inner loop is expressed.  In Loops, the inner loop is a while loop.  In TailCalls, the inner loop is an equivalent self-tail-recursion.  In DeepRecursions, the inner loop is an equivalent non-tail-recursion.  The inner loop is executed 10000 times for 50000 iterations of the outer loop.
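For concreteness, here is one way the three inner loops might be written in Java, modeled on the natsum procedure shown earlier.  (The recursive variants are reconstructions for illustration, not the actual benchmark code.)

    // Loops: an ordinary while loop.
    static int natsumLoop (int m, int n) {
        while (m > 0) { m = m - 1; n = n + 1; }
        return n;
    }

    // TailCalls: an equivalent self-tail-recursion.
    static int natsumTail (int m, int n) {
        if (m > 0) return natsumTail (m - 1, n + 1);
        else return n;
    }

    // DeepRecursions: an equivalent non-tail-recursion, which must push
    // a stack frame for every iteration and pop them all at the end.
    static int natsumDeep (int m, int n) {
        if (m > 0) return 1 + natsumDeep (m - 1, n);
        else return n;
    }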
On Loops, Java is fastest, mainly because it performs arithmetic mod 2^32.

On TailCalls, Java is slowest because the designers of the JVM made it hard to implement proper tail recursion, and the implementors of the Java compiler didn't bother to perform tail call optimization even in easy cases.  Larceny and Chez Scheme each generate exactly the same code for the inner loop of Loops and of TailCalls; the small difference between the two Larceny timings is repeatable, so it is probably caused by a change in instruction alignment.

On DeepRecursions, Java performs no slower than on TailCalls, because it had already compiled the tail calls as a deep recursion.  The reason Java is so much slower than Larceny and Chez Scheme on deep recursions is that Sun's implementation of Java uses the SPARC's standard calling conventions, which use the SPARC's register windows.

There are only about 20 register windows on-chip.  (There's room for more, but Sun doesn't want Unix context switches to become too slow.)  Every 20 calls, the Java code will suffer a register window overflow exception.  This exception takes time to process, and it has to copy the register windows to main memory.  Every 20 returns, the Java code will suffer a register window underflow exception, which copies register windows back from main memory.

Furthermore, each register window consists of 96 bytes, which not coincidentally is the smallest stack frame allowed by the SPARC's standard calling conventions.  In Larceny, each stack frame is only 16 bytes, which is the smallest stack frame allowed by Larceny's calling conventions.  Thus Java performs 12 times as many loads and stores as Larceny, and must also process many overflow and underflow exceptions.  No wonder it's almost 20 times as slow.

TreeRecursions features the shallow recursions on which the SPARC's register windows really shine, so why doesn't Java outperform Larceny and Chez Scheme on this benchmark?  Because the Java compiler allocates a register window for every call.  The Scheme compilers allocate a stack frame only when necessary; they don't allocate a frame for the base case of the recursion, which accounts for about half of the calls.

Integrate is a floating point benchmark.  The Scheme systems do badly because Scheme's arithmetic is dynamically generic, and the Scheme compilers generate inline code only for small integer arithmetic.  When the operands are floating point numbers, the arithmetic operation calls library code that tests to learn that the operands are in floating point format, loads them into the floating point registers, allocates heap storage for the result, and stores the result into that newly allocated storage.  Java's arithmetic primitives know that their operands are in floating point format, and can leave most of the intermediate results in registers.

SelectionSort, whose code you will see later, is mainly a test of arrays.  All of the tested systems are safe, but the HotSpot JIT compiler probably does a better job of eliminating dynamic range checks on this benchmark.

Permutations9:1 is a pure storage allocation benchmark.  All of the allocated storage becomes part of the result, so no garbage is generated, and any attempts to collect garbage are a waste of time.  Larceny and Chez Scheme use highly bummed sequences of inline code to allocate storage.  Java's storage allocators are probably not as efficient, and Java's object constructors allow arbitrary initialization code, so they are less efficient than Scheme's cons.
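To see why inline allocation can be so cheap, here is a sketch of pointer-bump allocation, with the nursery modeled as a Java array.  (The class and its names are hypothetical; the real thing is a few inline machine instructions, with the free and limit pointers kept in registers.)

    // A sketch of inline "pointer bump" allocation of a pair, roughly
    // what a Scheme compiler generates inline for cons.
    final class Nursery {
        private final Object[] heap;       // the nursery
        private int free = 0;              // index of the next unused slot

        Nursery (int size) { heap = new Object[size]; }

        // Allocate and initialize a two-word pair; return its address.
        int cons (Object car, Object cdr) {
            if (free + 2 > heap.length)    // inline limit check
                collect ();                // slow path: garbage collection
            int p = free;
            heap[p] = car;                 // initialize both fields
            heap[p + 1] = cdr;
            free = free + 2;               // bump the free pointer
            return p;
        }

        private void collect () {
            throw new UnsupportedOperationException ("sketch only");
        }
    }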
Permutations8:50 is a garbage collection benchmark in which all permutations of 8 things are computed and then thrown away 50 times (after a 1-iteration delay, which makes this a much more severe test of garbage collection than benchmarks that discard the result immediately after it is computed).  HotSpot performs respectably, but HotSpot's generational garbage collector probably isn't as fast as the generational collectors in Larceny and Chez Scheme.

By the way, both C and C++ tend to perform poorly on this benchmark unless a custom allocator/deallocator is written, because the standard implementations of free and delete must touch every deallocated node.  Generational garbage collectors do not have to do that.

Sieve is a higher order benchmark that computes all the prime numbers.  It would take forever to print them out, and we'd run out of memory before that, so the benchmark just counts how many primes are less than 10000.  The Java version of this benchmark uses function objects.  Why is allocating and initializing a function object so much slower than creating a closure?  For the same reasons that Scheme's cons is faster than Java or C++ new.

****************************************************************

Boolean evaluation for control.

Most boolean expressions are evaluated for their effect on the control flow of a program, not to generate an actual boolean value.  The MacScheme machine does not provide any way to avoid generating an actual boolean value, but Larceny's peephole optimization usually eliminates the boolean.

    class Foo {
        static void main (String[] args) { }
        static void f (Foo x) {
            if (x == null)
                ;                   // skip
            else
                ;                   // skip
        }
    }

    reg     1
    op2imm  eq?,()
    branchf 1004,31

    subcc   %r1, 10, %g0
    bne,a   #28
    nop

****************************************************************

Control optimization.

Twobit contains an entire module, with over 600 lines of code, that optimizes conditional expressions.

Consider a switch statement in Java.  The compiler has several choices for the algorithm that will be used at runtime to select the appropriate case:

    sequential search
    binary search
    jump table

On a Sun Ultra 1, sequential search is fastest when there are fewer than 8 cases.  Binary search is fastest in most other cases, but a jump table is fastest if there are a large number of distinct cases that are dense within some interval.
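The three strategies can be pictured as equivalent Java source; the case labels and result values below are made up for illustration.

    // switch (x) { case 1: ... case 3: ... case 4: ... case 7: ... }

    // 1.  Sequential search: a chain of comparisons.
    static int bySequentialSearch (int x) {
        if (x == 1) return 10;
        else if (x == 3) return 30;
        else if (x == 4) return 40;
        else if (x == 7) return 70;
        else return -1;
    }

    // 2.  Binary search on the sorted case labels.
    static int byBinarySearch (int x) {
        if (x < 4) {
            if (x == 1) return 10;
            else if (x == 3) return 30;
            else return -1;
        } else {
            if (x == 4) return 40;
            else if (x == 7) return 70;
            else return -1;
        }
    }

    // 3.  Jump table: a dense table indexed by x, guarded by a range check.
    static final int[] jumpTable = { -1, 10, -1, 30, 40, -1, -1, 70 };
    static int byJumpTable (int x) {
        return (0 <= x && x <= 7) ? jumpTable[x] : -1;
    }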
****************************************************************

Assembly-level optimization.

    peephole optimization
    filling of branch delay slots
    inline allocation

These optimizations typically improve the performance of quirk19 code by 15-40%.

With a good code generator, peephole optimization becomes less important.  In Twobit, however, the existence of a good peephole optimizer allows the code generator to be sloppy about things like boolean evaluation for control.  This makes it easier for the code generator to do a good job on things that cannot be optimized by a peephole optimizer.

****************************************************************

Local optimization

    parallel assignment optimization
    scheduling of load instructions, et cetera

A local optimization works with the code for a basic block.  Local optimizations are important because they reduce some of the obvious inefficiencies in the generated code.  In the example below, note how the two movereg instructions (and the temporary registers 3 and 4) disappear.

    return natsum (m - 1, n + 1);

    before local optimization:

        reg      1
        op2imm   fx-,1
        setreg   3
        reg      2
        op2imm   fx+,1
        setreg   4
        movereg  3,1
        movereg  4,2
        global   tailcalls.natsum
        invoke   2

    after local optimization:

        reg      1,.m|19
        op2imm   fx-,1
        setreg   1
        reg      2,.n|19
        op2imm   fx+,1
        setreg   2
        global   tailcalls.natsum
        invoke   2

Local optimizations also improve the low-level parallelism of the code.  For example, the load instructions of many RISC machines require more than one cycle to complete, even if they hit the data cache.  An instruction that uses the value that is loaded by a load instruction may have to wait for the load to complete.  By moving load instructions as early as possible within a straight-line sequence of machine code, we make it less likely that subsequent instructions will have to wait for their data.

Scheduling optimizations can interact strongly with the dynamic scheduling that is performed by current superscalar processors.

****************************************************************

Trace optimization

    deletion of locally redundant instructions
    branch tensioning

A trace is a finite sequence of basic blocks.

Simple code generators tend to produce redundant instructions, such as load instructions that restore a register whose contents will never be used.  They also tend to produce branch instructions that branch to other branch instructions, or branch to a label that immediately follows the branch.  Trace optimizations can clean up these inefficiencies.

Trace optimization is also important for scheduling, especially on very long instruction word (VLIW) architectures such as the Intel Itanium.  Trace optimization can be more effective in a just-in-time (JIT) compiler than in a conventional compiler, because a JIT compiler can optimize traces that cross module boundaries, as when a program calls a library routine.

****************************************************************

Loop optimization

    hoisting of loop-invariant computations

Most programs spend most of their time in a few inner loops, so it is important to make those loops run as fast as possible.

    static void selectionsort (int[] a, int n) {
        int i = 0;
        int j = 0;
        int k = 0;
        // Loop invariant:
        //     0 <= i <= n, and
        //     the contents of the array are a permutation of
        //     the original contents, and
        //     a[0] through a[i-1] are in nondecreasing order, and
        //     a[0] through a[i-1] are less than or equal to every
        //     element in a[i] through a[n-1].
        while (i < n) {
            j = i;
            k = i;
            // Loop invariant:
            //     i <= j < n, and
            //     i <= k <= n, and
            //     a[j] is least among a[i] through a[k-1].
            while (k < n) {
                if (a[j] > a[k])
                    j = k;
                else;
                k = k + 1;
            }
            swap (a, i, a, j);
            i = i + 1;
        }
    }

The compiler should keep a, j, k, and n in registers throughout the innermost loop.  It should also keep the length of a in a register, since that length is used implicitly by a dynamic range check (assuming a[j] and a[k] are evaluated safely).

If the compiler can prove that n is the length of the array a, that i >= 0, and that the first two parts of the loop invariant shown in the comment really are loop invariant, then the compiler can eliminate the runtime range checks without compromising safety.

Some compilers would replace j, k, and n by equivalent variables throughout the innermost loop, because this eliminates implicit multiplications by 4 inside the loop.  This optimization is called reduction in strength.

    int j4 = 4 * j;
    int k4 = 4 * k;
    int n4 = 4 * n;
    while (k4 < n4) {
        if ((* (a + j4)) > (* (a + k4)))
            j4 = k4;
        else;
        k4 = k4 + 4;
    }
    j = j4 / 4;
    k = k4 / 4;

With Larceny's representation of integers, there is no need for this.
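That remark deserves a word of explanation.  Larceny represents a fixnum k as the machine word 4*k, with two low-order tag bits of zero, so an index register already holds the byte offset that reduction in strength would have computed.  The sketch below (with hypothetical helper names) spells this out.

    static int toFixnum (int k)   { return k << 2; }   // fixnum k is the word 4*k
    static int fromFixnum (int f) { return f >> 2; }   // untag

    // An index register already holds 4*j, which is exactly the byte
    // offset that the strength-reduced j4 would have held, so element
    // addresses can be formed without any multiplication:
    static int elementOffset (int fixnumIndex) {
        return fixnumIndex;                 // already 4*j bytes from the base
    }

    // Incrementing the index by one is an ordinary add of the constant 4,
    // which is why the SPARC code shown earlier adds and subtracts 4.
    static int nextIndex (int fixnumIndex) {
        return fixnumIndex + toFixnum (1);
    }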
****************************************************************

Intraprocedural optimization

    copy propagation
    common subexpression elimination
    dead code elimination
    dead variable elimination
    register allocation

Intraprocedural optimizations are restricted to a single procedure.  They are also known as "global" optimizations.

    static List revloop (List x, int n, List y) {
        if (n == 0)
            return y;
        else
            return revloop (x.tail, n - 1, List.cons (x.head, y));
    }

For safety, the compiler must check to ensure that x is not null before it evaluates x.tail and x.head.  Without optimization, the compiler would check twice.  With common subexpression elimination, the compiler would check only once.  If x is not null when x.tail is evaluated, then it won't be null when x.head is evaluated.

Common subexpression elimination is also important for array addressing.  For example, it eliminates the second implicit multiplication in

    a[i] = a[i] + 1;

Register allocation is important because it reduces the number of load and store instructions, and because it can sometimes eliminate the need to allocate a stack frame.

****************************************************************

Interprocedural optimization

    lambda lifting
    register targeting
    inlining of small procedures
    constant propagation
    constant folding
    representation and subrange inference

Interprocedural optimizations can operate across procedure boundaries.

Lambda lifting and closure conversion are optimizations that reduce the number of closures that need to be created.  In effect, they rewrite the program to increase the number of leaf procedures.

Register targeting reduces the number of register-to-register move instructions by heuristic selection of the registers in which arguments are passed.

****************************************************************

Inlining, constant propagation, and constant folding allow the compiler to compile

    class PseudoRandom {
        ...
        static int random (PseudoRandom generator) {
            int r = mod (multiplier * generator.rand, divisor);
            generator.rand = r;
            return r;
        }
        static final int seed = 21;
        static final int multiplier = 1731;
        static final int divisor = 4001;
        int rand;
        static int mod (int m, int n) {
            return m - n * (m / n);
        }
    }

as if it were

    class PseudoRandom {
        ...
        // Returns a pseudo-random number.
        static int random (PseudoRandom generator) {
            int r;
            {
                int m = 1731 * generator.rand;
                int n = 4001;
                r = m - n * (m / n);
            }
            generator.rand = r;
            return r;
        }
        static final int seed = 21;
        static final int multiplier = 1731;
        static final int divisor = 4001;
        int rand;
        static int mod (int m, int n) {
            return m - n * (m / n);
        }
    }

Intraprocedural copy propagation and dead code elimination can then convert this into

    class PseudoRandom {
        ...
        // Returns a pseudo-random number.
        static int random (PseudoRandom generator) {
            int r;
            {
                int m = 1731 * generator.rand;
                r = m - 4001 * (m / 4001);
            }
            generator.rand = r;
            return r;
        }
        static final int seed = 21;
        static final int multiplier = 1731;
        static final int divisor = 4001;
        int rand;
        static int mod (int m, int n) {
            return m - n * (m / n);
        }
    }

****************************************************************

Interprocedural representation and subrange inference reduce the number of runtime safety checks.  For example, a Java compiler must generate code that checks every array reference a[i] to ensure that a is not null and that 0 <= i < a.length.  Scheme is dynamically typed, so a Scheme compiler must also check that a is an array.
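Writing those checks out by hand shows what the compiler must otherwise generate for a statement like a[i] = a[i] + 1.  (The helper below is illustrative; a real compiler generates this inline.)

    // The checks and address arithmetic implicit in  a[i] = a[i] + 1;
    static void incrementElement (int[] a, int i) {
        if (a == null)                       // null check
            throw new NullPointerException ();
        if (i < 0 || i >= a.length)          // range check
            throw new ArrayIndexOutOfBoundsException (i);
        // address of a[i] = base of a + header size + 4*i  (the implicit
        // multiplication mentioned above)
        int x = a[i];                        // checked load
        a[i] = x + 1;                        // checked store; common
                                             // subexpression elimination and
                                             // subrange inference let the
                                             // compiler do all of this once
    }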
Every single one of these runtime checks can be eliminated from the following program without compromising safety.

    class SelectionSort {

        static void main (String[] args) {
            int n = Integer.parseInt (args[0]);
            benchmark (n);
        }

        static void benchmark (int n) {
            int[] a = testCase (n);
            selectionsort (a, n);
        }

        static int[] testCase (int n) {
            int[] a = new int [n];
            PseudoRandom generator = PseudoRandom.create ();
            for (int i = 0; i < n; i = i + 1)
                a[i] = PseudoRandom.random (generator);
            return a;
        }

        static void selectionsort (int[] a, int n) {
            int i = 0;
            int j = 0;
            int k = 0;
            while (i < n) {
                j = i;
                k = i;
                while (k < n) {
                    if (a[j] > a[k])
                        j = k;
                    else;
                    k = k + 1;
                }
                swap (a, i, a, j);
                i = i + 1;
            }
        }

        static void swap (int[] a, int i, int[] b, int j) {
            int temp = a[i];
            a[i] = b[j];
            b[j] = temp;
        }
    }

****************************************************************