Copyright 2000 William D Clinger
Modified March 2002 by William D Clinger

****************************************************************

Outline of lecture.

    pitfalls of benchmarking
    importance of code improvement (optimization)
    examining the code generated by a compiler
        disassembly
        delayed branch instructions
        return addresses
        software timers
        write barriers
        register windows
    debugging is made more difficult by optimization
        reordering of computations
        elimination of variables
        transformation of representations
    classification of optimizations
        respectable code generation
        assembly-level optimization
        local optimization
        trace optimization
        loop optimization
        intraprocedural ("global") optimization
        interprocedural optimization

respectable code generation

    E1 B E2   (where B is a binary operator)
        rewrite E1 != E2 as ! (E1 == E2)
        recognize immediate operands:
            E1 == true, E1 == false.
            E1 == null.
            E1 == 'c'.
            E1 == 17, E1 < 17, et cetera.
            E1 + 17, E1 - 17.
        if neither E1 nor E2 has a side effect,
            then evaluate E2 before E1
    E1 [ E2 ]
        recognize immediate index: E1[3]
        if neither E1 nor E2 has a side effect,
            then evaluate E2 before E1
    E0 (E1, ..., En)
        if E0 is a variable and none of E1, ..., En have side effects,
            then E0 can be evaluated after E1, ..., En
    evaluation for control
    control optimization
    proper tail recursion

assembly-level optimization
    peephole optimization
    filling of branch delay slots
    inline allocation

local optimization
    parallel assignment optimization
    scheduling of load instructions, et cetera

trace optimization
    deletion of locally redundant instructions
    branch tensioning

loop optimization
    hoisting of loop-invariant computations (*)

intraprocedural optimization
    copy propagation
    common subexpression elimination
    dead code elimination
    dead variable elimination
    register allocation

interprocedural optimization
    lambda lifting and closure conversion
    register targeting
    inlining of small procedures
    constant propagation
    constant folding
    representation and subrange inference (*)

(*) these optimizations are not performed by Twobit

****************************************************************

Pitfalls of benchmarking.

Benchmarks are a popular way to compare the performance of two or more systems, whether they be microprocessors, compilers, network protocols, application programs, or whatever.  The idea is simple: devise a standard task, and measure how long it takes each system to accomplish the task.

Interpreting the results is not simple, though.  Most people would conclude that the system that completes the task in the least amount of time is probably the fastest.  That conclusion is usually incorrect.

Usually none of the systems is faster than all of the others.  A system may be faster on one particular benchmark, but slower on another.  The main thing you can conclude from a benchmark is how well a system performs on that particular benchmark.  Even this conclusion may be unwarranted, because the benchmark results may be contaminated by measurement noise.

One reason that people are interested in benchmarks is that they want to predict how a system will perform on programs that are similar to the benchmarks.  To do this well, you need to understand why the system performs as it does.
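A sketch of a simple way to collect such timings appears below: run the task several times and report the median, which is how most of the timings in these notes were obtained.  (The Benchmark interface and the use of wall-clock milliseconds are illustrative only; the timings reported below are CPU times.)

    // A minimal sketch of a benchmark harness that reports the median of
    // several runs.  Wall-clock milliseconds stand in for CPU time here.
    import java.util.Arrays;

    interface Benchmark { void run (); }

    class Harness {
        static double medianSeconds (Benchmark b, int runs) {
            double[] times = new double[runs];
            for (int i = 0; i < runs; i = i + 1) {
                long start = System.currentTimeMillis ();
                b.run ();                              // the standard task
                long stop = System.currentTimeMillis ();
                times[i] = (stop - start) / 1000.0;
            }
            Arrays.sort (times);
            return times[runs / 2];                    // median (runs should be odd)
        }
    }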
****************************************************************

Timings on a Sun Ultra 5, 2000.

    quirk18 -slow   Loop     28.65   ******
    quirk18 -slow   Tail     67.54   ***
    quirk18 -slow   Deep    181.35   *

    quirk18         Loop     15.15   ***********
    quirk18         Tail     45.83   ****
    quirk18         Deep    157.21   *

    quirk18 -opt    Loop     21.75   ********
    quirk18 -opt    Tail     34.25   *****
    quirk18 -opt    Deep    121.12   *

    Larceny         Loop     20.19   ********
    Larceny         Tail     20.95   ********
    Larceny         Deep     36.38   *****

    Larceny -opt    Loop      9.04   *******************
    Larceny -opt    Tail     13.56   ************
    Larceny -opt    Deep     36.38   *****

    Larceny -bad    Loop      4.53   *************************************
    Larceny -bad    Tail      4.54   *************************************
    Larceny -bad    Deep     43.36   ****

    Chez            Loop      6.07   ****************************
    Chez            Tail      6.08   ****************************
    Chez            Deep     36.37   *****

    Java            Loop      3.38   **************************************************
    Java            Tail    220.69   *
    Java            Deep    218.60   *

    (CPU times in seconds; the bars show speed relative to the fastest run)

****************************************************************

Measurement noise.

With multiuser operating systems such as Unix, the time required to run a benchmark depends upon the number of users and what they are doing.  To minimize this effect, people often measure CPU time instead of elapsed time, but even the CPU time can vary in response to other users.

With multitasking operating systems such as Unix, Windows 98, and MacOS, the time required to run a benchmark depends on the other tasks that the operating system is performing.  The CPU time varies less than the elapsed time, but even the CPU time can vary in response to other tasks.

The benchmark timings that I report above are CPU times that were obtained on a Sun Ultra 5 with no other users and no other user jobs, but there were several system processes that I could not easily disable.  In most cases I report the median of three runs that differed by less than one per cent.  For some benchmarks, I report the median of five runs.  The slowest timings represent a single run.

The complexity of modern microprocessors has created several new forms of measurement noise.  This noise is generated by data and instruction caches, dynamic branch prediction, speculative execution, and especially by branch target alignment and by instruction alignment.  The speed of an inner loop depends upon its alignment with respect to the processor's instruction cache and instruction buffer.

Most of this noise is quantized, which makes it easy to overlook.  Consider, for example, the timings obtained in Larceny for the Loops and TailCalls benchmarks.  With the default optimizations, Larceny appears to run these benchmarks in about 20 seconds.  With all optimizations, Larceny appears to run the Loops benchmark in about 9 seconds, and the TailCalls benchmark in about 13.5 seconds.

Most people would conclude from these timings that Twobit's optimizations are making a big difference, and would also conclude that tail calls are about 50% slower than loops in optimized Scheme code.  Both of these conclusions would be incorrect.

The remarkable thing about these timings is that all four are measuring the speed of an inner loop that has been compiled into exactly the same SPARC machine code, except that the inner loop for the Loops benchmark begins at offset 332 and the inner loop for the TailCalls benchmark begins at offset 444.  This loop executes 12 machine instructions on each iteration, and there are 500 million iterations.

By forcing a full garbage collection, I can force Larceny to move this SPARC machine code to a different address.  Depending on the alignment, I have observed the following timings for this code.
(All timings were observed on the same machine, and are rounded to the nearest half second.)

     7.5    9.0   13.5   20.0   20.5   21.0

These timings probably correspond to different alignments of the SPARC machine code with respect to an 8-word instruction buffer.  (The variation between the last three timings might be caused by other forms of measurement noise, so those timings might correspond to a single alignment.)

****************************************************************

SPARC code for the inner loop in question, as generated by Twobit/Larceny in 2000.

    static int natsum (int m, int n) {
        while (m > 0) {
            m = m - 1;
            n = n + 1;
        }
        return n;
    }

L1002
    .proc
    .proc-doc           #(.loop|20|23|26 #f 2 #f #f (m n))
    reg/op2imm/branchf  internal:branchf-fx>/imm,1,0,1012
    reg/op2imm/setreg   internal:fx-/imm,1,1,1
    reg/op2imm/setreg   internal:fx+/imm,2,1,2
    branch              1002,2
L1012
    reg/return          2

    332  andcc   %r1, 3, %g0            ! reg/op2imm/branchf ...
    336  be,a    #364
    340  subcc   %r1, 0, %g0
    344  or      %g0, %r1, %result
    348  or      %g0, 0, %argreg2
    352  or      %g0, 584, %tmp0        ! 0x248
    356  jmpl    %globals + 1384, %o7   ! exception
    360  add     %o7, -32, %o7
    364  ble,a   #460
    368  ld      [ %stkp + 4 ], %o7
    372  tsubcc  %r1, 4, %tmp0          ! reg/op2imm/setreg ...
    376  bvc,a   #404
    380  or      %g0, %tmp0, %r1
    384  or      %g0, %r1, %result
    388  or      %g0, 4, %argreg2
    392  or      %g0, 564, %tmp0        ! 0x234
    396  jmpl    %globals + 1384, %o7   ! exception
    400  add     %o7, -32, %o7
    404  taddcc  %r2, 4, %tmp0          ! reg/op2imm/setreg ...
    408  bvc,a   #436
    412  or      %g0, %tmp0, %r2
    416  or      %g0, %r2, %result
    420  or      %g0, 4, %argreg2
    424  or      %g0, 560, %tmp0        ! 0x230
    428  jmpl    %globals + 1384, %o7   ! exception
    432  add     %o7, -32, %o7
    436  subcc   %timer, 1, %timer      ! branch 1002,2
    440  bne,a   #336
    444  andcc   %r1, 3, %g0
    448  jmpl    %globals + 1376, %o7   ! timer-exception
    452  add     %o7, -124, %o7
    456  ld      [ %stkp + 4 ], %o7     ! reg/return 2
    460  jmpl    %o7 + 8, %g0
    464  or      %g0, %r2, %result

****************************************************************

These effects also explain the apparent superiority of unoptimized quirk18 code over "optimized" quirk18 code on the Loops benchmark.  If you look at the MacScheme machine assembly code, however, the "optimized" code certainly looks better.

****************************************************************

    static int natsum (int m, int n) {
        while (m > 0) {
            m = m - 1;
            n = n + 1;
        }
        return n;
    }

    unoptimized                 optimized

    L1003                       L1003
        reg     1                   reg     1
        setreg  3                   op2imm  fx>,0
        const   0
        setreg  4
        reg     3
        op2     fx>,4
        branchf 1004,31             branchf 1004,31
        reg     1                   reg     1
        setreg  3                   op2imm  fx-,1
        const   1
        setreg  4
        reg     3
        op2     fx-,4
        setreg  1                   setreg  1
        reg     2                   reg     2
        setreg  3                   op2imm  fx+,1
        const   1
        setreg  4
        reg     3
        op2     fx+,4
        setreg  2                   setreg  2
        branch  1003,31             branch  1003,31
    L1004                       L1004
        reg     2                   reg     2
        skip    1002,31             skip    1002,31
    L1002                       L1002
        return                      return

****************************************************************

Runtime checking for overflow.

Larceny's fx+ and fx- primitives normally perform a runtime test for overflow.  This and other runtime checking can be disabled.  Since Java does not test for arithmetic overflow, this provides a better comparison with Java on the loop and recursion benchmarks, but is unfair to Java on the benchmarks that should involve runtime checking of array subscripts and runtime tests for null pointers.  So Larceny without runtime checking is identified as Larceny -bad.

Larceny's fixnum primitives were a recent addition, and hadn't yet appeared in a general release.  The code shown below reveals an obvious performance bug in fx>.  So neither quirk18 nor this development version of Larceny is really generating respectable code.
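To make the cost of that overflow test concrete, here is a rough Java sketch of unchecked versus overflow-checked addition.  The tsubcc/taddcc and bvc instructions in the listing above are Larceny's inline version of the checked case, and the listing that follows shows the same loop with the checks disabled.  (The helper names below are hypothetical.)

    // Java's int arithmetic behaves like uncheckedAdd: it wraps silently
    // mod 2^32.  Larceny's fx+ behaves like checkedAdd, except that the
    // add and the test are a couple of inline SPARC instructions rather
    // than a method call.

    static int uncheckedAdd (int a, int b) {
        return a + b;                          // wraps silently on overflow
    }

    static int checkedAdd (int a, int b) {
        int sum = a + b;
        // Overflow occurred iff a and b have the same sign and the sum's
        // sign differs from theirs.
        if (((a ^ sum) & (b ^ sum)) < 0)
            throw new ArithmeticException ("fixnum overflow");
        return sum;
    }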
****************************************************************

SPARC code for the inner loop in question, as generated by Twobit/Larceny in 2000, with runtime checking disabled.

    static int natsum (int m, int n) {
        if (m > 0)
            return natsum (m - 1, n + 1);
        else
            return n;
    }

L1002
    .proc
    .proc-doc           #(.natsum|4 #f 2 #f #f (m n))
    reg/op2imm/branchf  internal:branchf-fx>/imm,1,0,1014
    reg/op2imm/setreg   internal:fx-/imm,1,1,1
    reg/op2imm/setreg   internal:fx+/imm,2,1,2
    branch              1002,2
L1014
    reg/return          2

    360  subcc   %r1, 0, %g0
    364  ble,a   #412
    368  ld      [ %stkp + 4 ], %o7
    372  ble,a   #412
    376  ld      [ %stkp + 4 ], %o7
    380  sub     %r1, 4, %r1
    384  add     %r2, 4, %r2
    388  subcc   %timer, 1, %timer
    392  bne,a   #364
    396  subcc   %r1, 0, %g0
    400  jmpl    %globals + 1376, %o7   ! timer-exception
    404  add     %o7, -48, %o7
    408  ld      [ %stkp + 4 ], %o7
    412  jmpl    %o7 + 8, %g0
    416  or      %g0, %r2, %result

****************************************************************

Timings on a SunBlade 100, March 2002.
Larceny v0.50, Chez Scheme v6.1, Sun Java HotSpot version 1.3.0.

    Loops
        quirk19      10.40   ******************
        Larceny       7.19   **************************
        Chez         14.30   *************
        Java          3.80   **************************************************

    TailCalls
        quirk19      31.30   *************
        Larceny       8.16   **************************************************
        Chez         14.29   *****************************
        Java        500.60   *

    DeepRecursions
        quirk19     193.27   *******
        Larceny      25.84   **************************************************
        Chez         26.48   *************************************************
        Java        497.90   ***

    TreeRecursions
        quirk19      20.20   ***************
        Larceny       6.24   **************************************************
        Chez          6.69   ***********************************************
        Java          6.73   **********************************************

    Integrate
        quirk19      83.63   ****
        Larceny      85.33   ****
        Chez         47.82   *******
        Java          6.70   **************************************************

    SelectionSort
        quirk19      31.53   *******************
        Larceny      26.55   **********************
        Chez         45.43   *************
        Java         11.80   **************************************************

    Permutations9:1
        quirk19       6.20   *************
        Larceny       1.61   **************************************************
        Chez          2.40   **********************************
        Java          8.83   *********

    Permutations8:50
        quirk19      12.03   ************
        Larceny       2.93   **************************************************
        Chez          3.69   ****************************************
        Java          9.47   ***************

    Sieve
        quirk19       6.33   **
        Larceny       0.47   *******************************
        Chez          0.29   **************************************************
        Java          2.96   *****

    (CPU times in seconds; within each benchmark, the bars show speed relative to the fastest system)

****************************************************************

Interpretation of benchmark results.

Benchmark results don't lie, but they're usually misleading.  In my opinion, benchmarking is worse than useless unless you take the trouble to understand what is going on.  That's why I like micro-benchmarks like the nine above.  They aren't representative, but they are small enough to be understood, and small enough to isolate important differences between compilers and runtime systems.

The Loops, TailCalls, and DeepRecursions benchmarks each consist of a nested pair of loops, each of which counts down to 0.  They differ in how the inner loop is expressed.  In Loops, the inner loop is a while loop.  In TailCalls, the inner loop is an equivalent self-tail-recursion.  In DeepRecursions, the inner loop is an equivalent non-tail-recursion.  The inner loop is executed 10000 times for 50000 iterations of the outer loop.
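For concreteness, here is one way the three inner loops might be written in Java, modeled on the natsum procedure shown earlier.  (The recursive variants are reconstructions for illustration, not the actual benchmark code.)

    // Loops: an ordinary while loop.
    static int natsumLoop (int m, int n) {
        while (m > 0) { m = m - 1; n = n + 1; }
        return n;
    }

    // TailCalls: an equivalent self-tail-recursion.
    static int natsumTail (int m, int n) {
        if (m > 0) return natsumTail (m - 1, n + 1);
        else return n;
    }

    // DeepRecursions: an equivalent non-tail-recursion, which must push
    // a stack frame for every iteration and pop them all at the end.
    static int natsumDeep (int m, int n) {
        if (m > 0) return 1 + natsumDeep (m - 1, n);
        else return n;
    }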
On Loops, Java is fastest, mainly because it performs arithmetic mod 2^32.

On TailCalls, Java is slowest because the designers of the JVM made it hard to implement proper tail recursion, and the implementors of the Java compiler didn't bother to perform tail call optimization even in easy cases.  Larceny and Chez Scheme each generate exactly the same code for the inner loop of Loops and of TailCalls; the small difference between the two Larceny timings is repeatable, so it is probably caused by a change in instruction alignment.

On DeepRecursions, Java performs no slower than on TailCalls, because it had already compiled the tail calls as a deep recursion.  The reason Java is so much slower than Larceny and Chez Scheme on deep recursions is that Sun's implementation of Java uses the SPARC's standard calling conventions, which use the SPARC's register windows.

There are only about 20 register windows on-chip.  (There's room for more, but Sun doesn't want Unix context switches to become too slow.)  Every 20 calls, the Java code will suffer a register window overflow exception.  This exception takes time to process, and it has to copy the register windows to main memory.  Every 20 returns, the Java code will suffer a register window underflow exception, which copies register windows back from main memory.

Furthermore, each register window consists of 96 bytes, which not coincidentally is the smallest stack frame allowed by the SPARC's standard calling conventions.  In Larceny, each stack frame is only 16 bytes, which is the smallest stack frame allowed by Larceny's calling conventions.  Thus Java performs 12 times as many loads and stores as Larceny, and must also process many overflow and underflow exceptions.  No wonder it's almost 20 times as slow.

TreeRecursions features the shallow recursions on which the SPARC's register windows really shine, so why doesn't Java outperform Larceny and Chez Scheme on this benchmark?  Because the Java compiler allocates a register window for every call.  The Scheme compilers allocate a stack frame only when necessary; they don't allocate a frame for the base case of the recursion, which accounts for about half of the calls.

Integrate is a floating point benchmark.  The Scheme systems do badly because Scheme's arithmetic is dynamically generic, and the Scheme compilers generate inline code only for small integer arithmetic.  When the operands are floating point numbers, the arithmetic operation calls library code that tests to learn that the operands are in floating point format, loads them into the floating point registers, allocates heap storage for the result, and stores the result into that newly allocated storage.  Java's arithmetic primitives know that their operands are in floating point format, and can leave most of the intermediate results in registers.

SelectionSort, whose code you will see later, is mainly a test of arrays.  All of the tested systems are safe, but the HotSpot JIT compiler probably does a better job of eliminating dynamic range checks on this benchmark.

Permutations9:1 is a pure storage allocation benchmark.  All of the allocated storage becomes part of the result, so no garbage is generated, and any attempts to collect garbage are a waste of time.  Larceny and Chez Scheme use highly bummed sequences of inline code to allocate storage.  Java's storage allocators are probably not as efficient, and Java's object constructors allow arbitrary initialization code, so they are less efficient than Scheme's cons.
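To see why inline allocation can be so cheap, here is a sketch of pointer-bump allocation, with the nursery modeled as a Java array.  (The class and its names are hypothetical; the real thing is a few inline machine instructions, with the free and limit pointers kept in registers.)

    // A sketch of inline "pointer bump" allocation of a pair, roughly
    // what a Scheme compiler generates inline for cons.
    final class Nursery {
        private final Object[] heap;       // the nursery
        private int free = 0;              // index of the next unused slot

        Nursery (int size) { heap = new Object[size]; }

        // Allocate and initialize a two-word pair; return its address.
        int cons (Object car, Object cdr) {
            if (free + 2 > heap.length)    // inline limit check
                collect ();                // slow path: garbage collection
            int p = free;
            heap[p] = car;                 // initialize both fields
            heap[p + 1] = cdr;
            free = free + 2;               // bump the free pointer
            return p;
        }

        private void collect () {
            throw new UnsupportedOperationException ("sketch only");
        }
    }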
Permutations8:50 is a garbage collection benchmark in which all permutations of 8 things are computed and then thrown away 50 times (after a 1-iteration delay, which makes this a much more severe test of garbage collection than benchmarks that discard the result immediately after it is computed).  HotSpot performs respectably, but HotSpot's generational garbage collector probably isn't as fast as the generational collectors in Larceny and Chez Scheme.

By the way, both C and C++ tend to perform poorly on this benchmark unless a custom allocator/deallocator is written, because the standard implementations of free and delete must touch every deallocated node.  Generational garbage collectors do not have to do that.

Sieve is a higher order benchmark that computes all the prime numbers.  It would take forever to print them out, and we'd run out of memory before that, so the benchmark just counts how many primes are less than 10000.  The Java version of this benchmark uses function objects.  Why is allocating and initializing a function object so much slower than creating a closure?  For the same reasons that Scheme's cons is faster than Java or C++ new.

****************************************************************

Boolean evaluation for control.

Most boolean expressions are evaluated for their effect on the control flow of a program, not to generate an actual boolean value.  The MacScheme machine does not provide any way to avoid generating an actual boolean value, but Larceny's peephole optimization usually eliminates the boolean.

    class Foo {
        static void main (String[] args) { }
        static void f (Foo x) {
            if (x == null)
                ;                   // skip
            else
                ;                   // skip
        }
    }

    reg     1
    op2imm  eq?,()
    branchf 1004,31

    subcc   %r1, 10, %g0
    bne,a   #28
    nop

****************************************************************

Control optimization.

Twobit contains an entire module, with over 600 lines of code, that optimizes conditional expressions.

Consider a switch statement in Java.  The compiler has several choices for the algorithm that will be used at runtime to select the appropriate case:

    sequential search
    binary search
    jump table

On a Sun Ultra 1, sequential search is fastest when there are fewer than 8 cases.  Binary search is fastest in most other cases, but a jump table is fastest if there are a large number of distinct cases that are dense within some interval.
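The three strategies can be pictured as equivalent Java source; the case labels and result values below are made up for illustration.

    // switch (x) { case 1: ... case 3: ... case 4: ... case 7: ... }

    // 1.  Sequential search: a chain of comparisons.
    static int bySequentialSearch (int x) {
        if (x == 1) return 10;
        else if (x == 3) return 30;
        else if (x == 4) return 40;
        else if (x == 7) return 70;
        else return -1;
    }

    // 2.  Binary search on the sorted case labels.
    static int byBinarySearch (int x) {
        if (x < 4) {
            if (x == 1) return 10;
            else if (x == 3) return 30;
            else return -1;
        } else {
            if (x == 4) return 40;
            else if (x == 7) return 70;
            else return -1;
        }
    }

    // 3.  Jump table: a dense table indexed by x, guarded by a range check.
    static final int[] jumpTable = { -1, 10, -1, 30, 40, -1, -1, 70 };
    static int byJumpTable (int x) {
        return (0 <= x && x <= 7) ? jumpTable[x] : -1;
    }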
****************************************************************

Assembly-level optimization.

    peephole optimization
    filling of branch delay slots
    inline allocation

These optimizations typically improve the performance of quirk19 code by 15-40%.

With a good code generator, peephole optimization becomes less important.  In Twobit, however, the existence of a good peephole optimizer allows the code generator to be sloppy about things like boolean evaluation for control.  This makes it easier for the code generator to do a good job on things that cannot be optimized by a peephole optimizer.

****************************************************************

Local optimization

    parallel assignment optimization
    scheduling of load instructions, et cetera

A local optimization works with the code for a basic block.  Local optimizations are important because they reduce some of the obvious inefficiencies in the generated code.  In the example below, note how the two movereg instructions (and the temporary registers 3 and 4) disappear.

    return natsum (m - 1, n + 1);

    before local optimization:

        reg      1
        op2imm   fx-,1
        setreg   3
        reg      2
        op2imm   fx+,1
        setreg   4
        movereg  3,1
        movereg  4,2
        global   tailcalls.natsum
        invoke   2

    after local optimization:

        reg      1,.m|19
        op2imm   fx-,1
        setreg   1
        reg      2,.n|19
        op2imm   fx+,1
        setreg   2
        global   tailcalls.natsum
        invoke   2

Local optimizations also improve the low-level parallelism of the code.  For example, the load instructions of many RISC machines require more than one cycle to complete, even if they hit the data cache.  An instruction that uses the value that is loaded by a load instruction may have to wait for the load to complete.  By moving load instructions as early as possible within a straight-line sequence of machine code, we make it less likely that subsequent instructions will have to wait for their data.

Scheduling optimizations can interact strongly with the dynamic scheduling that is performed by current superscalar processors.

****************************************************************

Trace optimization

    deletion of locally redundant instructions
    branch tensioning

A trace is a finite sequence of basic blocks.

Simple code generators tend to produce redundant instructions, such as load instructions that restore a register whose contents will never be used.  They also tend to produce branch instructions that branch to other branch instructions, or branch to a label that immediately follows the branch.  Trace optimizations can clean up these inefficiencies.

Trace optimization is also important for scheduling, especially on very long instruction word (VLIW) architectures such as the Intel Itanium.  Trace optimization can be more effective in a just-in-time (JIT) compiler than in a conventional compiler, because a JIT compiler can optimize traces that cross module boundaries, as when a program calls a library routine.

****************************************************************

Loop optimization

    hoisting of loop-invariant computations

Most programs spend most of their time in a few inner loops, so it is important to make those loops run as fast as possible.

    static void selectionsort (int[] a, int n) {
        int i = 0;
        int j = 0;
        int k = 0;
        // Loop invariant:
        //     0 <= i <= n, and
        //     the contents of the array are a permutation of
        //     the original contents, and
        //     a[0] through a[i-1] are in nondecreasing order, and
        //     a[0] through a[i-1] are less than or equal to every
        //     element in a[i] through a[n-1].
        while (i < n) {
            j = i;
            k = i;
            // Loop invariant:
            //     i <= j < n, and
            //     i <= k <= n, and
            //     a[j] is least among a[i] through a[k-1].
            while (k < n) {
                if (a[j] > a[k])
                    j = k;
                else;
                k = k + 1;
            }
            swap (a, i, a, j);
            i = i + 1;
        }
    }

The compiler should keep a, j, k, and n in registers throughout the innermost loop.  It should also keep the length of a in a register, since that length is used implicitly by a dynamic range check (assuming a[j] and a[k] are evaluated safely).

If the compiler can prove that n is the length of the array a, that i >= 0, and that the first two parts of the loop invariant shown in the comment really are loop invariant, then the compiler can eliminate the runtime range checks without compromising safety.

Some compilers would replace j, k, and n by equivalent variables throughout the innermost loop, because this eliminates implicit multiplications by 4 inside the loop.  This optimization is called reduction in strength.

    int j4 = 4 * j;
    int k4 = 4 * k;
    int n4 = 4 * n;
    while (k4 < n4) {
        if ((* (a + j4)) > (* (a + k4)))
            j4 = k4;
        else;
        k4 = k4 + 4;
    }
    j = j4 / 4;
    k = k4 / 4;

With Larceny's representation of integers, there is no need for this.
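That remark deserves a word of explanation.  Larceny represents a fixnum k as the machine word 4*k, with two low-order tag bits of zero, so an index register already holds the byte offset that reduction in strength would have computed.  The sketch below (with hypothetical helper names) spells this out.

    static int toFixnum (int k)   { return k << 2; }   // fixnum k is the word 4*k
    static int fromFixnum (int f) { return f >> 2; }   // untag

    // An index register already holds 4*j, which is exactly the byte
    // offset that the strength-reduced j4 would have held, so element
    // addresses can be formed without any multiplication:
    static int elementOffset (int fixnumIndex) {
        return fixnumIndex;                 // already 4*j bytes from the base
    }

    // Incrementing the index by one is an ordinary add of the constant 4,
    // which is why the SPARC code shown earlier adds and subtracts 4.
    static int nextIndex (int fixnumIndex) {
        return fixnumIndex + toFixnum (1);
    }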
****************************************************************

Intraprocedural optimization

    copy propagation
    common subexpression elimination
    dead code elimination
    dead variable elimination
    register allocation

Intraprocedural optimizations are restricted to a single procedure.  They are also known as "global" optimizations.

    static List revloop (List x, int n, List y) {
        if (n == 0)
            return y;
        else
            return revloop (x.tail, n - 1, List.cons (x.head, y));
    }

For safety, the compiler must check to ensure that x is not null before it evaluates x.tail and x.head.  Without optimization, the compiler would check twice.  With common subexpression elimination, the compiler would check only once.  If x is not null when x.tail is evaluated, then it won't be null when x.head is evaluated.

Common subexpression elimination is also important for array addressing.  For example, it eliminates the second implicit multiplication in

    a[i] = a[i] + 1;

Register allocation is important because it reduces the number of load and store instructions, and because it can sometimes eliminate the need to allocate a stack frame.

****************************************************************

Interprocedural optimization

    lambda lifting
    register targeting
    inlining of small procedures
    constant propagation
    constant folding
    representation and subrange inference

Interprocedural optimizations can operate across procedure boundaries.

Lambda lifting and closure conversion are optimizations that reduce the number of closures that need to be created.  In effect, they rewrite the program to increase the number of leaf procedures.

Register targeting reduces the number of register-to-register move instructions by heuristic selection of the registers in which arguments are passed.

****************************************************************

Inlining, constant propagation, and constant folding allow the compiler to compile

    class PseudoRandom {
        ...
        static int random (PseudoRandom generator) {
            int r = mod (multiplier * generator.rand, divisor);
            generator.rand = r;
            return r;
        }
        static final int seed = 21;
        static final int multiplier = 1731;
        static final int divisor = 4001;
        int rand;
        static int mod (int m, int n) {
            return m - n * (m / n);
        }
    }

as if it were

    class PseudoRandom {
        ...
        // Returns a pseudo-random number.
        static int random (PseudoRandom generator) {
            int r;
            {
                int m = 1731 * generator.rand;
                int n = 4001;
                r = m - n * (m / n);
            }
            generator.rand = r;
            return r;
        }
        static final int seed = 21;
        static final int multiplier = 1731;
        static final int divisor = 4001;
        int rand;
        static int mod (int m, int n) {
            return m - n * (m / n);
        }
    }

Intraprocedural copy propagation and dead code elimination can then convert this into

    class PseudoRandom {
        ...
        // Returns a pseudo-random number.
        static int random (PseudoRandom generator) {
            int r;
            {
                int m = 1731 * generator.rand;
                r = m - 4001 * (m / 4001);
            }
            generator.rand = r;
            return r;
        }
        static final int seed = 21;
        static final int multiplier = 1731;
        static final int divisor = 4001;
        int rand;
        static int mod (int m, int n) {
            return m - n * (m / n);
        }
    }

****************************************************************

Interprocedural representation and subrange inference reduce the number of runtime safety checks.  For example, a Java compiler must generate code that checks every array reference a[i] to ensure that a is not null and that 0 <= i < a.length.  Scheme is dynamically typed, so a Scheme compiler must also check that a is an array.
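Writing those checks out by hand shows what the compiler must otherwise generate for a statement like a[i] = a[i] + 1.  (The helper below is illustrative; a real compiler generates this inline.)

    // The checks and address arithmetic implicit in  a[i] = a[i] + 1;
    static void incrementElement (int[] a, int i) {
        if (a == null)                       // null check
            throw new NullPointerException ();
        if (i < 0 || i >= a.length)          // range check
            throw new ArrayIndexOutOfBoundsException (i);
        // address of a[i] = base of a + header size + 4*i  (the implicit
        // multiplication mentioned above)
        int x = a[i];                        // checked load
        a[i] = x + 1;                        // checked store; common
                                             // subexpression elimination and
                                             // subrange inference let the
                                             // compiler do all of this once
    }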
Every single one of these runtime checks can be eliminated from the following program without compromising safety.

    class SelectionSort {

        static void main (String[] args) {
            int n = Integer.parseInt (args[0]);
            benchmark (n);
        }

        static void benchmark (int n) {
            int[] a = testCase (n);
            selectionsort (a, n);
        }

        static int[] testCase (int n) {
            int[] a = new int [n];
            PseudoRandom generator = PseudoRandom.create ();
            for (int i = 0; i < n; i = i + 1)
                a[i] = PseudoRandom.random (generator);
            return a;
        }

        static void selectionsort (int[] a, int n) {
            int i = 0;
            int j = 0;
            int k = 0;
            while (i < n) {
                j = i;
                k = i;
                while (k < n) {
                    if (a[j] > a[k])
                        j = k;
                    else;
                    k = k + 1;
                }
                swap (a, i, a, j);
                i = i + 1;
            }
        }

        static void swap (int[] a, int i, int[] b, int j) {
            int temp = a[i];
            a[i] = b[j];
            b[j] = temp;
        }
    }

****************************************************************