12. Code Optimization

Code optimization is the process by which a program is improved by reducing its code size, complexity, memory use, or runtime (or some combination thereof) without changing the program’s inherent function. Many compilation systems include a code optimizer as an intermediate step. Specifically, an optimizing compiler applies code-improving transformations as part of the compilation process. Virtually all modern compilers (including GCC) are optimizing compilers. The GCC C compiler implements a wide variety of optimization flags that give programmers direct access to a subset of the implemented optimizations. Compiler optimization flags optimize code at the expense of compile time and ease of debugging. For simplicity, GCC wraps up a subset of these optimization flags into different optimization levels that the programmer can directly invoke. For example, the following command compiles a sample program with level 1 optimizations:

$ gcc -O1 -o program program.c

The level 1 (-O1 or -O) optimizations in GCC perform basic optimizations to reduce code size and execution time while attempting to keep compile time to a minimum. Level 2 (-O2) optimizations include most of GCC’s implemented optimizations that do not involve a space-performance trade-off. Lastly, level 3 (-O3) performs additional optimizations (such as function inlining, discussed later in this chapter), and may cause the program to take significantly longer to compile. The GCC documentation describes the implemented optimization flags in detail.
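
The higher optimization levels are invoked in the same way; for example, the following commands compile the sample program with level 2 and level 3 optimizations, respectively:

$ gcc -O2 -o program program.c
$ gcc -O3 -o program program.c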

A detailed discussion of optimizing compilers and their construction and operation is beyond the scope of this textbook; we encourage interested readers to check out the seminal text, Compilers: Principles, Techniques, and Tools, by Aho, Sethi, and Ullman. Instead, the purpose of this chapter is to highlight some things that most compilers can (and cannot) do, and how programmers can partner with their compilers and profiling tools to help improve their code.

What Compilers Already Do

Several of the common optimizations performed by virtually every compiler are described briefly in the upcoming sections. Students should never manually implement these optimizations, because they are already implemented by the compiler.

Constant Folding

Constants in the code are evaluated at compile time to reduce the number of resulting instructions. For example, in the code snippet that follows, macro expansion replaces the statement int debug = N-5 with int debug = 5-5. Constant folding then updates this statement to int debug = 0.

#define N 5
int debug = N - 5; //constant folding changes this statement to debug = 0;

Constant Propagation

Constant propagation replaces variables with a constant value if such a value is known at compile time. Consider the following code segment:

int debug = 0;

//sums up all the elements in an array
int doubleSum(int *array, int length){
    int i, total = 0;
    for (i = 0; i < length; i++){
        total += array[i];
        if (debug) {
            printf("array[%d] is: %d\n", i, array[i]);
        }
    }
    return 2 * total;
}

A compiler employing constant propagation will change if (debug) to if (0).

Dead Code Elimination

It is not uncommon for a program to be littered with unused variables, assignments, or statements. Even though these unneeded statements are rarely introduced intentionally, they are often a natural by-product of the constant iteration and refinement of the software development cycle. If left undetected, these so-called dead code sequences can cause compilers to output unnecessary assembly instructions that in turn waste processing time. Most compilers employ techniques such as dataflow analysis to identify unreachable code segments and thereby remove them. Dead code elimination often makes a program faster by shrinking code size and the associated set of instructions. As an example, let’s revisit the doubleSum function in which the compiler employed constant propagation to replace debug with 0 in the if statement:

int debug = 0;

//sums up all the elements in an array
int doubleSum(int *array, int length){
    int i, total = 0;
    for (i = 0; i < length; i++){
        total += array[i];
        if (0) { //debug replaced by constant propagation by compiler
            printf("array[%d] is: %d\n", i, array[i]);
        }
    }
    return 2 * total;
}

A compiler employing dataflow analysis recognizes that the if statement always evaluates to false and that the printf statement never executes. The compiler therefore eliminates the if statement and the call to printf in the compiled executable. Another pass also eliminates the statement debug = 0.

Simplifying Expressions

Some instructions are more expensive than others. For example, the imul and idiv arithmetic instructions in assembly take a long time to execute. Compilers commonly attempt to reduce the number of expensive instructions by simplifying mathematical operations whenever possible. For example, in the doubleSum function, the compiler may replace the expression 2 * total with total + total because the addition instruction is less expensive than multiplication:

//declaration of debug removed through dead-code elimination

//sums up all the elements in an array
int doubleSum(int *array, int length){
    int i, total = 0;
    for (i = 0; i < length; i++){
        total += array[i];
        //if statement removed through data-flow analysis
    }
    return total + total; //simplifying expression
}

Likewise, the compiler will transform code sequences with bit-shifting and other bitwise operators to simplify expressions. For example, the compiler may replace the expression total * 8 with total << 3, or (when total is known to be nonnegative) the expression total % 8 with total & 7, given that each bitwise operation is performed with a single fast instruction.
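
As a rough sketch (not part of the original example), the following self-contained snippet confirms that the shifted and masked forms compute the same values as the original expressions for a nonnegative total:

#include <stdio.h>

int main(void) {
    int total = 20; //any nonnegative value works for this comparison

    printf("%d %d\n", total * 8, total << 3); //both print 160
    printf("%d %d\n", total % 8, total & 7);  //both print 4

    return 0;
}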

What Compilers Cannot Always Do: Benefits of Learning Code Optimization

Given the benefits of optimizing compilers, it may not be immediately obvious why learning code optimization is useful. It may be tempting to think of the compiler as a magical black box that is "smart." At the end of the day, the compiler is a piece of software that performs a series of code transformations in an effort to speed up code. Compilers are also limited in the types of optimizations they can perform.

Algorithmic Strength Reduction Is Impossible

The top reason for poor code performance is bad choices of data structures and algorithms. Compilers cannot magically fix these bad decisions. For example, a compiler will never optimize a program implementing bubble sort into one that implements quick sort. While the sophistication of compilers and their optimizations continues to improve, the quality of any individual compiler’s optimizations varies between platforms. The onus is therefore on the programmer to ensure that their code leverages the best algorithms and data structures.

Compiler Optimization Flags Are Not Guaranteed to Make Code "Optimal" (or Consistent)

Increasing the level of compiler optimizations (e.g., from -O2 to -O3) may not always decrease the runtime of a program. Sometimes, the programmer may discover that updating the optimization flags from -O2 to -O3 slows down a program, or yields no performance increase at all. In other cases, a programmer may discover that a program compiled without the optimization flags seemingly yields no errors, whereas compiling it with -O2 or -O3 results in segmentation faults or other errors. These types of programming errors are especially difficult to debug, because GCC’s debug (-g) flag is incompatible with its optimization (-O) flags, as the transformations performed by compiler optimizations at the -O levels interfere with the debugger’s ability to analyze the underlying code. The -g flag is required by many common debugging and profiling tools, such as GDB and Valgrind.

One large reason for inconsistent behavior is that the C/C++ standard does not provide clear guidance for resolving undefined behavior. As a result, it is often up to the compiler to decide how to resolve ambiguity. Inconsistencies in how different optimization levels handle undefined behavior can cause answers to change. Consider the following example from John Regehr [1]:

int silly(int a) {
  return (a + 1) > a;
}

Suppose that silly was run with a = INT_MAX. In this case, the computation a + 1 results in integer overflow. However, the C/C++ standard does not define how integer overflow should be handled by the compiler. In fact, compiling the program with no optimizations causes the function to return 0, while compiling it with -O3 optimizations results in the function returning 1.
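
A minimal driver (not from the original text) makes this inconsistency easy to reproduce; as described above, the printed value depends on the optimization level used at compile time:

#include <limits.h>
#include <stdio.h>

int silly(int a) {
  return (a + 1) > a;
}

int main(void) {
    //signed integer overflow is undefined behavior, so this may print 0 or 1
    //depending on the compiler and optimization level
    printf("silly(INT_MAX) = %d\n", silly(INT_MAX));
    return 0;
}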

In short, optimization flags should be used with caution, thoughtfully, and when necessary. Learning which optimization flags to employ can also help the programmer work with their compiler instead of against it.

The compiler is not required to handle undefined behavior

The silly function, when run with a = INT_MAX, is an example of undefined behavior. Note that the inconsistent output produced by the compiler is not a flaw in the compiler’s design or a consequence of using optimization flags. Compilers are specifically designed to follow a language’s specification. The C language standard does not specify what a compiler should do when it encounters undefined behavior; the program may crash, fail to compile, or generate inconsistent or incorrect results. Ultimately, the programmer is responsible for identifying and eliminating undefined behavior in code. Whether silly should return 0, 1, or some other value is a decision the programmer must make. To learn more about undefined behavior and related issues in C programs, visit the C FAQ [2] or John Regehr’s blog [1].

Pointers Can Prove Problematic

Recall that the compiler makes transformations that leave the fundamental behavior of the source program unchanged. If a transformation risks changing the behavior of the program, the compiler will not make it. This is especially true in the case of memory aliasing, where two different pointers point to the same address in memory. As an example, consider the following function shiftAdd, which takes two integer pointers as its parameters. The function multiplies the first number by 10 and adds the second number to it. So, if the shiftAdd function were passed the integers 5 and 6, the result would be 56.

Table 1. Comparison of two functions that multiply the first number by 10 and add the second to it.

Unoptimized Version:

void shiftAdd(int *a, int *b){
    *a = *a * 10; //multiply by 10
    *a += *b; //add b
}

Optimized Version:

void shiftAddOpt(int *a, int *b){
    *a = (*a * 10) + *b;
}

The shiftAddOpt function optimizes the shiftAdd function by removing an additional memory reference to a, resulting in a smaller set of instructions in the compiled assembly. However, the compiler will never make this optimization due to the risk of memory aliasing. To understand why, consider the following main function:

int main(void){
    int x = 5;
    int y = 6;
    shiftAdd(&x, &y); //should produce 56
    printf("shiftAdd produces: %d\n", x);

    x = 5; //reset x
    shiftAddOpt(&x, &y); //should produce 56
    printf("shiftAddOpt produces: %d\n", x);

    return 0;

}

Compiling and running this program gives the expected output:

$ gcc -o shiftadd shiftadd.c
$ ./shiftadd
shiftAdd produces: 56
shiftAddOpt produces: 56

Suppose, instead, that the program were modified so that a pointer to x is passed as both parameters to shiftAdd and shiftAddOpt:

int main(void){
    int x = 5;
    shiftAdd(&x, &x); //should produce 55
    printf("shiftAdd produces: %d\n", x);

    x = 5; //reset x
    shiftAddOpt(&x, &x); //should produce 55
    printf("shiftAddOpt produces: %d\n", x);

    return 0;

}

The expected output is 55. However, recompiling and rerunning the updated code gives two different outputs:

$ gcc -o shiftadd shiftadd.c
$ ./shiftadd
shiftAdd produces: 100
shiftAddOpt produces: 55

Retracing through the shiftAdd function with the assumption that a and b point to the same memory location reveals the issue. The multiplication of *a by 10 updates x to 50. Next, adding *b to *a doubles x to 100. The risk of memory aliasing reveals that shiftAdd and shiftAddOpt are not in fact equivalent, though the programmer may have intended them to be. To fix this issue, recognize that the second parameter of shiftAdd does not need to be passed as a pointer. Replacing the second parameter with an integer eliminates the risk of aliasing and allows the compiler to optimize one function into the other:

Table 2. Improved functions that multiply the first number by 10 and add the second to it.

Unoptimized Version (Fixed):

void shiftAdd(int *a, int b){
    *a = *a * 10; //multiply by 10
    *a += b; //add b
}

Optimized Version (Fixed):

void shiftAddOpt(int *a, int b){
    *a = (*a * 10) + b;
}

Removing the unneeded memory reference allows the programmer to maintain the readability of the original shiftAdd function while enabling the compiler to optimize the function.
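
As a quick check, the following self-contained driver (not part of the original text) shows that the fixed versions produce the expected result even when the same variable supplies both arguments:

#include <stdio.h>

void shiftAdd(int *a, int b){
    *a = *a * 10; //multiply by 10
    *a += b; //add b
}

void shiftAddOpt(int *a, int b){
    *a = (*a * 10) + b;
}

int main(void){
    int x = 5;
    shiftAdd(&x, x); //x is now 55
    printf("shiftAdd produces: %d\n", x);

    x = 5; //reset x
    shiftAddOpt(&x, x); //x is again 55
    printf("shiftAddOpt produces: %d\n", x);

    return 0;
}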

Partnering with Your Compiler: A Sample Program

In the following sections, we concentrate on learning more about popular types of optimizations and discuss programming and profiling strategies to help make it easier for compilers to optimize our code. To guide our discussion, we will work to optimize the following (suboptimally written) program that attempts to find all the prime numbers between 2 and n:

#include <math.h> //needed for sqrt; other includes are omitted for brevity

//helper function: checks to see if a number is prime
int isPrime(int x) {
    int i;
    for (i = 2; i < sqrt(x) + 1; i++) { //no prime number is less than 2
        if (x % i == 0) { //if the number is divisible by i
            return 0; //it is not prime
        }
    }
    return 1; //otherwise it is prime
}

// finds the next prime
int getNextPrime(int prev) {
    int next = prev + 1;
    while (!isPrime(next)) { //while the number is not prime
        next++; //increment and check again
    }
    return next;
}

// generates a sequence of primes
int genPrimeSequence(int *array, int limit) {
    int i;
    int len = limit;
    if (len == 0) return 0;
    array[0] = 2; //initialize the first number to 2
    for (i = 1; i < len; i++) {
        array[i] = getNextPrime(array[i-1]); //fill in the array
        if (array[i] > limit) {
            len = i;
            return len;
        }
    }
    return len;
}

int main(int argc, char **argv) {
  //omitted for brevity
  int *array = allocateArray(limit);
  int length = genPrimeSequence(array, limit);

  return 0;
}

Table 3 shows the timing results for producing the primes between 2 and 5,000,000 with the different optimization level flags using the following basic compilation command:

$ gcc -o optExample optExample.c -lm

Table 3. Time in Seconds to Produce Prime Numbers Between 2 and 5,000,000

Unoptimized    -O1     -O2     -O3
3.86           2.32    2.14    2.15

The fastest observed time with optimization flags is approximately 2.14 seconds. Although using optimization flags shaves more than a second off the program's runtime, raising the optimization level beyond -O1 provides minimal additional improvement. In the next sections, we will discuss how we can modify our program to make it easier for the compiler to optimize.
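
The optimized timings in Table 3 come from recompiling the same program with the corresponding optimization flag added to the basic command, for example:

$ gcc -O1 -o optExample optExample.c -lm
$ gcc -O2 -o optExample optExample.c -lm
$ gcc -O3 -o optExample optExample.c -lm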

References

  1. John Regehr. "A Guide to Undefined Behavior in C and C++, Part 1". https://blog.regehr.org/archives/213

  2. C FAQ. "comp.lang.c FAQ list: Question 11.33". http://c-faq.com/ansi/undef.html