12.1. Code Optimization First Steps: Code Profiling

"The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming." - Don Knuth, The Art of Computer Programming

One of the biggest dangers in code optimization is the concept of premature optimization. Premature optimization occurs when a programmer attempts to optimize based on "gut feelings" of where performance inefficiencies occur, and not on data. Whenever possible, it is important to measure the runtime of different portions of code on different inputs prior to starting optimization to identify hot spots or areas in the program in which the most instructions occur.
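
One lightweight way to take such measurements is to record the wall-clock time immediately before and after a region of interest. The following sketch uses the POSIX gettimeofday function; it is only an illustration of the idea (the getTime helper is made up for this example), not the exact timing code used by the optExample program discussed next.

#include <stdio.h>
#include <sys/time.h>

// returns the current wall-clock time in seconds as a double
static double getTime(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

int main(void) {
    double start = getTime();
    // ... portion of the program to measure ...
    double end = getTime();
    printf("Time elapsed: %g seconds\n", end - start);
    return 0;
}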

To figure out how to optimize optExample.c, let’s start by taking a closer look at the main function:

int main(int argc, char **argv) {
    // error-handling and timing code omitted for brevity

    int limit = strtol(argv[1], NULL, 10);
    int length = limit;
    int *array = allocateArray(length); //allocates array of specified length

    length = genPrimeSequence(array, limit); //generates sequence of primes

    return 0;
}

The main function contains calls to two functions: allocateArray, which allocates an array of a user-specified length (or limit), and genPrimeSequence, which generates a sequence of primes within the specified limit (note that for any sequence between 2 and n, there cannot be more than n primes, and frequently there are significantly fewer). The main function in the C file contains code that times each of the two functions above. Compiling and running the code with limit set to 5,000,000 reveals the following:

$ gcc -o optExample optExample.c -lm
$ time -p ./optExample 5000000
Time to allocate: 5.5e-05
Time to generate primes: 3.85525
348513 primes found.
real 3.85
user 3.86
sys 0.00

The optExample program takes approximately 3.86 seconds to complete, with nearly all of that time spent in the genPrimeSequence function. There is no point in spending time optimizing allocateArray, because any improvements will be negligible relative to the runtime of the overall program. In the examples that follow, we focus more closely on the genPrimeSequence function and its associated functions, which are reproduced here for convenience:

// helper function: checks to see if a number is prime
int isPrime(int x) {
    int i;
    for (i = 2; i < sqrt(x) + 1; i++) { //no prime number is less than 2
        if (x % i == 0) { //if the number is divisible by i
            return 0; //it is not prime
        }
    }
    return 1; //otherwise it is prime
}

// finds the next prime
int getNextPrime(int prev) {
    int next = prev + 1;
    while (!isPrime(next)) { //while the number is not prime
        next++; //increment and check again
    }
    return next;
}

// generates a sequence of primes
int genPrimeSequence(int *array, int limit) {
    int i;
    int len = limit;
    if (len == 0) return 0;
    array[0] = 2; //initialize the first number to 2
    for (i = 1; i < len; i++) {
        array[i] = getNextPrime(array[i-1]); //fill in the array
        if (array[i] > limit) {
            len = i;
            return len;
        }
    }
    return len;
}

To find hot spots in a program, focus on areas with loops, since the instructions inside a loop execute repeatedly. Manual inspection of code can assist in locating hot spots, though it should always be verified with profiling tools prior to attempting optimization. A manual inspection of the optExample program yields the following observations.

  • The genPrimeSequence function attempts to generate all the prime numbers between 2 and some integer n. Since the number of primes between 2 and n cannot exceed n, the for loop in genPrimeSequence runs no more than n times. Every iteration of the for loop calls the getNextPrime function once. Thus, getNextPrime runs no more than n times.

  • The while loop in the getNextPrime function continues running until a prime is discovered. Although it is difficult to determine ahead of time how many times this while loop will execute as a function of n (the gap between consecutive prime numbers can be arbitrarily large), it is certain that isPrime executes on every iteration of the while loop.

  • The isPrime function contains exactly one for loop. Suppose that the loop runs for a total of k iterations. Then, the code in the loop body runs k times in total. Recall that the structure of a for loop consists of an initialization statement (which initializes the loop variable to a particular value), a Boolean expression (that determines when to terminate the loop), and a step expression (that updates the loop variable every iteration). Table 1 depicts the number of times each loop component executes in a for loop that runs for k iterations. In every for loop, initialization happens exactly once. The Boolean expression executes k + 1 times for k iterations, since it must perform one final check to terminate the loop. The loop body and the step expression execute k times each.

Table 1. Loop execution components (assuming k iterations)

Initialization statement   Boolean expression   Step expression   Loop body
1                          k+1                  k                 k
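
As a quick sanity check of these counts, the following standalone program (not part of optExample.c; the counters init, test, step, and body exist only for illustration) increments a separate counter each time a loop component executes. For k = 10 it prints 1, 11, 10, and 10, matching the counts in Table 1.

#include <stdio.h>

int main(void) {
    int k = 10;   // number of iterations
    int i;
    long init = 0, test = 0, step = 0, body = 0;

    for (init++, i = 0; test++, i < k; step++, i++) {
        body++;   // loop body
    }

    // prints: init: 1, test: 11, step: 10, body: 10
    printf("init: %ld, test: %ld, step: %ld, body: %ld\n",
           init, test, step, body);
    return 0;
}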

Our manual inspection of the code suggests that the program spends most of its time in the isPrime function, and that the sqrt function is called more often than any other function. Let’s next use code profiling to verify this hypothesis.

12.1.1. Using Callgrind to Profile

In our small program, it was relatively straightforward to use manual inspection to form the hypothesis that the sqrt function occurs in a "hot spot" in the code. However, identifying hot spots can become more complex in larger programs. Regardless, it is a good idea to use profiling to verify our hypothesis. Code profiling tools like Valgrind provide a lot of information about program execution. In this section, we use the callgrind tool to inspect the optExample program’s call graph.

To use callgrind, let’s start by recompiling the optExample program with the -g flag and running callgrind on a smaller range (2 to 100,000). Like other Valgrind applications, callgrind runs as a wrapper around a program, adding annotations such as the number of times functions execute and the total number of instructions that are executed as a result. Consequently, the optExample program will take longer to execute when run in conjunction with callgrind.

$ gcc -g -o optExample optExample.c -lm
$ valgrind --tool=callgrind ./optExample 100000
==32590== Callgrind, a call-graph generating cache profiler
==32590== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==32590== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==32590== Command: ./optExample 100000
==32590==
==32590== For interactive control, run 'callgrind_control -h'.
Time to allocate: 0.003869
Time to generate primes: 0.644743
9592 primes found.
==32590==
==32590== Events    : Ir
==32590== Collected : 68338759
==32590==
==32590== I   refs:      68,338,759

Typing ls at the terminal reveals a new file called callgrind.out.xxxxx, where xxxxx is a unique id. In this case, the file is callgrind.out.32590 (i.e., the number shown along the left-hand column in the preceding output). Running callgrind_annotate on this file yields additional information on the three functions of interest:

$ callgrind_annotate --auto=yes callgrind.out.32590
 ----------------------------------------------------------------
Profile data file 'callgrind.out.32590' (creator: callgrind-3.11.0)
 ----------------------------------------------------------------
...
  .  //helper function: checks to see if a number is prime
   400,004  int isPrime(int x) {
         .      int i;
36,047,657      for (i = 2; i < sqrt(x)+1; i++) { //no prime is less than 2
13,826,015  => ???:sqrt (2765204x)
16,533,672          if (x % i == 0) { //if the number is divisible by i
   180,818              return 0; //it is not prime
         .          }
         .      }
     9,592      return 1; //otherwise it is prime
   200,002  }
         .
         .  // finds the next prime
    38,368  int getNextPrime(int prev) {
    28,776      int next = prev + 1;
   509,597      while (!isPrime(next)) { //while the number is not prime
67,198,556  => optExample.c:isPrime (100001x)
    90,409          next++; //increment and check again
         .      }
     9,592      return next;
    19,184  }
         .
         .  // generates a sequence of primes
         6  int genPrimeSequence(int * array, int limit) {
         .      int i;
         2      int len = limit;
         2      if (len == 0) return 0;
         2      array[0]=2; //initialize the first number to 2
    38,369      for (i = 1; i < len; i++) {
   143,880          array[i] = getNextPrime(array[i-1]); //fill in the array
67,894,482  => optExample.c:getNextPrime (9592x)
    76,736          if (array[i] > limit){
         2              len = i;
         2              return len;
         .          }
         .      }
         .      return len;
         4  }

The numbers along the left-hand column represent the total number of executed instructions associated with each line. The numbers in parentheses indicate the number of times a particular function was run. Using the numbers along the left-hand column, we are able to verify the results of our manual inspection. In the genPrimeSequence function, the call to getNextPrime accounted for the greatest number of executed instructions (67.8 million), corresponding to 9,592 function calls (to generate the primes between 2 and 100,000). Inspecting getNextPrime reveals that the majority of those instructions (67.1 million, or 99%) result from the call to isPrime, which is called a total of 100,001 times. Lastly, inspecting isPrime reveals that 13.8 million of its instructions (roughly 20%) result from the sqrt function, which executes a total of 2.7 million times.

These results verify our original hypothesis that the program spends most of its time in the isPrime function, with the sqrt function executing the most frequently of all the functions. Reducing the total number of executed instructions results in a faster program; the above analysis suggests that our initial efforts should concentrate on improving the isPrime function, and potentially reducing the number of times sqrt executes.

12.1.2. Loop-Invariant Code Motion

Loop-invariant code motion is an optimization technique that moves static computations occurring inside a loop to outside the loop without affecting the loop’s behavior. Optimizing compilers are capable of making most loop-invariant code optimizations automatically. Specifically, the -fmove-loop-invariants compiler flag in GCC (enabled at level -O1) attempts to identify loop-invariant computations and move them outside their respective loops.
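
As a simple illustration, consider the hypothetical function below (addSquare is a made-up example, not part of optExample.c): the expression n * n produces the same value on every iteration, so it can be computed once before the loop without changing the result. With n * n written inside the loop body, GCC's -fmove-loop-invariants pass would typically make this same transformation automatically.

void addSquare(int *array, int len, int n) {
    int i;
    int nsquared = n * n;          // loop-invariant computation moved out of the loop
    for (i = 0; i < len; i++) {
        array[i] = i + nsquared;   // previously: array[i] = i + n * n;
    }
}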

However, the compiler cannot always identify cases of loop-invariant code motion, especially in the case of function calls. Since function calls can inadvertently cause side effects (unintended behavior), most compilers will avoid trying to determine whether a function call consistently returns the same result. Thus, even though the programmer knows that sqrt(x) always returns the square root of some input x, GCC will not always make that assumption. Consider the case where the sqrt function updates a secret global variable, g. In that case, calling sqrt once outside of the function (one update to g) is not the same as calling it every iteration of the loop (n updates to g). If a compiler cannot determine that a function always returns the same result, it will not automatically move the sqrt function outside the loop.
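
To see why, consider the following contrived sketch (the global variable g and the functions noisySqrt and countDivisors are hypothetical, not part of optExample.c). Because noisySqrt modifies g, hoisting the call out of the loop would change how many times g is updated; a compiler that cannot prove a call is free of side effects must therefore leave it inside the loop.

#include <math.h>

int g = 0;  // hypothetical "secret" global variable

double noisySqrt(double x) {
    g++;             // side effect: every call updates g
    return sqrt(x);
}

int countDivisors(int x) {
    int i, count = 0;
    for (i = 2; i < noisySqrt(x) + 1; i++) {  // g is updated on every loop test
        if (x % i == 0) {
            count++;
        }
    }
    // hoisting noisySqrt(x) + 1 out of the loop would leave g with a different value
    return count;
}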

However, the programmer knows that moving the computation sqrt(x) + 1 outside the for loop does not affect the loop’s behavior. The updated function is shown here and is available in optExample2.c:

//helper function: checks to see if a number is prime
int isPrime(int x) {
    int i;
    int max = sqrt(x)+1;
    for (i = 2; i < max; i++) { //no prime number is less than 2
        if (x % i == 0) { //if the number is divisible by i
            return 0; //it is not prime
        }
    }
    return 1; //otherwise it is prime
}

Table 2 shows that this simple change shaves a full two seconds off the runtime of optExample2, reducing the unoptimized time from 3.86 to 1.83 seconds (roughly 47% of the original), even before using compiler flags. Furthermore, the compiler seems to have a slightly easier time optimizing optExample2.

Table 2. Time in seconds to produce the prime numbers between 2 and 5,000,000

Version                           Unoptimized   -O1    -O2    -O3
Original                          3.86          2.32   2.14   2.15
With loop-invariant code motion   1.83          1.63   1.71   1.63

Rerunning callgrind on the optExample2 executable reveals why such a large improvement in runtime was observed. The following code snippet assumes that the file callgrind.out.30086 contains the annotations of running callgrind on the optExample2 executable:

$ gcc -g -o optExample2 optExample2.c -lm
$ valgrind --tool=callgrind ./optExample2 100000
$ callgrind_annotate --auto=yes callgrind.out.30086
 ------------------------------------------------------------------
Profile data file 'callgrind.out.30086' (creator: callgrind-3.11.0)
 ------------------------------------------------------------------
 ...
   400,004  int isPrime(int x) {
         .      int i;
   900,013      int max = sqrt(x)+1;
   500,000  => ???:sqrt (100001x)
11,122,449      for (i = 2; i < max; i++) { //no prime number is less than 2
16,476,120          if (x % i == 0) { //if the number is divisible by i
   180,818              return 0; //it is not prime
         .          }
         .      }
     9,592      return 1; //otherwise it is prime
   200,002  }
         .
         .  // finds the next prime
    38,368  int getNextPrime(int prev) {
    28,776      int next = prev + 1;
   509,597      while (!isPrime(next)) { //while the number is not prime
29,789,794  => optExample2.c:isPrime (100001x)
    90,409          next++; //increment and check again
         .      }
     9,592      return next;
    19,184  }

Moving the call to sqrt outside of the for loop reduces the number of times the sqrt function is called from roughly 2.7 million to 100,000 (a reduction of about 96%). This number corresponds to the number of times the isPrime function is called, confirming that sqrt now executes only once per invocation of isPrime.

Note that the compiler was able to perform significant levels of optimization when optimization flags were specified, even when the programmer did not manually perform code motion. In this case, the reason is a special instruction called fsqrt that is specified by the x86 ISA. When optimization flags are turned on, the compiler replaces all instances of the sqrt function with the fsqrt instruction. This process is known as inlining, and we cover it in greater detail in the following section. Because fsqrt is an instruction rather than a function call, it is easier for the compiler to identify its loop-invariant nature and move it outside the body of the loop.
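
One way to observe this effect (the exact output will vary with the compiler version, optimization level, and target architecture) is to compile the program to assembly and search for sqrt; with optimization enabled, the call to the sqrt library function may be replaced by a hardware square-root instruction:

$ gcc -O1 -S optExample.c
$ grep sqrt optExample.s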