## 12.1. Code Optimization First Steps: Code Profiling

"The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming." - Donald Knuth, The Art of Computer Programming

One of the biggest dangers in code optimization is the concept of premature optimization. Premature optimization occurs when a programmer attempts to optimize based on "gut feelings" about where performance inefficiencies lie, rather than on data. Whenever possible, it is important to measure the runtime of different portions of code on different inputs prior to starting optimization, in order to identify hot spots: areas in the program where the most instructions occur.

To figure out how to optimize `optExample.c`, let’s start by taking a closer look at the `main` function:

```
int main(int argc, char **argv) {
    // error-handling and timing code omitted for brevity

    int limit = strtol(argv[1], NULL, 10);
    int length = limit;
    int *array = allocateArray(length); //allocates array of specified length

    length = genPrimeSequence(array, limit); //generates sequence of primes

    return 0;
}
```

The `main` function contains calls to two functions: `allocateArray`, which initializes an array of a user-specified length (or limit), and `genPrimeSequence`, which generates a sequence of primes within the specified limit (note that for any sequence between 2 and n, there cannot be more than n primes, and frequently there are significantly fewer). The `main` function in the C file contains code that times each of the two functions above. Compiling and running the code with `limit` set to 5,000,000 reveals the following:

```
$ gcc -o optExample optExample.c -lm
$ time -p ./optExample 5000000
Time to allocate: 5.5e-05
Time to generate primes: 3.85525
348513 primes found.
real 3.85
user 3.86
sys 0.00
```

The `optExample` program takes approximately 3.86 seconds to complete, with nearly all of that time spent in the `genPrimeSequence` function. There is no point in spending time optimizing `allocateArray`, because any improvements will be negligible relative to the runtime of the overall program. In the examples that follow, we focus more closely on the `genPrimeSequence` function and its associated functions. The functions are reproduced here for convenience:

```
// helper function: checks to see if a number is prime
int isPrime(int x) {
    int i;
    for (i = 2; i < sqrt(x) + 1; i++) { //no prime number is less than 2
        if (x % i == 0) { //if the number is divisible by i
            return 0; //it is not prime
        }
    }
    return 1; //otherwise it is prime
}

// finds the next prime
int getNextPrime(int prev) {
    int next = prev + 1;
    while (!isPrime(next)) { //while the number is not prime
        next++; //increment and check again
    }
    return next;
}

// generates a sequence of primes
int genPrimeSequence(int *array, int limit) {
    int i;
    int len = limit;
    if (len == 0) return 0;
    array[0] = 2; //initialize the first number to 2
    for (i = 1; i < len; i++) {
        array[i] = getNextPrime(array[i-1]); //fill in the array
        if (array[i] > limit) {
            len = i;
            return len;
        }
    }
    return len;
}
```

To find hot spots in a program, focus on the areas with the most loops. Manual inspection of code can assist in locating hot spots, though it should always be verified with benchmarking tools prior to attempting optimization. A manual inspection of the `optExample` program yields the following observations.

• The `genPrimeSequence` function attempts to generate all the prime numbers between 2 and some integer n. Since the number of primes between 2 and n cannot exceed n, the `for` loop in `genPrimeSequence` runs no more than n times. Every iteration of the `for` loop calls the `getNextPrime` function once. Thus, `getNextPrime` runs no more than n times.

• The `while` loop in the `getNextPrime` function will continue running until a prime is discovered. Although it is difficult to determine the number of times the `while` loop in the `getNextPrime` function will execute ahead of time as a function of n (the gap between consecutive prime numbers can be arbitrarily large), it is certain that `isPrime` executes on every iteration of the `while` loop.

• The `isPrime` function contains exactly one `for` loop. Suppose that the loop runs for a total of k iterations. Then, the code in the loop body runs k times in total. Recall that the structure of a for loop consists of an initialization statement (which initializes the loop variable to a particular value), a Boolean expression (that determines when to terminate the loop), and a step expression (that updates the loop variable every iteration). Table 1 depicts the number of times each loop component executes in a `for` loop that runs for k iterations. In every `for` loop, initialization happens exactly once. The Boolean expression executes k + 1 times for k iterations, since it must perform one final check to terminate the loop. The loop body and the step expression execute k times each.

Table 1. Loop execution components (assuming k iterations)

| Initialization statement | Boolean expression | Step expression | Loop body |
| --- | --- | --- | --- |
| 1 | k+1 | k | k |

Our manual inspection of the code suggests that the program spends most of its time in the `isPrime` function, and that the `sqrt` function executes the most often. Let’s next use code profiling to verify this hypothesis.

### 12.1.1. Using Callgrind to Profile

In our small program, it was relatively straightforward to use manual inspection to form the hypothesis that the `sqrt` function occurs in a "hot spot" in the code. However, identifying hot spots becomes more complex in larger programs. Regardless, it is a good idea to use profiling to verify our hypothesis. Code profiling tools like Valgrind provide a lot of information about program execution. In this section, we use the `callgrind` tool to inspect the `optExample` program’s call graph.

To use `callgrind`, let’s start by recompiling the `optExample` program with the `-g` flag and running `callgrind` on a smaller range (2 to 100,000). Like other Valgrind applications, `callgrind` runs as a wrapper around a program, adding annotations such as the number of times functions execute and the total number of instructions that are executed as a result. Consequently, the `optExample` program will take longer to execute when run in conjunction with `callgrind`.

```
$ gcc -g -o optExample optExample.c -lm
$ valgrind --tool=callgrind ./optExample 100000
==32590== Callgrind, a call-graph generating cache profiler
==32590== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==32590== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==32590== Command: ./optExample 100000
==32590==
==32590== For interactive control, run 'callgrind_control -h'.
Time to allocate: 0.003869
Time to generate primes: 0.644743
9592 primes found.
==32590==
==32590== Events    : Ir
==32590== Collected : 68338759
==32590==
==32590== I   refs:      68,338,759
```

Typing `ls` at the terminal reveals a new file called `callgrind.out.xxxxx`, where `xxxxx` is a unique id. In this case, the file is `callgrind.out.32590` (i.e., the number shown along the left-hand column in the preceding output). Running `callgrind_annotate` on this file yields additional information on the three functions of interest:

```
$ callgrind_annotate --auto=yes callgrind.out.32590
----------------------------------------------------------------
Profile data file 'callgrind.out.32590' (creator: callgrind-3.11.0)
----------------------------------------------------------------
...
         .  //helper function: checks to see if a number is prime
   400,004  int isPrime(int x) {
         .      int i;
36,047,657      for (i = 2; i < sqrt(x)+1; i++) { //no prime is less than 2
13,826,015  => ???:sqrt (2765204x)
16,533,672          if (x % i == 0) { //if the number is divisible by i
   180,818              return 0; //it is not prime
         .          }
         .      }
     9,592      return 1; //otherwise it is prime
   200,002  }
         .
         .  // finds the next prime
    38,368  int getNextPrime(int prev) {
    28,776      int next = prev + 1;
   509,597      while (!isPrime(next)) { //while the number is not prime
67,198,556  => optExample.c:isPrime (100001x)
    90,409          next++; //increment and check again
         .      }
     9,592      return next;
    19,184  }
         .
         .  // generates a sequence of primes
         6  int genPrimeSequence(int *array, int limit) {
         .      int i;
         2      int len = limit;
         2      if (len == 0) return 0;
         2      array[0] = 2; //initialize the first number to 2
    38,369      for (i = 1; i < len; i++) {
   143,880          array[i] = getNextPrime(array[i-1]); //fill in the array
67,894,482  => optExample.c:getNextPrime (9592x)
    76,736          if (array[i] > limit) {
         2              len = i;
         2              return len;
         .          }
         .      }
         .      return len;
         4  }
```

The numbers along the left-hand column represent the total number of executed instructions associated with each line. The numbers in parentheses indicate the number of times a particular function was called. Using the numbers along the left-hand column, we are able to verify the results of our manual inspection. In the `genPrimeSequence` function, the calls to `getNextPrime` account for the largest number of executed instructions, at 67.9 million instructions over 9,592 function calls (to generate the primes between 2 and 100,000). Inspecting `getNextPrime` reveals that the vast majority of those instructions (67.2 million, or 99%) result from the calls to `isPrime`, which executes a total of 100,001 times. Lastly, inspecting `isPrime` reveals that 13.8 million instructions (roughly 20% of the program's total) result from the `sqrt` function, which executes a total of 2.8 million times.

These results verify our original hypothesis that the program spends most of its time in the `isPrime` function, with the `sqrt` function executing the most frequently of all the functions. Reducing the total number of executed instructions results in a faster program; the above analysis suggests that our initial efforts should concentrate on improving the `isPrime` function, and potentially reducing the number of times `sqrt` executes.

### 12.1.2. Loop-Invariant Code Motion

Loop-invariant code motion is an optimization technique that moves static computations from inside a loop to outside the loop without affecting the loop’s behavior. Optimizing compilers are capable of performing most loop-invariant code motion automatically. Specifically, the `-fmove-loop-invariants` compiler flag in GCC (enabled at level `-O1`) attempts to identify loop-invariant computations and move them outside their respective loops.

However, the compiler cannot always identify loop-invariant code, especially in the case of function calls. Because function calls can inadvertently cause side effects (unintended behavior), most compilers avoid trying to determine whether a function call consistently returns the same result. Thus, even though the programmer knows that `sqrt(x)` always returns the square root of some input `x`, GCC will not always make that assumption. Consider the case where the `sqrt` function updates a secret global variable, `g`. In that case, calling `sqrt` once outside of the loop (one update to `g`) is not the same as calling it on every iteration of the loop (n updates to `g`). If a compiler cannot determine that a function always returns the same result, it will not automatically move the `sqrt` function outside the loop.

However, the programmer knows that moving the computation `sqrt(x) + 1` outside the `for` loop does not affect the loop’s behavior. The updated function (in `optExample2.c`) is shown here:

```
//helper function: checks to see if a number is prime
int isPrime(int x) {
    int i;
    int max = sqrt(x) + 1;
    for (i = 2; i < max; i++) { //no prime number is less than 2
        if (x % i == 0) { //if the number is divisible by i
            return 0; //it is not prime
        }
    }
    return 1; //otherwise it is prime
}
```

Table 2 shows that this simple change shaves roughly two seconds (a 53% reduction) off the runtime of `optExample2`, even before any compiler flags are used. Furthermore, the compiler seems to have a slightly easier time optimizing `optExample2`.

Table 2. Time in Seconds to Produce the Prime Numbers Between 2 and 5,000,000

| Version | Unoptimized | `-O1` | `-O2` | `-O3` |
| --- | --- | --- | --- | --- |
| Original | 3.86 | 2.32 | 2.14 | 2.15 |
| With loop-invariant code motion | 1.83 | 1.63 | 1.71 | 1.63 |

Rerunning `callgrind` on the `optExample2` executable reveals why such a large improvement in runtime was observed. The following code snippet assumes that the file `callgrind.out.30086` contains the annotations of running `callgrind` on the `optExample2` executable:

```
$ gcc -g -o optExample2 optExample2.c -lm
$ valgrind --tool=callgrind ./optExample2 100000
$ callgrind_annotate --auto=yes callgrind.out.30086
------------------------------------------------------------------
Profile data file 'callgrind.out.30086' (creator: callgrind-3.11.0)
------------------------------------------------------------------
...
   400,004  int isPrime(int x) {
         .      int i;
   900,013      int max = sqrt(x)+1;
   500,000  => ???:sqrt (100001x)
11,122,449      for (i = 2; i < max; i++) { //no prime number is less than 2
16,476,120          if (x % i == 0) { //if the number is divisible by i
   180,818              return 0; //it is not prime
         .          }
         .      }
     9,592      return 1; //otherwise it is prime
   200,002  }
         .
         .  // finds the next prime
    38,368  int getNextPrime(int prev) {
    28,776      int next = prev + 1;
   509,597      while (!isPrime(next)) { //while the number is not prime
29,789,794  => optExample2.c:isPrime (100001x)
    90,409          next++; //increment and check again
         .      }
     9,592      return next;
    19,184  }
```

Moving the call to `sqrt` outside of the `for` loop reduces the number of times the `sqrt` function is called in the program from 2.7 million to 100,000 (96% reduction). This number corresponds to the number of times the `isPrime` function is called, confirming that the `sqrt` function executes only once with every invocation of the `isPrime` function.

Note that the compiler was able to perform significant optimization when optimization flags were specified, even though the programmer did not manually perform code motion. In this case, the reason is a special instruction called `fsqrt` that is specified by the x86 ISA. When optimization flags are turned on, the compiler replaces all calls to the `sqrt` function with the `fsqrt` instruction. This process is known as inlining, and we cover it in greater detail in the following section. Because `fsqrt` is no longer a function call, it is easier for the compiler to identify its loop-invariant nature and move it outside the body of the loop.