12.1. Code Optimization First Steps: Code Profiling
"The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming."  Don Knuth, The Art of Computer Programming
One of the biggest dangers in code optimization is the concept of premature optimization. Premature optimization occurs when a programmer attempts to optimize based on "gut feelings" of where performance inefficiencies occur, and not on data. Whenever possible, it is important to measure the runtime of different portions of code on different inputs prior to starting optimization to identify hot spots or areas in the program in which the most instructions occur.
To figure out how to optimize optExample.c, let’s start by taking
a closer look at the main()
function:
int main(int argc, char **argv) {
//omitted for brevity
int limit = strtol(argv[1], NULL, 10);
int length = limit;
int *array = allocateArray(length); //allocates array of specified length
genPrimeSequence(array, limit, &length); //generates sequence of primes
return 0;
}
The main()
function contains calls to two functions: allocateArray()
, which initializes
an array of a userspecified length (or limit); and genPrimeSequence()
that generates a sequence of
primes within the specified limit (note that for any sequence between 2 and n, there cannot be
more than n primes, and frequently there is significantly less). The main()
function in the
C file contains code that times each of the two functions above. Compiling and running the code
with limit set to 5,000,000 reveals the following:
$ gcc o optExample optExample.c lm $ time p ./optExample 5000000 Time to allocate: 5.5e05 Time to generate primes: 3.85525 348513 primes found. real 3.85 user 3.86 sys 0.00
The optExample
program takes approximately 3.86 seconds to complete, with nearly all of the time in the
genPrimeSequence()
function. There is no point in spending time optimizing allocateArray()
as any
improvements will be negligible to the runtime of the overall program. In the examples that follow,
we focus more closely on the genPrimeSequence()
function and its associated functions. The functions
are reproduced below for the reader’s convenience:
//helper function: checks to see if a number is prime
int isPrime(int x) {
int i;
for (i = 2; i < sqrt(x) + 1; i++) { //no prime number is less than 2
if (x % i == 0) { //if the number is divisible by i
return 0; //it is not prime
}
}
return 1; //otherwise it is prime
}
// finds the next prime
int getNextPrime(int prev) {
int next = prev + 1;
while (!isPrime(next)) { //while the number is not prime
next++; //increment and check again
}
return next;
}
// generates a sequence of primes
int genPrimeSequence(int *array, int limit) {
int i;
int len = limit;
if (len == 0) return 0;
array[0] = 2; //initialize the first number to 2
for (i = 1; i < len; i++) {
array[i] = getNextPrime(array[i1]); //fill in the array
if (array[i] > limit) {
len = i;
return len;
}
}
return len;
}
To find hot spots in a program, focus on the areas with the most loops. Manual inspection of code can assist in locating hot spots,
though it should always be verified with benchmarking tools prior to attempting optimization. A manual inspection of the optExample
program yields the following:

The
genPrimeSequence()
function attempts to generate all the prime numbers between 2 and some integer n. Since the number of primes between 2 and n cannot exceed n, thefor
loop ingenPrimeSequence()
runs no more than n times. Every iteration of thefor
loop calls thegetNextPrime()
function once. Thus,getNextPrime()
runs no more than n times. 
The
while
loop in thegetNextPrime()
function will continue running until a prime is discovered. While it is difficult to determine the number of times thewhile
loop in thegetNextPrime()
function will execute ahead of time as a function of n (the gap between consecutive prime numbers can be arbitrarily large), it is certain thatisPrime()
executes every iteration of thewhile
loop. 
The
isPrime()
function contains exactly onefor
loop. Suppose the loop runs a total of k iterations. Then, the code in the loop body runs k total times. Recall that the structure of a forloop consists of an initialization statement (which initializes the loop variable to a particular value), a boolean expression (that determines when to terminate the loop), and a step expression (that updates the loop variable every iteration). Table 1 depicts the number of times each loop component executes in afor
loop that runs for k iterations. In everyfor
loop, initialization happens exactly once. The boolean expression executes k+1 times, since it must perform one final check to terminate the loop. The loop body and the step expression execute k times each.
initialization statement  boolean expression  step expression  loop body 

1 
k+1 
k 
k 
Our manual inspection of the code suggests that the program spends most of its time in the isPrime()
function, and that the sqrt()
function
executes the most often. Let’s next use code profiling to verify this hypothesis.
12.1.1. Using Callgrind to profile
In our small program, it was relatively straightforward to use manual inspection to form the hypothesis that the sqrt()
function occurs in a "hot spot" in the code.
However, identifying hot spots can become more complex in larger programs. Regardless, it is a good idea to use profiling to verify our hypothesis.
Code profiling tools like Valgrind provide a lot of information about program execution. In this section, we use the callgrind
tool
to inspect the OptExample
program’s call graph.
To use callgrind
, let’s start by recompiling the optExample
program with the g
flag and running callgrind
on a smaller range (2
to 100,000
). Like other
Valgrind applications, callgrind
runs as a wrapper around a program, adding annotations such as the number of times functions execute and the total number of
instructions that are executed as a result. Consequently, the optExample
program will take longer to execute when run in conjunction with callgrind
.
$ gcc g o optExample optExample.c lm $ valgrind tool=callgrind ./optExample 100000 ==32590== Callgrind, a callgraph generating cache profiler ==32590== Copyright (C) 20022015, and GNU GPL'd, by Josef Weidendorfer et al. ==32590== Using Valgrind3.11.0 and LibVEX; rerun with h for copyright info ==32590== Command: ./optExample 100000 ==32590== ==32590== For interactive control, run 'callgrind_control h'. Time to allocate: 0.003869 Time to generate primes: 0.644743 9592 primes found. ==32590== ==32590== Events : Ir ==32590== Collected : 68338759 ==32590== ==32590== I refs: 68,338,759
Typing ls
on the terminal reveals a new file called callgrind.out.xxxxx
where xxxxx
is a unique id. In my case, the file is callgrind.out.32590
(i.e., the number shown along the lefthand column of the above output). Running callgrind_annotate
on this file yields additional information on the three functions of interest:
$ callgrind_annotate auto=yes callgrind.out.32590  Profile data file 'callgrind.out.32393' (creator: callgrind3.11.0)  ... . //helper function: checks to see if a number is prime 400,004 int isPrime(int x) { . int i; 36,047,657 for (i = 2; i < sqrt(x)+1; i++) { //no prime number is less than 2 13,826,015 => ???:sqrt (2765204x) 16,533,672 if (x % i == 0) { //if the number is divisible by i 180,818 return 0; //it is not prime . } . } 9,592 return 1; //otherwise it is prime 200,002 } . . // finds the next prime 38,368 int getNextPrime(int prev) { 28,776 int next = prev + 1; 509,597 while (!isPrime(next)) { //while the number is not prime 67,198,556 => optExample.c:isPrime (100001x) 90,409 next++; //increment and check again . } 9,592 return next; 19,184 } . . // generates a sequence of primes 6 int genPrimeSequence(int * array, int limit) { . int i; 2 int len = limit; 2 if (len == 0) return 0; 2 array[0]=2; //initialize the first number to 2 38,369 for (i = 1; i < len; i++) { 143,880 array[i] = getNextPrime(array[i1]); //fill in the array 67,894,482 => optExample.c:getNextPrime (9592x) 76,736 if (array[i] > limit){ 2 len = i; 2 return len; . } . } . return len; 4 }
The numbers along the lefthand column represent the number of total executed instructions associated with each line. The numbers in the parentheses indicate the number of times a particular function was
run. Using the numbers along the lefthand column, we are able to verify the results of our manual inspection. In the genPrimeSequence()
function, the getNextPrime()
function resulted in
the most number of executed instructions at 67.8 million instructions, corresponding to 9,592 function calls (to generate the primes between 2
and 100,000
). Inspecting
getNextPrime()
reveals that the majority of those instructions (67.1 million, or 99%) result from the call to isPrime()
, which is called a total of 100,001 times.
Lastly, inspecting isPrime()
reveals that 13 million of the total instructions (20.5%) result from the sqrt()
function, which executes a total of 2.7 million times.
These results verify our original hypothesis that the program spends most of its time in the isPrime()
function, with the sqrt()
function executing the most frequently of all the functions.
Reducing the total number of executed instructions will result in a faster program; the above analysis suggests that our initial set of efforts should concentrate in improving the isPrime()
function, and potentially reducing the number of times sqrt()
executes.
12.1.2. Loopinvariant Code Motion
Loopinvariant code motion is an optimization technique that moves static computations that occur inside a loop to outside
the loop without affecting the loop’s behavior. Optimizing compilers are capable of making most loopinvariant code optimizations
automatically. Specifically, the fmoveloopinvariants
compiler flag in gcc (enabled at level O1
) attempts to identify examples
of loopinvariant code motion and move it outside of a loop.
However, the compiler cannot always identify cases of loopinvariant code motion, especially in the case of function calls. Since
function calls can inadvertently cause side effects (unintended behavior), most compilers will avoid trying to determine if a
function call consistently returns the same result. Thus, while the programmer knows that sqrt(x)
always returns the
square root of some input x
, gcc will not always make that assumption. Consider the case where the sqrt()
function updates a
secret global variable g
. In that case, calling sqrt()
once outside of the function (one update to g
) is not the same as
calling it every iteration of the loop (n updates to g
). If a compiler cannot determine that a function always returns the
same result, it will not automatically move the sqrt()
function outside the loop.
However, the programmer knows that moving the computation sqrt(x) + 1
outside the for loop does not effect the loop’s behavior.
The updated function is shown below and is available in the following file:
//helper function: checks to see if a number is prime
int isPrime(int x) {
int i;
int max = sqrt(x)+1;
for (i = 2; i < max; i++) { //no prime number is less than 2
if (x % i == 0) { //if the number is divisible by i
return 0; //it is not prime
}
}
return 1; //otherwise it is prime
}
Table 2 shows that this simple change shaves off a full 2 seconds (47%) of the runtime of optExample2
, even before using
compiler flags. Furthermore, the compiler seems to have a slightly easier time optimizing optExample2
.
Version  Unoptimized  O1 
O2 
O3 

Original 
3.86 
2.32 
2.14 
2.15 
With loopinvariant code motion 
1.83 
1.63 
1.71 
1.63 
Rerunning callgrind
on the optExample2
executable reveals why such a large improvement in runtime was observed. The
following code snippet assumes that the file callgrind.out.30086
contains the annotations of running callgrind on the
optExample2
executable:
$ gcc g o optExample2 optExample2.c lm $ valgrind tool=callgrind ./optExample2 100000 $ callgrind_annotate auto=yes callgrind.out.30086  Profile data file 'callgrind.out.30086' (creator: callgrind3.11.0)  ... 400,004 int isPrime(int x) { . int i; 900,013 int max = sqrt(x)+1; 500,000 => ???:sqrt (100001x) 11,122,449 for (i = 2; i < max; i++) { //no prime number is less than 2 16,476,120 if (x % i == 0) { //if the number is divisible by i 180,818 return 0; //it is not prime . } . } 9,592 return 1; //otherwise it is prime 200,002 } . . // finds the next prime 38,368 int getNextPrime(int prev) { 28,776 int next = prev + 1; 509,597 while (!isPrime(next)) { //while the number is not prime 29,789,794 => optExample2.c:isPrime (100001x) 90,409 next++; //increment and check again . } 9,592 return next; 19,184 }
Moving the call to sqrt()
outside of the for
loop reduces the number of times the sqrt()
function is called in the program from 2.7 million to 100,000 (96% reduction). This number
corresponds to the number of times the isPrime()
function is called, confirming that the sqrt()
function executes only once with every invocation of the isPrime()
function.
Note that the compiler was able to perform significant levels of optimization when optimization flags are specified, even if the programmer does not manually perform code motion. In
this case, the reason is due to a special instruction called fsqrt
that is specified by the x86 ISA. When optimization flags are turned on, the compiler replaces all instances
of the sqrt()
function with the fsqrt
instruction. This process is known as inlining, and we cover it greater detail in the following section. Since fsqrt
is no longer a
function, it is easier for the compiler to identify its loopinvariant nature, and move it outside the body of the loop.