12.1. Code Optimization First Steps: Code Profiling
"The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming."  Don Knuth, The Art of Computer Programming
One of the biggest dangers in code optimization is the concept of premature optimization. Premature optimization occurs when a programmer attempts to optimize based on "gut feelings" of where performance inefficiencies occur, and not on data. Whenever possible, it is important to measure the runtime of different portions of code on different inputs prior to starting optimization to identify hot spots or areas in the program in which the most instructions occur.
To figure out how to optimize optExample.c,
let’s start by taking a closer look at the main
function:
int main(int argc, char **argv) {
// errorhandling and timing code omitted for brevity
int limit = strtol(argv[1], NULL, 10);
int length = limit;
int *array = allocateArray(length); //allocates array of specified length
genPrimeSequence(array, limit, &length); //generates sequence of primes
return 0;
}
The main
function contains calls to two functions: allocateArray
, which
initializes an array of a userspecified length (or limit), and
genPrimeSequence
, which generates a sequence of primes within the specified
limit (note that for any sequence between 2 and n, there cannot be more than
n primes, and frequently there are significantly less). The main
function
in the C file contains code that times each
of the two functions above. Compiling and running the code with limit
set to
5,000,000 reveals the following:
$ gcc o optExample optExample.c lm $ time p ./optExample 5000000 Time to allocate: 5.5e05 Time to generate primes: 3.85525 348513 primes found. real 3.85 user 3.86 sys 0.00
The optExample
program takes approximately 3.86 seconds to complete, with
nearly all of the time in the genPrimeSequence
function. There is no point
in spending time optimizing allocateArray
as any improvements will be
negligible to the runtime of the overall program. In the examples that follow,
we focus more closely on the genPrimeSequence
function and its associated
functions. The functions are reproduced below for convenience:
// helper function: checks to see if a number is prime
int isPrime(int x) {
int i;
for (i = 2; i < sqrt(x) + 1; i++) { //no prime number is less than 2
if (x % i == 0) { //if the number is divisible by i
return 0; //it is not prime
}
}
return 1; //otherwise it is prime
}
// finds the next prime
int getNextPrime(int prev) {
int next = prev + 1;
while (!isPrime(next)) { //while the number is not prime
next++; //increment and check again
}
return next;
}
// generates a sequence of primes
int genPrimeSequence(int *array, int limit) {
int i;
int len = limit;
if (len == 0) return 0;
array[0] = 2; //initialize the first number to 2
for (i = 1; i < len; i++) {
array[i] = getNextPrime(array[i1]); //fill in the array
if (array[i] > limit) {
len = i;
return len;
}
}
return len;
}
To find hot spots in a program, focus on the areas with the most loops. Manual
inspection of code can assist in locating hot spots, though it should always be
verified with benchmarking tools prior to attempting optimization. A manual
inspection of the optExample
program yields the following observations.

The
genPrimeSequence
function attempts to generate all the prime numbers between 2 and some integer n. Since the number of primes between 2 and n cannot exceed n, thefor
loop ingenPrimeSequence
runs no more than n times. Every iteration of thefor
loop calls thegetNextPrime
function once. Thus,getNextPrime
runs no more than n times. 
The
while
loop in thegetNextPrime
function will continue running until a prime is discovered. While it is difficult to determine the number of times thewhile
loop in thegetNextPrime
function will execute ahead of time as a function of n (the gap between consecutive prime numbers can be arbitrarily large), it is certain thatisPrime
executes on every iteration of thewhile
loop. 
The
isPrime
function contains exactly onefor
loop. Suppose the loop runs for a total of k iterations. Then, the code in the loop body runs k times in total. Recall that the structure of a for loop consists of an initialization statement (which initializes the loop variable to a particular value), a boolean expression (that determines when to terminate the loop), and a step expression (that updates the loop variable every iteration). Table 1 depicts the number of times each loop component executes in afor
loop that runs for k iterations. In everyfor
loop, initialization happens exactly once. The boolean expression executes k + 1 times for k iterations, since it must perform one final check to terminate the loop. The loop body and the step expression execute k times each.
initialization statement  boolean expression  step expression  loop body 

1 
k+1 
k 
k 
Our manual inspection of the code suggests that the program spends most of its
time in the isPrime
function, and that the sqrt
function executes the
most often. Let’s next use code profiling to verify this hypothesis.
12.1.1. Using Callgrind to Profile
In our small program, it was relatively straightforward to use manual
inspection to form the hypothesis that the sqrt
function occurs in a "hot
spot" in the code. However, identifying hot spots can become more complex in
larger programs. Regardless, it is a good idea to use profiling to verify our
hypothesis. Code profiling tools like Valgrind provide a
lot of information about program execution. In this section, we use the
callgrind
tool to inspect the OptExample
program’s call graph.
To use callgrind
, let’s start by recompiling the optExample
program with
the g
flag and running callgrind
on a smaller range (2 to 100,000).
Like other Valgrind applications, callgrind
runs as a wrapper around a
program, adding annotations such as the number of times functions execute and
the total number of instructions that are executed as a result. Consequently,
the optExample
program will take longer to execute when run in conjunction
with callgrind
.
$ gcc g o optExample optExample.c lm $ valgrind tool=callgrind ./optExample 100000 ==32590== Callgrind, a callgraph generating cache profiler ==32590== Copyright (C) 20022015, and GNU GPL'd, by Josef Weidendorfer et al. ==32590== Using Valgrind3.11.0 and LibVEX; rerun with h for copyright info ==32590== Command: ./optExample 100000 ==32590== ==32590== For interactive control, run 'callgrind_control h'. Time to allocate: 0.003869 Time to generate primes: 0.644743 9592 primes found. ==32590== ==32590== Events : Ir ==32590== Collected : 68338759 ==32590== ==32590== I refs: 68,338,759
Typing ls
at the terminal reveals a new file called callgrind.out.xxxxx
,
where xxxxx
is a unique id. In my case, the file is callgrind.out.32590
(i.e., the number shown along the lefthand column of the above output).
Running callgrind_annotate
on this file yields additional information on the
three functions of interest:
$ callgrind_annotate auto=yes callgrind.out.32590  Profile data file 'callgrind.out.32393' (creator: callgrind3.11.0)  ... . //helper function: checks to see if a number is prime 400,004 int isPrime(int x) { . int i; 36,047,657 for (i = 2; i < sqrt(x)+1; i++) { //no prime is less than 2 13,826,015 => ???:sqrt (2765204x) 16,533,672 if (x % i == 0) { //if the number is divisible by i 180,818 return 0; //it is not prime . } . } 9,592 return 1; //otherwise it is prime 200,002 } . . // finds the next prime 38,368 int getNextPrime(int prev) { 28,776 int next = prev + 1; 509,597 while (!isPrime(next)) { //while the number is not prime 67,198,556 => optExample.c:isPrime (100001x) 90,409 next++; //increment and check again . } 9,592 return next; 19,184 } . . // generates a sequence of primes 6 int genPrimeSequence(int * array, int limit) { . int i; 2 int len = limit; 2 if (len == 0) return 0; 2 array[0]=2; //initialize the first number to 2 38,369 for (i = 1; i < len; i++) { 143,880 array[i] = getNextPrime(array[i1]); //fill in the array 67,894,482 => optExample.c:getNextPrime (9592x) 76,736 if (array[i] > limit){ 2 len = i; 2 return len; . } . } . return len; 4 }
The numbers along the lefthand column represent the number of total executed
instructions associated with each line. The numbers in parentheses indicate
the number of times a particular function was run. Using the numbers along the
lefthand column, we are able to verify the results of our manual inspection.
In the genPrimeSequence
function, the getNextPrime
function resulted in
the most number of executed instructions at 67.8 million instructions,
corresponding to 9,592 function calls (to generate the primes between 2 and
100,000). Inspecting getNextPrime
reveals that the majority of those
instructions (67.1 million, or 99%) result from the call to isPrime
, which
is called a total of 100,001 times. Lastly, inspecting isPrime
reveals
that 13 million of the total instructions (20.5%) result from the sqrt
function, which executes a total of 2.7 million times.
These results verify our original hypothesis that the program spends most of
its time in the isPrime
function, with the sqrt
function executing the
most frequently of all the functions. Reducing the total number of executed
instructions results in a faster program; the above analysis suggests that
our initial efforts should concentrate on improving the isPrime
function, and potentially reducing the number of times sqrt
executes.
12.1.2. LoopInvariant Code Motion
Loopinvariant code motion is an optimization technique that moves static
computations that occur inside a loop to outside the loop without affecting the
loop’s behavior. Optimizing compilers are capable of making most loopinvariant
code optimizations automatically. Specifically, the fmoveloopinvariants
compiler flag in GCC (enabled at level O1
) attempts to identify examples of
loopinvariant code motion and move them outside their respective loop.
However, the compiler cannot always identify cases of loopinvariant code
motion, especially in the case of function calls. Since function calls can
inadvertently cause side effects (unintended behavior), most compilers will
avoid trying to determine if a function call consistently returns the same
result. Thus, while the programmer knows that sqrt(x)
always returns the
square root of some input x
, GCC will not always make that assumption.
Consider the case where the sqrt
function updates a secret global
variable g
. In that case, calling sqrt
once outside of the function
(one update to g
) is not the same as calling it every iteration of the loop
(n updates to g
). If a compiler cannot determine that a function always
returns the same result, it will not automatically move the sqrt
function
outside the loop.
However, the programmer knows that moving the computation sqrt(x) + 1
outside
the for
loop does not effect the loop’s behavior. The updated function is
shown below and is available in the
following file:
//helper function: checks to see if a number is prime
int isPrime(int x) {
int i;
int max = sqrt(x)+1;
for (i = 2; i < max; i++) { //no prime number is less than 2
if (x % i == 0) { //if the number is divisible by i
return 0; //it is not prime
}
}
return 1; //otherwise it is prime
}
Table 2 shows that this simple change shaves off a full 2 seconds (47%)
of the runtime of optExample2
, even before using compiler flags.
Furthermore, the compiler seems to have a slightly easier time optimizing
optExample2
.
Version  Unoptimized  O1 
O2 
O3 

Original 
3.86 
2.32 
2.14 
2.15 
With loopinvariant code motion 
1.83 
1.63 
1.71 
1.63 
Rerunning callgrind
on the optExample2
executable reveals why such a large
improvement in runtime was observed. The
following code snippet assumes that the file callgrind.out.30086
contains the annotations of running callgrind
on the optExample2
executable:
$ gcc g o optExample2 optExample2.c lm $ valgrind tool=callgrind ./optExample2 100000 $ callgrind_annotate auto=yes callgrind.out.30086  Profile data file 'callgrind.out.30086' (creator: callgrind3.11.0)  ... 400,004 int isPrime(int x) { . int i; 900,013 int max = sqrt(x)+1; 500,000 => ???:sqrt (100001x) 11,122,449 for (i = 2; i < max; i++) { //no prime number is less than 2 16,476,120 if (x % i == 0) { //if the number is divisible by i 180,818 return 0; //it is not prime . } . } 9,592 return 1; //otherwise it is prime 200,002 } . . // finds the next prime 38,368 int getNextPrime(int prev) { 28,776 int next = prev + 1; 509,597 while (!isPrime(next)) { //while the number is not prime 29,789,794 => optExample2.c:isPrime (100001x) 90,409 next++; //increment and check again . } 9,592 return next; 19,184 }
Moving the call to sqrt
outside of the for
loop reduces the number of
times the sqrt
function is called in the program from 2.7 million to
100,000 (96% reduction). This number corresponds to the number of times the
isPrime
function is called, confirming that the sqrt
function executes
only once with every invocation of the isPrime
function.
Note that the compiler was able to perform significant levels of optimization
when optimization flags were specified, even if the programmer does not manually
perform code motion. In this case, the reason is due to a special instruction
called fsqrt
that is specified by the x86 ISA. When optimization flags are
turned on, the compiler replaces all instances of the sqrt
function with
the fsqrt
instruction. This process is known as inlining, and we cover it
greater detail in the following section. Since fsqrt
is no longer a function,
it is easier for the compiler to identify its loopinvariant nature, and move
it outside the body of the loop.