12.1. Code Optimization First Steps: Code Profiling
"The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming." - Don Knuth, The Art of Computer Programming
One of the biggest dangers in code optimization is the concept of premature optimization. Premature optimization occurs when a programmer attempts to optimize based on "gut feelings" of where performance inefficiencies occur, and not on data. Whenever possible, it is important to measure the runtime of different portions of code on different inputs prior to starting optimization to identify hot spots or areas in the program in which the most instructions occur.
To figure out how to optimize optExample.c,
let’s start by taking a closer look at the main function:
int main(int argc, char **argv) {
// error-handling and timing code omitted for brevity
int limit = strtol(argv[1], NULL, 10);
int length = limit;
int *array = allocateArray(length); //allocates array of specified length
genPrimeSequence(array, limit, &length); //generates sequence of primes
return 0;
}
The main function contains calls to two functions: allocateArray, which
initializes an array of a user-specified length (or limit), and
genPrimeSequence, which generates a sequence of primes within the specified
limit (note that for any sequence between 2 and n, there cannot be more than
n primes, and frequently there are significantly less). The main function
in the C file contains code that times each
of the two functions above. Compiling and running the code with limit set to
5,000,000 reveals the following:
$ gcc -o optExample optExample.c -lm $ time -p ./optExample 5000000 Time to allocate: 5.5e-05 Time to generate primes: 3.85525 348513 primes found. real 3.85 user 3.86 sys 0.00
The optExample program takes approximately 3.86 seconds to complete, with
nearly all of the time in the genPrimeSequence function. There is no point
in spending time optimizing allocateArray, because any improvements will be
negligible to the runtime of the overall program. In the examples that follow,
we focus more closely on the genPrimeSequence function and its associated
functions. The functions are reproduced here for convenience:
// helper function: checks to see if a number is prime
int isPrime(int x) {
int i;
for (i = 2; i < sqrt(x) + 1; i++) { //no prime number is less than 2
if (x % i == 0) { //if the number is divisible by i
return 0; //it is not prime
}
}
return 1; //otherwise it is prime
}
// finds the next prime
int getNextPrime(int prev) {
int next = prev + 1;
while (!isPrime(next)) { //while the number is not prime
next++; //increment and check again
}
return next;
}
// generates a sequence of primes
int genPrimeSequence(int *array, int limit) {
int i;
int len = limit;
if (len == 0) return 0;
array[0] = 2; //initialize the first number to 2
for (i = 1; i < len; i++) {
array[i] = getNextPrime(array[i-1]); //fill in the array
if (array[i] > limit) {
len = i;
return len;
}
}
return len;
}
To find hot spots in a program, focus on the areas with the most loops. Manual
inspection of code can assist in locating hot spots, though it should always be
verified with benchmarking tools prior to attempting optimization. A manual
inspection of the optExample program yields the following observations.
-
The
genPrimeSequencefunction attempts to generate all the prime numbers between 2 and some integer n. Since the number of primes between 2 and n cannot exceed n, theforloop ingenPrimeSequenceruns no more than n times. Every iteration of theforloop calls thegetNextPrimefunction once. Thus,getNextPrimeruns no more than n times. -
The
whileloop in thegetNextPrimefunction will continue running until a prime is discovered. Although it is difficult to determine the number of times thewhileloop in thegetNextPrimefunction will execute ahead of time as a function of n (the gap between consecutive prime numbers can be arbitrarily large), it is certain thatisPrimeexecutes on every iteration of thewhileloop. -
The
isPrimefunction contains exactly oneforloop. Suppose that the loop runs for a total of k iterations. Then, the code in the loop body runs k times in total. Recall that the structure of a for loop consists of an initialization statement (which initializes the loop variable to a particular value), a Boolean expression (that determines when to terminate the loop), and a step expression (that updates the loop variable every iteration). Table 1 depicts the number of times each loop component executes in aforloop that runs for k iterations. In everyforloop, initialization happens exactly once. The Boolean expression executes k + 1 times for k iterations, since it must perform one final check to terminate the loop. The loop body and the step expression execute k times each.
| Initialization statement | Boolean expression | Step expression | Loop body |
|---|---|---|---|
1 |
k+1 |
k |
k |
Our manual inspection of the code suggests that the program spends most of its
time in the isPrime function, and that the sqrt function executes the
most often. Let’s next use code profiling to verify this hypothesis.
12.1.1. Using Callgrind to Profile
In our small program, it was relatively straightforward to use manual
inspection to form the hypothesis that the sqrt function occurs in a "hot
spot" in the code. However, identifying hot spots can become more complex in
larger programs. Regardless, it is a good idea to use profiling to verify our
hypothesis. Code profiling tools like Valgrind provide a
lot of information about program execution. In this section, we use the
callgrind tool to inspect the OptExample program’s call graph.
To use callgrind, let’s start by recompiling the optExample program with
the -g flag and running callgrind on a smaller range (2 to 100,000).
Like other Valgrind applications, callgrind runs as a wrapper around a
program, adding annotations such as the number of times functions execute and
the total number of instructions that are executed as a result. Consequently,
the optExample program will take longer to execute when run in conjunction
with callgrind.
$ gcc -g -o optExample optExample.c -lm $ valgrind --tool=callgrind ./optExample 100000 ==32590== Callgrind, a call-graph generating cache profiler ==32590== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al. ==32590== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info ==32590== Command: ./optExample 100000 ==32590== ==32590== For interactive control, run 'callgrind_control -h'. Time to allocate: 0.003869 Time to generate primes: 0.644743 9592 primes found. ==32590== ==32590== Events : Ir ==32590== Collected : 68338759 ==32590== ==32590== I refs: 68,338,759
Typing ls at the terminal reveals a new file called callgrind.out.xxxxx,
where xxxxx is a unique id. In this case, the file is callgrind.out.32590
(i.e., the number shown along the left-hand column in the preceding output).
Running callgrind_annotate on this file yields additional information on the
three functions of interest:
$ callgrind_annotate --auto=yes callgrind.out.32590
----------------------------------------------------------------
Profile data file 'callgrind.out.32393' (creator: callgrind-3.11.0)
----------------------------------------------------------------
...
. //helper function: checks to see if a number is prime
400,004 int isPrime(int x) {
. int i;
36,047,657 for (i = 2; i < sqrt(x)+1; i++) { //no prime is less than 2
13,826,015 => ???:sqrt (2765204x)
16,533,672 if (x % i == 0) { //if the number is divisible by i
180,818 return 0; //it is not prime
. }
. }
9,592 return 1; //otherwise it is prime
200,002 }
.
. // finds the next prime
38,368 int getNextPrime(int prev) {
28,776 int next = prev + 1;
509,597 while (!isPrime(next)) { //while the number is not prime
67,198,556 => optExample.c:isPrime (100001x)
90,409 next++; //increment and check again
. }
9,592 return next;
19,184 }
.
. // generates a sequence of primes
6 int genPrimeSequence(int * array, int limit) {
. int i;
2 int len = limit;
2 if (len == 0) return 0;
2 array[0]=2; //initialize the first number to 2
38,369 for (i = 1; i < len; i++) {
143,880 array[i] = getNextPrime(array[i-1]); //fill in the array
67,894,482 => optExample.c:getNextPrime (9592x)
76,736 if (array[i] > limit){
2 len = i;
2 return len;
. }
. }
. return len;
4 }
The numbers along the left-hand column represent the number of total executed
instructions associated with each line. The numbers in parentheses indicate
the number of times a particular function was run. Using the numbers along the
left-hand column, we are able to verify the results of our manual inspection.
In the genPrimeSequence function, the getNextPrime function resulted in
the most number of executed instructions at 67.8 million instructions,
corresponding to 9,592 function calls (to generate the primes between 2 and
100,000). Inspecting getNextPrime reveals that the majority of those
instructions (67.1 million, or 99%) result from the call to isPrime, which
is called a total of 100,001 times. Lastly, inspecting isPrime reveals
that 13 million of the total instructions (20.5%) result from the sqrt
function, which executes a total of 2.7 million times.
These results verify our original hypothesis that the program spends most of
its time in the isPrime function, with the sqrt function executing the
most frequently of all the functions. Reducing the total number of executed
instructions results in a faster program; the above analysis suggests that
our initial efforts should concentrate on improving the isPrime
function, and potentially reducing the number of times sqrt executes.
12.1.2. Loop-Invariant Code Motion
Loop-invariant code motion is an optimization technique that moves static
computations that occur inside a loop to outside the loop without affecting the
loop’s behavior. Optimizing compilers are capable of making most loop-invariant
code optimizations automatically. Specifically, the -fmove-loop-invariants
compiler flag in GCC (enabled at level -O1) attempts to identify examples of
loop-invariant code motion and move them outside their respective loop.
However, the compiler cannot always identify cases of loop-invariant code
motion, especially in the case of function calls. Since function calls can
inadvertently cause side effects (unintended behavior), most compilers will
avoid trying to determine whether a function call consistently returns the same
result. Thus, even though the programmer knows that sqrt(x) always returns the
square root of some input x, GCC will not always make that assumption.
Consider the case where the sqrt function updates a secret global
variable, g. In that case, calling sqrt once outside of the function
(one update to g) is not the same as calling it every iteration of the loop
(n updates to g). If a compiler cannot determine that a function always
returns the same result, it will not automatically move the sqrt function
outside the loop.
However, the programmer knows that moving the computation sqrt(x) + 1 outside
the for loop does not effect the loop’s behavior. The updated function is
shown here and is available in the
following file:
//helper function: checks to see if a number is prime
int isPrime(int x) {
int i;
int max = sqrt(x)+1;
for (i = 2; i < max; i++) { //no prime number is less than 2
if (x % i == 0) { //if the number is divisible by i
return 0; //it is not prime
}
}
return 1; //otherwise it is prime
}
Table 2 shows that this simple change shaves off a full two seconds (47%)
of the runtime of optExample2, even before using compiler flags.
Furthermore, the compiler seems to have a slightly easier time optimizing
optExample2.
| Version | Unoptimized | -O1 |
-O2 |
-O3 |
|---|---|---|---|---|
Original |
3.86 |
2.32 |
2.14 |
2.15 |
With loop-invariant code motion |
1.83 |
1.63 |
1.71 |
1.63 |
Rerunning callgrind on the optExample2 executable reveals why such a large
improvement in runtime was observed. The
following code snippet assumes that the file callgrind.out.30086 contains the annotations of running callgrind on the optExample2 executable:
$ gcc -g -o optExample2 optExample2.c -lm
$ valgrind --tool=callgrind ./optExample2 100000
$ callgrind_annotate --auto=yes callgrind.out.30086
------------------------------------------------------------------
Profile data file 'callgrind.out.30086' (creator: callgrind-3.11.0)
------------------------------------------------------------------
...
400,004 int isPrime(int x) {
. int i;
900,013 int max = sqrt(x)+1;
500,000 => ???:sqrt (100001x)
11,122,449 for (i = 2; i < max; i++) { //no prime number is less than 2
16,476,120 if (x % i == 0) { //if the number is divisible by i
180,818 return 0; //it is not prime
. }
. }
9,592 return 1; //otherwise it is prime
200,002 }
.
. // finds the next prime
38,368 int getNextPrime(int prev) {
28,776 int next = prev + 1;
509,597 while (!isPrime(next)) { //while the number is not prime
29,789,794 => optExample2.c:isPrime (100001x)
90,409 next++; //increment and check again
. }
9,592 return next;
19,184 }
Moving the call to sqrt outside of the for loop reduces the number of
times the sqrt function is called in the program from 2.7 million to
100,000 (96% reduction). This number corresponds to the number of times the
isPrime function is called, confirming that the sqrt function executes
only once with every invocation of the isPrime function.
Note that the compiler was able to perform significant levels of optimization
when optimization flags were specified, even if the programmer does not manually
perform code motion. In this case, the reason is due to a special instruction
called fsqrt that is specified by the x86 ISA. When optimization flags are
turned on, the compiler replaces all instances of the sqrt function with
the fsqrt instruction. This process is known as inlining, and we cover it
greater detail in the following section. Because fsqrt is no longer a function,
it is easier for the compiler to identify its loop-invariant nature, and move
it outside the body of the loop.