12.2. Other compiler optimizations: Loop unrolling and Function In-Lining
The loop-invariant code motion optimization described in the last section was a simple change that resulted in a massive reduction of execution time. However, such optimizations are situationally dependent, and may not always result in improvements to performance. In most cases, loop-invariant code motion is taken care of by the compiler.
Code today is more often read than it is written. In most cases, fractional performance gains are not worth the hit to code readability. In general, a programmer should let the compiler optimize whenever possible. In this section, we cover some optimization techniques that were previously manually implemented by programmers, but are today commonly implemented by compilers.
There are several sources on-line that advocate for the manual implementation of the techniques described below. However, we encourage readers to check if their compilers support the following optimizations, before attempting to manually implement them in their code. All the optimizations described in this section are implemented in gcc, but may not be available in older compilers.
12.2.1. Function in-lining
One optimization step that compilers attempt to perform is function in-lining. Function in-lining replaces calls to a function with the body of the function. For example, in the
main()
function, a compiler in-lining the allocateArray()
function will replace the call to allocateArray()
with a direct call to malloc()
:
Original Version | Version with allocateArray() in-lined |
---|---|
|
|
In-lining functions can result in some run-time savings for a program. Recall that every time a program
calls a function, many instructions associated with function creation and destruction are necessarily generated.
In-lining functions enables the compiler to eliminate these excessive calls, and makes it easier for the compiler to
identify other potential improvements, including constant propagation, constant folding and dead code elimination.
In the case of the optExample
program, in-lining likely allows the compiler to replace
the call to sqrt()
with the fsqrt
instruction, and subsequently move it outside the loop.
The -finline-functions
flag suggests to gcc that functions should be in-lined. This optimization is turned on at level 3. While -finline-functions
can be used independently of
the -O3
flag, it is a suggestion to the compiler to look for functions to in-line. Likewise the static inline
keyword can
be used to suggest to the compiler that a particular function should be in-lined. Keep in mind that the compiler will not in-line all functions,
and that function in-lining is not guaranteed to make code faster.
Programmers should generally avoid in-lining functions manually. In-lining functions carries a high risk of significantly reducing the readability of
code, increasing the likelihood of errors, and making it harder to update and maintain functions. For example, trying to in-line the isPrime()
function
in the getNextPrime()
function will greatly reduce the readability of getNextPrime()
.
12.2.2. Loop Unrolling
The last compiler optimization strategy we discuss in this section is loop unrolling. Let’s revisit the isPrime()
function:
// helper function: checks to see if a number is prime
int isPrime(int x) {
int i;
int max = sqrt(x) + 1;
// no prime number is less than 2
for (i = 2; i < max; i++) {
// if the number is divisible by i
if (x % i == 0) {
return 0; // it's not prime
}
}
return 1; // otherwise it is
}
The for
-loop executes a total of max
times, where max
is one more than the square root of integer x
. At the assembly level, every
execution of the loop checks to see i
is less than max
. If so, the instruction pointer jumps to
the body of the loop which computes the modulo operation. If the modulo operation results in 0, the program immediately exits the loop
and returns 0. Otherwise, the loop continues execution. While branch predictors are fairly good at predicting what a conditional expression
evaluates to (especially inside loops), wrong guesses can result in a hit to performance, due to disruptions in the instruction pipeline.
Loop unrolling is an optimization that compilers perform to reduce the impact of wrong guesses. In loop unrolling, the goal is to reduce the number of iterations of a loop by a factor of n by increasing the workload that each iteration performs by a factor of n. When a loop is unrolled by a factor of 2, the number of iterations in the loop is cut by half while the amount work performed per iteration is doubled.
Let’s manually apply 2-factor loop unrolling to our isPrime()
function (available in optExample3.c):
// helper function: checks to see if a number is prime
int isPrime(int x) {
int i;
int max = sqrt(x)+1;
// no prime number is less than 2
for (i = 2; i < max; i+=2) {
// if the number is divisible by i or i+1
if ( (x % i == 0) || (x % (i+1) == 0) ) {
return 0; // it's not prime
}
}
return 1; // otherwise it is
}
Notice that while we have halved the number of iterations that the for loop takes, each iteration of the loop now performs two modulo checks,
doubling the amount of work per iteration. Recompiling and re-running the program results in marginally improved times (see Table 2). The
readability of the code is also reduced. A better way to utilize loop unrolling is to invoke the -funroll-loops
compiler optimization
flag, which tells the compiler to unroll loops whose iterations can be determined at compile time. The -funroll-all-loops
compiler flag is
a more aggressive option, and unrolls all loops whether or not the compiler is certain of the number of iterations. Table 2 shows the
run-times of the manual 2-factor loop unrolling (available in optExample3.c
) compared to adding the -funroll-loops
and -funroll-all-loops
compiler optimization flags to the previous program (optExample2.c).
Version | Unoptimized | -O1 |
-O2 |
-O3 |
---|---|---|---|---|
Original (optExample.c) |
3.86 |
2.32 |
2.14 |
2.15 |
With loop-invariant code motion (optExample2.c) |
1.83 |
1.63 |
1.71 |
1.63 |
With manual 2-factor loop unrolling (optExample3.c) |
1.65 |
1.53 |
1.45 |
1.45 |
With |
1.82 |
1.48 |
1.46 |
1.46 |
With |
1.81 |
1.47 |
1.47 |
1.46 |
While manual loop unrolling does result in some performance improvement, the compiler’s built-in loop unrolling flags when combined with the other optimization flags yield comparably good performance. If a programmer wants to incorporate loop unrolling optimizations into their code, they should default to using the appropriate compiler flags, and not manually unroll loops themselves.