12.2. Other Compiler Optimizations: Loop Unrolling and Function Inlining
The loop-invariant code motion optimization described in the previous section was a simple change that resulted in a massive reduction in execution time. However, such optimizations are situationally dependent, and may not always result in improvements to performance. In most cases, loop-invariant code motion is taken care of by the compiler.
Code today is more often read than it is written. In most cases, fractional performance gains are not worth the hit to code readability. In general, a programmer should let the compiler optimize whenever possible. In this section, we cover some optimization techniques that were previously manually implemented by programmers but are today commonly implemented by compilers.
There are several sources online that advocate for the manual implementation of the techniques we describe in the following sections. However, we encourage readers to check whether their compilers support the following optimizations before attempting to manually implement them in their code. All the optimizations described in this section are implemented in GCC, but may not be available in older compilers.
12.2.1. Function Inlining
One optimization step that compilers attempt to perform is function
inlining, which replaces calls to a function with the body of
the function. For example, in the main
function, a compiler inlining the
allocateArray
function will replace the call to allocateArray
with a
direct call to malloc
:
Original Version | Version with allocateArray in-lined |
---|---|
|
|
Inlining functions can result in some runtime savings for a program. Recall
that every time a program calls a function, many instructions associated with
function creation and destruction are necessarily generated. Inlining
functions enables the compiler to eliminate these excessive calls, and makes it
easier for the compiler to identify other potential improvements, including
constant propagation, constant folding, and dead code elimination. In the case
of the optExample
program, inlining likely allows the compiler to replace
the call to sqrt
with the fsqrt
instruction, and subsequently move it
outside the loop.
The -finline-functions
flag suggests to GCC that functions should be
inlined. This optimization is turned on at level 3. Even though -finline-functions
can be used independently of the -O3
flag, it is a suggestion to the
compiler to look for functions to inline. Likewise, the static inline
keyword
can be used to suggest to the compiler that a particular function should be
inlined. Keep in mind that the compiler will not inline all functions, and
that function inlining is not guaranteed to make code faster.
Programmers should generally avoid inlining functions manually. Inlining
functions carries a high risk of significantly reducing the readability of
code, increasing the likelihood of errors, and making it harder to update and
maintain functions. For example, trying to inline the isPrime
function in
the getNextPrime
function will greatly reduce the readability of
getNextPrime
.
12.2.2. Loop Unrolling
The last compiler optimization strategy we discuss in this section is loop
unrolling. Let’s revisit the isPrime
function:
// helper function: checks to see if a number is prime
int isPrime(int x) {
int i;
int max = sqrt(x) + 1;
// no prime number is less than 2
for (i = 2; i < max; i++) {
// if the number is divisible by i
if (x % i == 0) {
return 0; // it's not prime
}
}
return 1; // otherwise it is
}
The for
loop executes a total of max
times, where max
is one more than
the square root of integer x
. At the assembly level, every execution of the
loop checks to see whether i
is less than max
. If so, the instruction pointer
jumps to the body of the loop, which computes the modulo operation. If the
modulo operation results in 0, the program immediately exits the loop and
returns 0. Otherwise, the loop continues execution. While branch predictors
are fairly good at predicting what a conditional expression evaluates to
(especially inside loops), wrong guesses can result in a hit to performance,
due to disruptions in the instruction pipeline.
Loop unrolling is an optimization that compilers perform to reduce the impact of wrong guesses. In loop unrolling, the goal is to reduce the number of iterations of a loop by a factor of n by increasing the workload that each iteration performs by a factor of n. When a loop is unrolled by a factor of 2, the number of iterations in the loop is cut by half, whereas the amount work performed per iteration is doubled.
Let’s manually apply 2-factor loop unrolling to our isPrime
function
(available in optExample3.c):
// helper function: checks to see if a number is prime
int isPrime(int x) {
int i;
int max = sqrt(x)+1;
// no prime number is less than 2
for (i = 2; i < max; i+=2) {
// if the number is divisible by i or i+1
if ( (x % i == 0) || (x % (i+1) == 0) ) {
return 0; // it's not prime
}
}
return 1; // otherwise it is
}
Notice that even though we have halved the number of iterations that the for
loop
takes, each iteration of the loop now performs two modulo checks, doubling the
amount of work per iteration. Recompiling and rerunning the program results in
marginally improved times (see Table 2). The readability of the code is
also reduced. A better way to utilize loop unrolling is to invoke the
-funroll-loops
compiler optimization flag, which tells the compiler to unroll
loops whose iterations can be determined at compile time. The
-funroll-all-loops
compiler flag is a more aggressive option that unrolls all
loops whether or not the compiler is certain of the number of iterations.
Table 2 shows the runtimes of the manual 2-factor loop unrolling
(available in optExample3.c
) compared to adding the -funroll-loops
and
-funroll-all-loops
compiler optimization flags to the previous program
(optExample2.c).
Version | Unoptimized | -O1 |
-O2 |
-O3 |
---|---|---|---|---|
Original (optExample.c) |
3.86 |
2.32 |
2.14 |
2.15 |
With loop-invariant code motion (optExample2.c) |
1.83 |
1.63 |
1.71 |
1.63 |
With manual 2-factor loop unrolling (optExample3.c) |
1.65 |
1.53 |
1.45 |
1.45 |
With |
1.82 |
1.48 |
1.46 |
1.46 |
With |
1.81 |
1.47 |
1.47 |
1.46 |
While manual loop unrolling does result in some performance improvement, the compiler’s built-in loop unrolling flags when combined with the other optimization flags yield comparable performance. If a programmer wants to incorporate loop unrolling optimizations into their code, they should default to using the appropriate compiler flags, and not manually unroll loops themselves.