12.2. Other Compiler Optimizations: Loop Unrolling and Function Inlining

The loop-invariant code motion optimization described in the previous section was a simple change that resulted in a massive reduction in execution time. However, such optimizations are situationally dependent, and may not always result in improvements to performance. In most cases, loop-invariant code motion is taken care of by the compiler.

Code today is more often read than it is written. In most cases, fractional performance gains are not worth the hit to code readability. In general, a programmer should let the compiler optimize whenever possible. In this section, we cover some optimization techniques that were previously manually implemented by programmers but are today commonly implemented by compilers.

There are several sources online that advocate for the manual implementation of the techniques we describe in the following sections. However, we encourage readers to check whether their compilers support the following optimizations before attempting to manually implement them in their code. All the optimizations described in this section are implemented in GCC, but may not be available in older compilers.

12.2.1. Function Inlining

One optimization step that compilers attempt to perform is function inlining, which replaces calls to a function with the body of the function. For example, in the main function, a compiler inlining the allocateArray function will replace the call to allocateArray with a direct call to malloc:

Table 1. Example of compiler inlining the allocateArray function.
Original Version Version with allocateArray in-lined
int main(int argc, char **argv) {
    // omitted for brevity
    // some variables shortened for space considerations
    int lim = strtol(argv[1], NULL, 10);

    // allocation of array
    int *a = allocateArray(lim);

    // generates sequence of primes
    int len = genPrimeSequence(a, lim);

    return 0;
}
int main(int argc, char **argv) {
    // omitted for brevity
    // some variables shortened for space considerations
    int lim = strtol(argv[1], NULL, 10);

    // allocation of array (in-lined)
    int *a = malloc(lim * sizeof(int));

    // generates sequence of primes
    int len = genPrimeSequence(a, lim);

    return 0;
}

Inlining functions can result in some runtime savings for a program. Recall that every time a program calls a function, many instructions associated with function creation and destruction are necessarily generated. Inlining functions enables the compiler to eliminate these excessive calls, and makes it easier for the compiler to identify other potential improvements, including constant propagation, constant folding, and dead code elimination. In the case of the optExample program, inlining likely allows the compiler to replace the call to sqrt with the fsqrt instruction, and subsequently move it outside the loop.

The -finline-functions flag suggests to GCC that functions should be inlined. This optimization is turned on at level 3. Even though -finline-functions can be used independently of the -O3 flag, it is a suggestion to the compiler to look for functions to inline. Likewise, the static inline keyword can be used to suggest to the compiler that a particular function should be inlined. Keep in mind that the compiler will not inline all functions, and that function inlining is not guaranteed to make code faster.

Programmers should generally avoid inlining functions manually. Inlining functions carries a high risk of significantly reducing the readability of code, increasing the likelihood of errors, and making it harder to update and maintain functions. For example, trying to inline the isPrime function in the getNextPrime function will greatly reduce the readability of getNextPrime.

12.2.2. Loop Unrolling

The last compiler optimization strategy we discuss in this section is loop unrolling. Let’s revisit the isPrime function:

// helper function: checks to see if a number is prime
int isPrime(int x) {
    int i;
    int max = sqrt(x) + 1;

    // no prime number is less than 2
    for (i = 2; i < max; i++) {
        // if the number is divisible by i
        if (x % i == 0) {
            return 0; // it's not prime
        }
    }
    return 1; // otherwise it is
}

The for loop executes a total of max times, where max is one more than the square root of integer x. At the assembly level, every execution of the loop checks to see whether i is less than max. If so, the instruction pointer jumps to the body of the loop, which computes the modulo operation. If the modulo operation results in 0, the program immediately exits the loop and returns 0. Otherwise, the loop continues execution. While branch predictors are fairly good at predicting what a conditional expression evaluates to (especially inside loops), wrong guesses can result in a hit to performance, due to disruptions in the instruction pipeline.

Loop unrolling is an optimization that compilers perform to reduce the impact of wrong guesses. In loop unrolling, the goal is to reduce the number of iterations of a loop by a factor of n by increasing the workload that each iteration performs by a factor of n. When a loop is unrolled by a factor of 2, the number of iterations in the loop is cut by half, whereas the amount work performed per iteration is doubled.

Let’s manually apply 2-factor loop unrolling to our isPrime function (available in optExample3.c):

// helper function: checks to see if a number is prime
int isPrime(int x) {
    int i;
    int max = sqrt(x)+1;

    // no prime number is less than 2
    for (i = 2; i < max; i+=2) {
        // if the number is divisible by i or i+1
        if ( (x % i == 0) || (x % (i+1) == 0) ) {
            return 0; // it's not prime
        }
    }
    return 1; // otherwise it is
}

Notice that even though we have halved the number of iterations that the for loop takes, each iteration of the loop now performs two modulo checks, doubling the amount of work per iteration. Recompiling and rerunning the program results in marginally improved times (see Table 2). The readability of the code is also reduced. A better way to utilize loop unrolling is to invoke the -funroll-loops compiler optimization flag, which tells the compiler to unroll loops whose iterations can be determined at compile time. The -funroll-all-loops compiler flag is a more aggressive option that unrolls all loops whether or not the compiler is certain of the number of iterations. Table 2 shows the runtimes of the manual 2-factor loop unrolling (available in optExample3.c) compared to adding the -funroll-loops and -funroll-all-loops compiler optimization flags to the previous program (optExample2.c).

Table 2. Time in Seconds to Produce 5,000,000 Prime Numbers
Version Unoptimized -O1 -O2 -O3

Original (optExample.c)

3.86

2.32

2.14

2.15

With loop-invariant code motion (optExample2.c)

1.83

1.63

1.71

1.63

With manual 2-factor loop unrolling (optExample3.c)

1.65

1.53

1.45

1.45

With -funroll-loops (optExample2.c)

1.82

1.48

1.46

1.46

With -funroll-all-loops (optExample2.c)

1.81

1.47

1.47

1.46

While manual loop unrolling does result in some performance improvement, the compiler’s built-in loop unrolling flags when combined with the other optimization flags yield comparable performance. If a programmer wants to incorporate loop unrolling optimizations into their code, they should default to using the appropriate compiler flags, and not manually unroll loops themselves.