12.2. Other compiler optimizations: Loop unrolling and Function In-Lining

The loop-invariant code motion optimization described in the last section was a simple change that resulted in a massive reduction of execution time. However, such optimizations are situationally dependent, and may not always result in improvements to performance. In most cases, loop-invariant code motion is taken care of by the compiler.

Code today is more often read than it is written. In most cases, fractional performance gains are not worth the hit to code readability. In general, a programmer should let the compiler optimize whenever possible. In this section, we cover some optimization techniques that were previously manually implemented by programmers, but are today commonly implemented by compilers.

There are several sources on-line that advocate for the manual implementation of the techniques described below. However, we encourage readers to check if their compilers support the following optimizations, before attempting to manually implement them in their code. All the optimizations described in this section are implemented in gcc, but may not be available in older compilers.

12.2.1. Function in-lining

One optimization step that compilers attempt to perform is function in-lining. Function in-lining replaces calls to a function with the body of the function. For example, in the main() function, a compiler in-lining the allocateArray() function will replace the call to allocateArray() with a direct call to malloc():

Table 1. Example of compiler in-lining the allocateArray() function.
Original Version Version with allocateArray() in-lined
int main(int argc, char **argv) {
    //omitted for brevity
    //some variables shortened for
    //space considerations
    int lim = strtol(argv[1], NULL, 10);

    //allocation of array
    int *a = allocateArray(lim);

    //generates sequence of primes
    int len = genPrimeSequence(a, lim);

    return 0;
}
int main(int argc, char **argv) {
    //omitted for brevity
    //some variables shortened for
    //space considerations
    int lim = strtol(argv[1], NULL, 10);

    //allocation of array (in-lined)
    int *a = malloc(lim * sizeof(int));

    //generates sequence of primes
    int len = genPrimeSequence(a, lim);

    return 0;
}

In-lining functions can result in some run-time savings for a program. Recall that every time a program calls a function, many instructions associated with function creation and destruction are necessarily generated. In-lining functions enables the compiler to eliminate these excessive calls, and makes it easier for the compiler to identify other potential improvements, including constant propagation, constant folding and dead code elimination. In the case of the optExample program, in-lining likely allows the compiler to replace the call to sqrt() with the fsqrt instruction, and subsequently move it outside the loop.

The -finline-functions flag suggests to gcc that functions should be in-lined. This optimization is turned on at level 3. While -finline-functions can be used independently of the -O3 flag, it is a suggestion to the compiler to look for functions to in-line. Likewise the static inline keyword can be used to suggest to the compiler that a particular function should be in-lined. Keep in mind that the compiler will not in-line all functions, and that function in-lining is not guaranteed to make code faster.

Programmers should generally avoid in-lining functions manually. In-lining functions carries a high risk of significantly reducing the readability of code, increasing the likelihood of errors, and making it harder to update and maintain functions. For example, trying to in-line the isPrime() function in the getNextPrime() function will greatly reduce the readability of getNextPrime().

12.2.2. Loop Unrolling

The last compiler optimization strategy we discuss in this section is loop unrolling. Let’s revisit the isPrime() function:

//helper function: checks to see if a number is prime
int isPrime(int x) {
    int i;
    int max = sqrt(x) + 1;
    for (i = 2; i < max; i++) { //no prime number is less than 2
        if (x % i == 0) { //if the number is divisible by i
            return 0; //it is not prime
        }
    }
    return 1; //otherwise it is
}

The for-loop executes a total of max times, where max is one more than the square root of integer x. At the assembly level, every execution of the loop checks to see i is less than max. If so, the instruction pointer jumps to the body of the loop which computes the modulo operation. If the modulo operation results in 0, the program immediately exits the loop and returns 0. Otherwise, the loop continues execution. While branch predictors are fairly good at predicting what a conditional expression evaluates to (especially inside loops), wrong guesses can result in a hit to performance, due to disruptions in the instruction pipeline.

Loop unrolling is an optimization that compilers perform to reduce the impact of wrong guesses. In loop unrolling, the goal is to reduce the number of iterations of a loop by a factor of n by increasing the workload that each iteration performs by a factor of n. When a loop is unrolled by a factor of 2, the number of iterations in the loop is cut by half while the amount work performed per iteration is doubled.

Let’s manually apply 2-factor loop unrolling to our isPrime() function (available in optExample3.c):

//helper function: checks to see if a number is prime
int isPrime(int x) {
    int i;
    int max = sqrt(x)+1;
    for (i = 2; i < max; i+=2) { //no prime number is less than 2
        if ( (x % i == 0) || (x % (i+1) == 0) ) { //if the number is divisible by i or i+1
            return 0; //it is not prime
        }
    }
    return 1; //otherwise it is
}

Notice that while we have halved the number of iterations that the for loop takes, each iteration of the loop now performs two modulo checks, doubling the amount of work per iteration. Recompiling and re-running the program results in marginally improved times (see Table 2). The readability of the code is also reduced. A better way to utilize loop unrolling is to invoke the -funroll-loops compiler optimization flag, which tells the compiler to unroll loops whose iterations can be determined at compile time. The -funroll-all-loops compiler flag is a more aggressive option, and unrolls all loops whether or not the compiler is certain of the number of iterations. Table 2 shows the run-times of the manual 2-factor loop unrolling (available in optExample3.c) compared to adding the -funroll-loops and -funroll-all-loops compiler optimization flags to the previous program (optExample2.c).

Table 2. Time to produce 500,000 prime numbers (in seconds)
Version Unoptimized -O1 -O2 -O3

Original (optExample.c)

3.86

2.32

2.14

2.15

With loop-invariant code motion (optExample2.c)

1.83

1.63

1.71

1.63

With manual 2-factor loop unrolling (optExample3.c)

1.65

1.53

1.45

1.45

With -funroll-loops (optExample2.c)

1.82

1.48

1.46

1.46

With -funroll-all-loops (optExample2.c)

1.81

1.47

1.47

1.46

While manual loop unrolling does result in some performance improvement, the compiler’s built-in loop unrolling flags when combined with the other optimization flags yield comparably good performance. If a programmer wants to incorporate loop unrolling optimizations into their code, they should default to using the appropriate compiler flags, and not manually unroll loops themselves.