14.7. Implicit Threading with OpenMP

Thus far, we have presented shared memory programming using POSIX threads. While Pthreads are great for simple applications, they become increasingly difficult to use as programs grow more complex. POSIX threads are an example of explicit parallel programming, requiring a programmer to specify exactly what each thread is required to do and when each thread should start and stop.

With Pthreads, it can also be challenging to incrementally add parallelism to an existing sequential program. That is, one must often rewrite the program entirely to use threads, which is rarely desirable when attempting to parallelize a large, existing code base.

The Open Multiprocessing (OpenMP) library implements an implicit alternative to Pthreads. OpenMP is built into gcc and other popular compilers such as LLVM and Clang, and can be used with the C, C++, and Fortran programming languages. A key advantage of OpenMP is that it enables programmers to parallelize components of existing, sequential C code by adding pragmas (which are special compiler directives) to parts of the code. Pragmas specific to OpenMP begin with #pragma omp.

While detailed coverage of OpenMP is beyond the scope of this book, we cover some common pragmas and show how several can be used in the context of some sample applications.

14.7.1. Common pragmas

Here are some of the most commonly used pragmas in OpenMP programs:

#pragma omp parallel

This pragma creates a team of threads and has each thread run the code in its scope (usually a function call). An invocation of this pragma is usually equivalent to the pthread_create() and pthread_join() function pairing from our original discussion of Pthreads. The pragma may have a number of clauses, including the following:

  • num_threads: specifies the number of threads to create.

  • private: a list of variables that should be private (or local) to each thread. Variables that should be private to a thread can also be declared within the scope of the pragma (see below for an example). Each thread gets its own copy of each private variable.

  • shared: a list of variables that should be shared amongst the threads. There is one copy of each shared variable, and it is shared amongst all threads.

  • default: indicates whether the determination of which variables are shared and which are private should be left up to the compiler. In most cases, we want to use default(none) and specify explicitly which variables should be shared and which should be private (a brief sketch illustrating these clauses follows this list).
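
To make these clauses concrete, here is a minimal sketch of our own (it is not part of any program in this chapter) that combines num_threads, default(none), shared, and private in one parallel region; the variable names total and myid are placeholders:

#include <stdio.h>
#include <omp.h>

int main( void ) {
    long myid;         //listed in private: each thread gets its own copy
    long total = 100;  //listed in shared: one copy visible to all threads

    //create four threads; default(none) forces us to list every variable
    #pragma omp parallel num_threads(4) default(none) shared(total) private(myid)
    {
        myid = omp_get_thread_num();  //writing a private variable is safe
        printf( "thread %ld sees total = %ld\n", myid, total );
    }

    return 0;
}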

#pragma omp for

Specifies that each thread execute a subset of the iterations of a for loop. While the scheduling of the iterations is up to the system, the default is usually the "chunking" method first discussed in the scalar multiplication example. This is a static form of scheduling: each thread gets an assigned chunk of iterations and then processes the iterations in its chunk. However, OpenMP also makes dynamic scheduling easy. In dynamic scheduling, each thread receives a number of iterations and requests a new set upon completing the ones it was assigned. The scheduling policy can be set using the following clause:

  • schedule(dynamic): specifies that a dynamic form of scheduling should be used (a short sketch illustrating the clause follows). While this is advantageous in some cases, the static (default) form of scheduling is usually faster.
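
As a rough illustration of the schedule clause, consider the sketch below; the work() function and the bound N are placeholders of our own, standing in for a loop whose iterations take unpredictable amounts of time:

#define N 1000

//placeholder for a loop body whose cost varies from iteration to iteration
void work( int i ) {
    //... some amount of computation that depends on i ...
}

void process_all( void ) {
    int i;

    #pragma omp parallel
    {
        //threads grab batches of iterations as they finish, rather than
        //receiving one fixed chunk up front
        #pragma omp for schedule(dynamic)
        for (i = 0; i < N; i++) {
            work(i);
        }
    }
}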

#pragma omp parallel for

This pragma is a combination of the omp parallel and the omp for pragmas. Unlike the omp for pragma, the omp parallel for pragma also generates a team of threads before assigning each thread a set of iterations of the loop.

#pragma omp critical

This pragma is used to specify that the code under its scope should be treated as a critical section — that is, only one thread should execute the section of code at a time to ensure correct behavior.
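
For instance, the short sketch below (ours, not code from this chapter's programs) combines omp parallel for with omp critical to total the elements of an array; the critical section guarantees that only one thread at a time updates the shared variable sum:

#include <stdio.h>

#define N 100000

int main( void ) {
    static int values[N]; //static arrays are zero-initialized
    long sum = 0;
    int i;

    //each thread executes a subset of the loop's iterations
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        //only one thread at a time may update the shared sum
        #pragma omp critical
        {
            sum += values[i];
        }
    }

    printf( "sum = %ld\n", sum );
    return 0;
}

Note that placing a critical section inside a hot loop serializes the additions and gives up much of the parallelism; the countElems() function later in this section shows the more efficient pattern of accumulating into per-thread local storage and entering the critical section only once.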

There are also several library functions that a thread can call during execution. Three common ones are described below, followed by a short sketch that uses them together:

omp_get_num_threads()

returns the number of threads in the team that is currently executing.

omp_set_num_threads()

sets the number of threads that a team should have.

omp_get_thread_num()

returns the identifier of the calling thread.
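
The short sketch below (again, our own example rather than code from this chapter) uses all three functions together: it requests a team of four threads and divides the indices of an array amongst the threads by hand, in round-robin fashion:

#include <stdio.h>
#include <omp.h>

#define N 16

int main( void ) {
    int data[N], j;

    omp_set_num_threads(4); //request a team of four threads

    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads(); //size of the current team
        int myid = omp_get_thread_num();      //this thread's id (0 to nthreads-1)
        int i;

        //each thread fills every nthreads-th element, starting at its own id
        for (i = myid; i < N; i += nthreads) {
            data[i] = myid;
        }
    }

    for (j = 0; j < N; j++) {
        printf( "%d ", data[j] ); //print which thread filled each element
    }
    printf( "\n" );

    return 0;
}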

The omp parallel for directive only works with for loops!

Keep in mind that the omp parallel for pragma ONLY works with for loops. Other types of loops, such as while loops and do-while loops, are not supported (though such loops can often be rewritten as equivalent for loops, as sketched below).
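
When the work naturally lives in a while loop, it can often be recast as a for loop with an explicit iteration count so that the pragma applies. The sketch below illustrates the idea using a hypothetical process() function and bound N of our own:

#define N 1000

//hypothetical per-item work
void process( long i ) {
    //...
}

//NOT parallelizable with omp parallel for: the loop is a while loop
void serial_version( void ) {
    long i = 0;
    while (i < N) {
        process(i);
        i++;
    }
}

//equivalent for loop, which the pragma accepts
void parallel_version( void ) {
    long i;

    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        process(i);
    }
}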

14.7.2. Hello Threading: OpenMP flavored

Let’s revisit our "Hello World" (hellothreads.c) program, now using OpenMP instead of Pthreads:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void HelloWorld( void ) {
    long myid = omp_get_thread_num(); //get the calling thread's unique id
    printf( "Hello world! I am thread %ld\n", myid );
}

int main( int argc, char** argv ) {
    long nthreads;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <n>\n", argv[0]);
        fprintf(stderr, "where <n> is the number of threads\n");
        return 1;
    }

    nthreads = strtol( argv[1], NULL, 10 ); //convert argument to number of threads

    //create a team of nthreads threads, each of which calls HelloWorld()
    #pragma omp parallel num_threads(nthreads)
        HelloWorld();

    return 0;
}

Note that the OpenMP program is much shorter than the Pthreads version. To access the OpenMP library functions, we include the header file omp.h. The omp parallel num_threads(nthreads) pragma in main() creates a set of threads, each of which calls the HelloWorld() function. The clause num_threads(nthreads) specifies that a total of nthreads threads should be generated. The pragma also joins each created thread back to a single-threaded process. In other words, all the low-level work of creating and joining threads is abstracted away from the programmer and is accomplished with the inclusion of just one pragma. For this reason, OpenMP is considered an implicit threading library.

OpenMP also abstracts away the need to explicitly manage thread ids. In the context of HelloWorld(), the omp_get_thread_num() function extracts the unique id associated with the thread that is running it.

Compiling the code

Let’s compile and run this program by passing the -fopenmp flag to the compiler, which signals that we’re compiling with OpenMP:

$ gcc -o hello_mp hello_mp.c -fopenmp

$ ./hello_mp 4
Hello world! I am thread 2
Hello world! I am thread 3
Hello world! I am thread 0
Hello world! I am thread 1

Since the order in which the threads execute can change from run to run, re-running this program may result in a different sequence of messages:

$ ./hello_mp 4
Hello world! I am thread 3
Hello world! I am thread 2
Hello world! I am thread 1
Hello world! I am thread 0

This behavior is consistent with our earlier Pthreads example.

14.7.3. A more complex example: CountSort in OpenMP

A powerful advantage of OpenMP is that it enables programmers to incrementally parallelize their code. To see this in action, let’s parallelize the more complex CountSort algorithm discussed earlier in this chapter (the serial code is located here: countSort.c). Recall that this algorithm sorts arrays containing a small range of values. The main function of the serial program looks like the following:

int main( int argc, char **argv ) {
    //parse args (omitted for brevity)

    srand(10); //use of static seed ensures the output is the same every run

    //generate random array of elements of specified length (omitted for brevity)

    //allocate the counts array and initialize all elements to zero
    int counts[MAX] = {0};

    countElems(counts, array, length); //calls step 1
    writeArray(counts, array); //calls step 2

    free(array); //free memory

    return 0;
}

The main() function, after doing some command line parsing and generating a random array, calls the countElems() function followed by the writeArray() function.

Parallelizing CountElems using OpenMP

There are several ways to parallelize the above program. One way (shown below) uses the omp parallel pragma in the context of the countElems() and writeArray() functions. As a result, no changes need to be made to the main() function. A full version of the program can be accessed in countSort_mp.c.

First, let’s examine how to parallelize the countElems() function using OpenMP:

void countElems(int *counts, int *array, long length) {

    #pragma omp parallel default(none) shared(counts, array, length)
    {
        int val, i, local[MAX] = {0};
        #pragma omp for
        for (i = 0; i < length; i++) {
            val = array[i];
            local[val]++;
        }

        #pragma omp critical
        {
            for (i = 0; i < MAX; i++) {
                counts[i] += local[i];
            }
        }
    }
}

In this version of the code, there are three pragmas employed:

  • The #pragma omp parallel indicates that a team of threads should be created. The omp_set_num_threads(nthreads) line in main() sets the default size of the thread team to nthreads (a sketch of the modified main() appears after this list). If the omp_set_num_threads() function is not used, then the number of threads assigned will equal the number of cores on the system. As a reminder, the omp parallel pragma implicitly creates threads at the beginning of the block and joins them at the end of the block. Braces ({}) are used to specify scope. The shared clause declares that the variables counts, array, and length are shared (global) amongst all the threads. In contrast, the variables val, i, and local[MAX] are declared within the scope of the pragma, so each thread gets its own private copy.

  • The next pragma is #pragma omp for, which parallelizes the for loop by splitting its iterations amongst the threads. OpenMP calculates how best to split up the iterations of the loop. As previously mentioned, the default strategy is usually a chunking method, in which each thread gets roughly the same number of iterations to compute. Thus, each thread reads a portion of the shared array array and accumulates its counts in its local array local.

  • The #pragma omp critical indicates that the code in the scope of the critical section should be executed by exactly one thread at a time. This is equivalent to the mutex that was employed in the Pthreads version of this program. Here, each thread adds the values in its local array to the shared counts array, one thread at a time.
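
For reference, here is a sketch of how main() might change from the serial version shown earlier; aside from the call to omp_set_num_threads(), the function is unchanged (this is our reconstruction of the idea, and countSort_mp.c may differ in its details):

int main( int argc, char **argv ) {
    //parse args, including length and nthreads (omitted for brevity)

    srand(10); //use of static seed ensures the output is the same every run

    //generate random array of elements of specified length (omitted for brevity)

    //allocate the counts array and initialize all elements to zero
    int counts[MAX] = {0};

    omp_set_num_threads(nthreads); //set the default team size for the pragmas

    countElems(counts, array, length); //step 1 (now parallelized with OpenMP)
    writeArray(counts, array); //step 2 (now parallelized with OpenMP)

    free(array); //free memory

    return 0;
}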

Let’s get a sense of the performance of this function by running it on 100 million elements:

$ ./countElems_mp 100000000 1
Run Time for Phase 1 is 0.249893

$ ./countElems_mp 100000000 2
Run Time for Phase 1 is 0.124462

$ ./countElems_mp 100000000 4
Run Time for Phase 1 is 0.068749

This is excellent performance, with our function getting a speedup of 2 on two threads and a speedup of 3.63 on four threads. We get even better performance than the Pthreads implementation!

The writeArray() function in OpenMP

Parallelizing the writeArray() function is much harder. The code below shows one possible solution:

void writeArray(int *counts, int *array) {
    int i;

    //assumes the number of threads is no more than MAX
    #pragma omp parallel for schedule(dynamic)
    for (i = 0; i < MAX; i++) {
        int j = 0, amt, start = 0;
        for (j = 0; j < i; j++) {  //calculate the "true" start position
            start += counts[j];
        }

        amt = counts[i]; //the number of array positions to fill

        //overwrite amt elements with value i, starting at position start
        for (j = start; j < start + amt; j++) {
            array[j] = i;
        }
    }
}

Prior to parallelizing this function, we made a change to it: in the old version of writeArray(), the write index j depended on the values computed in previous iterations of the loop. In this version, each thread calculates its unique start value from the sum of all the previous elements in counts.
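
To see the dependency that was removed, here is a rough reconstruction (ours, not the exact code in countSort.c) of what the serial writeArray() looks like: the write index j carries over from one iteration of the outer loop to the next, so iteration i cannot begin until all earlier iterations have finished.

//rough reconstruction of the serial writeArray(): j carries over between
//iterations of i, which prevents parallelizing the outer loop directly
void writeArray_serial( int *counts, int *array ) {
    int i, j = 0, amt;

    for (i = 0; i < MAX; i++) {
        amt = counts[i]; //number of copies of value i to write
        while (amt > 0) {
            array[j] = i; //j depends on all previous iterations of i
            j++;
            amt--;
        }
    }
}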

Once this dependency is removed, the parallelization is pretty straightforward:

  • The #pragma omp parallel for generates a team of threads and parallelizes the for loop by assigning each thread a subset of the loop's iterations. As a reminder, this pragma is a combination of the omp parallel and the omp for pragmas (which were used in the parallelization of countElems). A chunking approach to scheduling (like the one shown in the countElems function above) is not appropriate here, because each element in counts may have a radically different frequency; the threads would not have equal work, with some threads being assigned more work than others. Therefore, the schedule(dynamic) clause is employed, so that each thread completes its assigned iteration before requesting a new one from the thread manager.

  • Since each thread is writing to distinct array locations, mutual exclusion is not needed for this function.

Notice how much cleaner the OpenMP code is than the POSIX thread implementation. The code is very readable and required very little modification. This is one of the benefits of abstraction, in which the implementation details are hidden from the programmer.

However, a necessary trade-off for abstraction is control. The programmer assumes the compiler is "smart" enough to take care of the particulars of parallelization, and thus has an easier time parallelizing their application. However, the programmer no longer makes detailed decisions about those particulars. Without a clear idea of how OpenMP pragmas execute under the hood, it can be difficult to debug an OpenMP application or to know which pragma is the most appropriate to use at a given time.

14.7.4. Learning more about OpenMP

A deeper discussion of OpenMP is beyond the scope of this book. The references section lists some other useful free resources1,2 for learning OpenMP.

References:

  1. Blaise Barney. "OpenMP". https://computing.llnl.gov/tutorials/openMP/

  2. Richard Brown and Libby Shoop. "Multicore Programming with OpenMP". CSinParallel: Parallel Computing in the Computer Science curriculum. http://selkie.macalester.edu/csinparallel/modules/MulticoreProgramming/build/html/index.html