14.7. Implicit Threading with OpenMP

Thus far, we have presented shared memory programming using POSIX threads. While Pthreads are great for simple applications, they become increasingly difficult to use as programs grow more complex. POSIX threads are an example of explicit parallel programming, requiring a programmer to specify exactly what each thread is required to do and when each thread should start and stop.

With Pthreads, it can also be challenging to incrementally add parallelism to an existing sequential program. That is, one must often rewrite the program entirely to use threads, which is rarely desirable when attempting to parallelize a large, existing code base.

The Open Multiprocessing (OpenMP) library implements an implicit alternative to Pthreads. OpenMP is built into gcc and other popular compilers such as LLVM and Clang, and can be used with the C, C++, and Fortran programming languages. A key advantage of OpenMP is that it enables programmers to parallelize components of existing, sequential C code by adding pragmas (which are special compiler directives) to parts of the code. Pragmas specific to OpenMP begin with #pragma omp.

While detailed coverage of OpenMP is beyond the scope of this book, we cover some common pragmas and show how several can be used in the context of some sample applications.

14.7.1. Common pragmas

Here are some of the most commonly used pragmas in OpenMP programs:

#pragma omp parallel

This pragma creates a team of threads and has each thread run the code in its scope (usually a function call). An invocation of this pragma is usually equivalent to the pthread_create() and pthread_join() function pairing from our original discussion of Pthreads. The pragma may have a number of clauses, including the following:

  • num_threads: specifies the number of threads to create.

  • private: a list of variables that should be private (or local) to each thread. Variables that should be private to a thread can also be declared within the scope of the pragma (see below for an example). Each thread gets its own copy of each private variable.

  • shared: a list of variables that should be shared amongst the threads. There is one copy of each shared variable, and it is shared amongst all threads.

  • default: indicates whether the determination of which variables are shared and which are private should be left up to the compiler. In most cases, we want to use default(none) and specify explicitly which variables should be shared and which should be private (a brief sketch illustrating these clauses follows this list).
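
To make these clauses concrete, here is a minimal sketch of our own (it is not part of any program in this chapter) that combines num_threads, default(none), shared, and private in one parallel region; the variable names total and myid are placeholders:

#include <stdio.h>
#include <omp.h>

int main( void ) {
    long myid;         //listed in private: each thread gets its own copy
    long total = 100;  //listed in shared: one copy visible to all threads

    //create four threads; default(none) forces us to list every variable
    #pragma omp parallel num_threads(4) default(none) shared(total) private(myid)
    {
        myid = omp_get_thread_num();  //writing a private variable is safe
        printf( "thread %ld sees total = %ld\n", myid, total );
    }

    return 0;
}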

#pragma omp for

Specifies that each thread execute a subset of the iterations of a for loop. While the scheduling of the iterations is up to the system, the default is usually the "chunking" method first discussed in the scalar multiplication example. This is a static form of scheduling: each thread gets an assigned chunk of iterations and then processes the iterations in its chunk. However, OpenMP also makes dynamic scheduling easy. In dynamic scheduling, each thread receives a number of iterations and requests a new set upon completing the ones it was assigned. The scheduling policy can be set using the following clause:

  • schedule(dynamic): specifies that a dynamic form of scheduling should be used (a short sketch illustrating the clause follows). While this is advantageous in some cases, the static (default) form of scheduling is usually faster.
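
As a rough illustration of the schedule clause, consider the sketch below; the work() function and the bound N are placeholders of our own, standing in for a loop whose iterations take unpredictable amounts of time:

#define N 1000

//placeholder for a loop body whose cost varies from iteration to iteration
void work( int i ) {
    //... some amount of computation that depends on i ...
}

void process_all( void ) {
    int i;

    #pragma omp parallel
    {
        //threads grab batches of iterations as they finish, rather than
        //receiving one fixed chunk up front
        #pragma omp for schedule(dynamic)
        for (i = 0; i < N; i++) {
            work(i);
        }
    }
}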

#pragma omp parallel for

This pragma is a combination of the omp parallel and the omp for pragmas. Unlike the omp for pragma, the omp parallel for pragma also generates a team of threads before assigning each thread a set of iterations of the loop.

#pragma omp critical

This pragma is used to specify that the code under its scope should be treated as a critical section — that is, only one thread should execute the section of code at a time to ensure correct behavior.
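
For instance, the short sketch below (ours, not code from this chapter's programs) combines omp parallel for with omp critical to total the elements of an array; the critical section guarantees that only one thread at a time updates the shared variable sum:

#include <stdio.h>

#define N 100000

int main( void ) {
    static int values[N]; //static arrays are zero-initialized
    long sum = 0;
    int i;

    //each thread executes a subset of the loop's iterations
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        //only one thread at a time may update the shared sum
        #pragma omp critical
        {
            sum += values[i];
        }
    }

    printf( "sum = %ld\n", sum );
    return 0;
}

Note that placing a critical section inside a hot loop serializes the additions and gives up much of the parallelism; the countElems() function later in this section shows the more efficient pattern of accumulating into per-thread local storage and entering the critical section only once.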

There are also several library functions that a thread can call during execution. Three common ones are described below, followed by a short sketch that uses them together:

omp_get_num_threads()

returns the number of threads in the team that is currently executing.

omp_set_num_threads()

sets the number of threads that a team should have.

omp_get_thread_num()

returns the identifier of the calling thread.
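
The short sketch below (again, our own example rather than code from this chapter) uses all three functions together: it requests a team of four threads and divides the indices of an array amongst the threads by hand, in round-robin fashion:

#include <stdio.h>
#include <omp.h>

#define N 16

int main( void ) {
    int data[N], j;

    omp_set_num_threads(4); //request a team of four threads

    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads(); //size of the current team
        int myid = omp_get_thread_num();      //this thread's id (0 to nthreads-1)
        int i;

        //each thread fills every nthreads-th element, starting at its own id
        for (i = myid; i < N; i += nthreads) {
            data[i] = myid;
        }
    }

    for (j = 0; j < N; j++) {
        printf( "%d ", data[j] ); //print which thread filled each element
    }
    printf( "\n" );

    return 0;
}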

The omp parallel for directive only works with for loops!

Keep in mind that the omp parallel for pragma ONLY works with for loops. Other types of loops, such as while loops and do-while loops, are not supported (though such loops can often be rewritten as equivalent for loops, as sketched below).
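
When the work naturally lives in a while loop, it can often be recast as a for loop with an explicit iteration count so that the pragma applies. The sketch below illustrates the idea using a hypothetical process() function and bound N of our own:

#define N 1000

//hypothetical per-item work
void process( long i ) {
    //...
}

//NOT parallelizable with omp parallel for: the loop is a while loop
void serial_version( void ) {
    long i = 0;
    while (i < N) {
        process(i);
        i++;
    }
}

//equivalent for loop, which the pragma accepts
void parallel_version( void ) {
    long i;

    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        process(i);
    }
}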

14.7.2. Hello Threading: OpenMP flavored

Let’s revisit our "Hello World" (hellothreads.c) program, now using OpenMP instead of Pthreads:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void HelloWorld( void ) {
    long myid = omp_get_thread_num(); //get the calling thread's unique id
    printf( "Hello world! I am thread %ld\n", myid );
}

int main( int argc, char** argv ) {
    long nthreads;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <n>\n", argv[0]);
        fprintf(stderr, "where <n> is the number of threads\n");
        return 1;
    }

    nthreads = strtol( argv[1], NULL, 10 ); //convert argument to number of threads

    //create a team of nthreads threads, each of which calls HelloWorld()
    #pragma omp parallel num_threads(nthreads)
        HelloWorld();

    return 0;
}

Note that the OpenMP program is much shorter than the Pthreads version. To access the OpenMP library functions, we include the header file omp.h. The omp parallel num_threads(nthreads) pragma in main() creates a set of threads, each of which calls the HelloWorld() function. The clause num_threads(nthreads) specifies that a total of nthreads threads should be generated. The pragma also joins each created thread back to a single-threaded process. In other words, all the low-level work of creating and joining threads is abstracted away from the programmer and is accomplished with the inclusion of just one pragma. For this reason, OpenMP is considered an implicit threading library.

OpenMP also abstracts away the need to explicitly manage thread ids. In the context of HelloWorld(), the omp_get_thread_num() function extracts the unique id associated with the thread that is running it.

Compiling the code

Let’s compile and run this program by passing the -fopenmp flag to the compiler, which signals that we’re compiling with OpenMP:

$ gcc -o hello_mp hello_mp.c -fopenmp

$ ./hello_mp 4
Hello world! I am thread 2
Hello world! I am thread 3
Hello world! I am thread 0
Hello world! I am thread 1

Since the order in which the threads execute can change from run to run, re-running this program may result in a different sequence of messages:

$ ./hello_mp 4
Hello world! I am thread 3
Hello world! I am thread 2
Hello world! I am thread 1
Hello world! I am thread 0

This behavior is consistent with our earlier Pthreads example.

14.7.3. A more complex example: CountSort in OpenMP

A powerful advantage of OpenMP is that it enables programmers to incrementally parallelize their code. To see this in action, let’s parallelize the more complex CountSort algorithm discussed earlier in this chapter (the serial code is located here: countSort.c). Recall that this algorithm sorts arrays containing a small range of values. The main function of the serial program looks like the following:

int main( int argc, char **argv ) {
    //parse args (omitted for brevity)

    srand(10); //use of static seed ensures the output is the same every run

    //generate random array of elements of specified length (omitted for brevity)

    //allocate the counts array and initialize all elements to zero
    int counts[MAX] = {0};

    countElems(counts, array, length); //calls step 1
    writeArray(counts, array); //calls step 2

    free(array); //free memory

    return 0;
}

The main() function, after doing some command line parsing and generating a random array, calls the countElems() function followed by the writeArray() function.

Parallelizing CountElems using OpenMP

There are several ways to parallelize the above program. One way (shown below) uses the omp parallel pragma in the context of the countElems() and writeArray() functions. As a result, no changes need to be made to the main() function. A full version of the program can be accessed in countSort_mp.c.

First, let’s examine how to parallelize the countElems() function using OpenMP:

void countElems(int *counts, int *array, long length) {

    #pragma omp parallel default(none) shared(counts, array, length)
    {
        int val, i, local[MAX] = {0};
        #pragma omp for
        for (i = 0; i < length; i++) {
            val = array[i];
            local[val]++;
        }

        #pragma omp critical
        {
            for (i = 0; i < MAX; i++) {
                counts[i] += local[i];
            }
        }
    }
}

In this version of the code, there are three pragmas employed:

  • The #pragma omp parallel indicates that a team of threads should be created. The omp_set_num_threads(nthreads) line in main() sets the default size of the thread team to nthreads (a sketch of the modified main() appears after this list). If the omp_set_num_threads() function is not used, then the number of threads assigned will equal the number of cores on the system. As a reminder, the omp parallel pragma implicitly creates threads at the beginning of the block and joins them at the end of the block. Braces ({}) are used to specify scope. The shared clause declares that the variables counts, array, and length are shared (global) amongst all the threads. In contrast, the variables val, i, and local[MAX] are declared within the scope of the pragma, so each thread gets its own private copy.

  • The next pragma is #pragma omp for, which parallelizes the for loop by splitting its iterations amongst the threads. OpenMP calculates how best to split up the iterations of the loop. As previously mentioned, the default strategy is usually a chunking method, in which each thread gets roughly the same number of iterations to compute. Thus, each thread reads a portion of the shared array array and accumulates its counts in its local array local.

  • The #pragma omp critical indicates that the code in the scope of the critical section should be executed by exactly one thread at a time. This is equivalent to the mutex that was employed in the Pthreads version of this program. Here, each thread adds the values in its local array to the shared counts array, one thread at a time.
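
For reference, here is a sketch of how main() might change from the serial version shown earlier; aside from the call to omp_set_num_threads(), the function is unchanged (this is our reconstruction of the idea, and countSort_mp.c may differ in its details):

int main( int argc, char **argv ) {
    //parse args, including length and nthreads (omitted for brevity)

    srand(10); //use of static seed ensures the output is the same every run

    //generate random array of elements of specified length (omitted for brevity)

    //allocate the counts array and initialize all elements to zero
    int counts[MAX] = {0};

    omp_set_num_threads(nthreads); //set the default team size for the pragmas

    countElems(counts, array, length); //step 1 (now parallelized with OpenMP)
    writeArray(counts, array); //step 2 (now parallelized with OpenMP)

    free(array); //free memory

    return 0;
}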

Let’s get a sense of the performance of this function by running it on 100 million elements:

$ ./countElems_mp 100000000 1
Run Time for Phase 1 is 0.249893

$ ./countElems_mp 100000000 2
Run Time for Phase 1 is 0.124462

$ ./countElems_mp 100000000 4
Run Time for Phase 1 is 0.068749

This is excellent performance, with our function getting a speedup of 2 on two threads and a speedup of 3.63 on four threads. We get even better performance than the Pthreads implementation!

The writeArray() function in OpenMP

Parallelizing the writeArray() function is much harder. The code below shows one possible solution:

void writeArray(int *counts, int *array) {
    int i;

    //assumes the number of threads is no more than MAX
    #pragma omp parallel for schedule(dynamic)
    for (i = 0; i < MAX; i++) {
        int j = 0, amt, start = 0;
        for (j = 0; j < i; j++) {  //calculate the "true" start position
            start += counts[j];
        }

        amt = counts[i]; //the number of array positions to fill

        //overwrite amt elements with value i, starting at position start
        for (j = start; j < start + amt; j++) {
            array[j] = i;
        }
    }
}

Prior to parallelizing this function, we made a change to it: in the old version of writeArray(), the write index j depended on the values computed in previous iterations of the loop. In this version, each thread calculates its unique start value from the sum of all the previous elements in counts.
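
To see the dependency that was removed, here is a rough reconstruction (ours, not the exact code in countSort.c) of what the serial writeArray() looks like: the write index j carries over from one iteration of the outer loop to the next, so iteration i cannot begin until all earlier iterations have finished.

//rough reconstruction of the serial writeArray(): j carries over between
//iterations of i, which prevents parallelizing the outer loop directly
void writeArray_serial( int *counts, int *array ) {
    int i, j = 0, amt;

    for (i = 0; i < MAX; i++) {
        amt = counts[i]; //number of copies of value i to write
        while (amt > 0) {
            array[j] = i; //j depends on all previous iterations of i
            j++;
            amt--;
        }
    }
}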

Once this dependency is removed, the parallelization is pretty straightforward:

  • The #pragma omp parallel for generates a team of threads and parallelizes the for loop by assigning each thread a subset of the loop's iterations. As a reminder, this pragma is a combination of the omp parallel and the omp for pragmas (which were used in the parallelization of countElems). A chunking approach to scheduling (like the one shown in the countElems function above) is not appropriate here, because each element in counts may have a radically different frequency; the threads would not have equal work, with some threads being assigned more work than others. Therefore, the schedule(dynamic) clause is employed, so that each thread completes its assigned iteration before requesting a new one from the thread manager.

  • Since each thread is writing to distinct array locations, mutual exclusion is not needed for this function.

Notice how much cleaner the OpenMP code is than the POSIX thread implementation. The code is very readable and required very little modification. This is one of the benefits of abstraction, in which the implementation details are hidden from the programmer.

However, a necessary trade-off for abstraction is control. The programmer assumes the compiler is "smart" enough to take care of the particulars of parallelization, and thus has an easier time parallelizing their application. However, the programmer no longer makes detailed decisions about those particulars. Without a clear idea of how OpenMP pragmas execute under the hood, it can be difficult to debug an OpenMP application or to know which pragma is the most appropriate to use at a given time.

14.7.4. Learning more about OpenMP

A deeper discussion of OpenMP is beyond the scope of this book. The references section lists some other useful free resources1,2 for learning OpenMP.

References:

  1. Blaise Barney. "OpenMP". https://computing.llnl.gov/tutorials/openMP/

  2. Richard Brown and Libby Shoop. "Multicore Programming with OpenMP". CSinParallel: Parallel Computing in the Computer Science curriculum. http://selkie.macalester.edu/csinparallel/modules/MulticoreProgramming/build/html/index.html