14.7. Implicit Threading with OpenMP
Thus far, we have presented shared memory programming using POSIX threads. Although Pthreads work well for simple applications, they become increasingly difficult to use as programs grow more complex. POSIX threads are an example of explicit parallel programming: the programmer must specify exactly what each thread does and when each thread starts and stops.
With Pthreads, it can also be challenging to add parallelism incrementally to an existing sequential program. That is, one must often rewrite the program entirely to use threads, which is rarely desirable when parallelizing a large, existing codebase.
The Open Multiprocessing (OpenMP) library implements an implicit alternative to Pthreads. OpenMP is built into GCC and other popular compilers such as LLVM and Clang, and can be used with the C, C++, and Fortran programming languages. A key advantage of OpenMP is that it enables programmers to parallelize components of existing, sequential C code by adding pragmas (special compiler directives) to parts of the code. Pragmas specific to OpenMP begin with #pragma omp.
Detailed coverage of OpenMP is outside the scope of this book, but we cover some common pragmas, and show how several can be used in the context of some sample applications.
14.7.1. Common Pragmas
Here are some of the most commonly used pragmas in OpenMP programs:
#pragma omp parallel
This pragma creates a team of threads and has each thread run the code in its scope (usually a function call). An invocation of this pragma is usually equivalent to the pthread_create and pthread_join function pairing discussed in our original coverage of Pthreads. The pragma may have a number of clauses, including the following:

- num_threads: specifies the number of threads to create.

- private: a list of variables that should be private (or local) to each thread. Variables that should be private to a thread can also be declared within the scope of the pragma (see below for an example). Each thread gets its own copy of each variable.

- shared: a list of variables that should be shared among the threads. There is one copy of each variable, which is shared by all threads.

- default: indicates whether the determination of which variables should be shared is left up to the compiler. In most cases, we want to use default(none) and specify explicitly which variables should be shared and which should be private.

#pragma omp for
Specifies that each thread execute a subset of iterations of a for loop. Although the scheduling of the iterations is up to the system, the default is usually the "chunking" method first discussed in the scalar multiplication example. This is a static form of scheduling: each thread gets an assigned chunk and then processes the iterations in that chunk. However, OpenMP also makes dynamic scheduling easy. In dynamic scheduling, each thread receives a set of iterations and requests a new set when it finishes processing its current one. The scheduling policy can be set using the following clause:

- schedule(dynamic): specifies that a dynamic form of scheduling should be used. While this is advantageous in some cases, the static (default) form of scheduling is usually faster.

#pragma omp parallel for
This pragma is a combination of the omp parallel and the omp for pragmas. Unlike the omp for pragma, the omp parallel for pragma also generates a team of threads before assigning each thread a set of iterations of the loop.

#pragma omp critical
This pragma specifies that the code in its scope should be treated as a critical section; that is, only one thread should execute that section of code at a time to ensure correct behavior. (A short sketch combining several of these pragmas follows this list.)
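To see how these pragmas and clauses fit together, here is a minimal sketch that sums an array with a team of threads (this example is our own, not taken from the chapter's programs; the function name sumArray is hypothetical). Each thread keeps a private partial sum and merges it into the shared total inside a critical section:

#include <omp.h>

long sumArray(int *array, long length) {
    long total = 0;

    #pragma omp parallel default(none) shared(array, length, total)
    {
        long i, partial = 0; //declared inside the block, so private to each thread

        #pragma omp for //split the loop iterations among the threads
        for (i = 0; i < length; i++) {
            partial += array[i];
        }

        #pragma omp critical //only one thread at a time updates the shared total
        {
            total += partial;
        }
    }

    return total;
}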
There are also several library functions that a thread can call that are often useful during execution (a short usage sketch follows this list). For example:

omp_get_num_threads
returns the number of threads in the team that is currently executing.

omp_set_num_threads
sets the number of threads that a team should have.

omp_get_thread_num
returns the identifier of the calling thread.
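For instance, a small program (our own example, not from the text) might use all three functions together; the team is sized with omp_set_num_threads, and each thread reports its own ID along with the team size:

#include <stdio.h>
#include <omp.h>

int main( void ) {
    omp_set_num_threads(4); //request a team of four threads

    #pragma omp parallel
    {
        int myid = omp_get_thread_num(); //this thread's identifier
        int team = omp_get_num_threads(); //number of threads in the current team
        printf( "I am thread %d in a team of %d threads\n", myid, team );
    }

    return 0;
}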
Keep in mind that the omp parallel for directive works only with for loops!
14.7.2. Hello Threading: OpenMP flavored
Let’s revisit our "Hello World" (hellothreads.c) program, now using OpenMP instead of Pthreads:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void HelloWorld( void ) {
    long myid = omp_get_thread_num();
    printf( "Hello world! I am thread %ld\n", myid );
}

int main( int argc, char** argv ) {
    long nthreads;
    if (argc != 2) {
        fprintf(stderr, "usage: %s <n>\n", argv[0]);
        fprintf(stderr, "where <n> is the number of threads\n");
        return 1;
    }

    nthreads = strtol( argv[1], NULL, 10 );

    #pragma omp parallel num_threads(nthreads)
        HelloWorld();

    return 0;
}
Note that the OpenMP program is much shorter than the Pthreads version. To access the OpenMP library functions, we include the header file omp.h.
The omp parallel num_threads(nthreads) pragma in main creates a team of threads, where each thread calls the HelloWorld function. The clause num_threads(nthreads) specifies that a total of nthreads threads should be generated. The pragma also joins each created thread back into a single-threaded process. In other words, all the low-level work of creating and joining threads is abstracted away from the programmer and is accomplished with the inclusion of just one pragma. For this reason, OpenMP is considered an implicit threading library.
OpenMP also abstracts away the need to explicitly manage thread IDs. In the context of HelloWorld, the omp_get_thread_num function returns the unique ID associated with the thread that is running it.
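For comparison, the explicit Pthreads boilerplate that this single pragma replaces might look roughly like the following sketch (thread IDs are passed and managed by hand; argument and error checking are omitted, and this is not the exact hellothreads.c from earlier in the chapter):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

void *HelloWorld( void *id ) {
    long myid = (long) id;
    printf( "Hello world! I am thread %ld\n", myid );
    return NULL;
}

int main( int argc, char** argv ) {
    long t, nthreads = strtol( argv[1], NULL, 10 );
    pthread_t *threads = malloc( nthreads * sizeof(pthread_t) );

    for (t = 0; t < nthreads; t++) { //explicitly create each thread
        pthread_create( &threads[t], NULL, HelloWorld, (void *) t );
    }
    for (t = 0; t < nthreads; t++) { //explicitly join each thread
        pthread_join( threads[t], NULL );
    }

    free(threads);
    return 0;
}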
Compiling the code
Let’s compile and run this program by passing the -fopenmp
flag to the
compiler, which signals that we’re compiling with OpenMP:
$ gcc -o hello_mp hello_mp.c -fopenmp

$ ./hello_mp 4
Hello world! I am thread 2
Hello world! I am thread 3
Hello world! I am thread 0
Hello world! I am thread 1
Since the order in which threads execute can change from run to run, rerunning this program results in a different sequence of messages:
$ ./hello_mp 4
Hello world! I am thread 3
Hello world! I am thread 2
Hello world! I am thread 1
Hello world! I am thread 0
This behavior is consistent with our example with Pthreads.
14.7.3. A More Complex Example: CountSort in OpenMP
A powerful advantage of OpenMP is that it enables programmers to incrementally parallelize their code. To see this in action, let’s parallelize the more complex CountSort algorithm discussed earlier in this chapter (the serial code is located here: countSort.c). Recall that this algorithm sorts arrays containing a small range of values. The main function of the serial program looks like the following:
int main( int argc, char **argv ) {
    //parse args (omitted for brevity)
    srand(10); //use of static seed ensures the output is the same every run

    //generate random array of elements of specified length
    //(omitted for brevity)

    //allocate counts array and initialize all elements to zero
    int counts[MAX] = {0};

    countElems(counts, array, length); //calls step 1
    writeArray(counts, array); //calls step 2

    free(array); //free memory

    return 0;
}
The main function, after doing some command line parsing and generating a random array, calls the countElems function followed by the writeArray function.
Parallelizing CountElems Using OpenMP
There are several ways to parallelize the preceding program. One way
(shown in the example that follows) uses the omp parallel
pragma in the context of
the countElems
and writeArray
functions. As a result, no changes need to
be made to the main
function. A full version of the program is available
at: countSort_mp.c.
First, let’s examine how to parallelize the countElems
function using OpenMP:
void countElems(int *counts, int *array, long length) {
    #pragma omp parallel default(none) shared(counts, array, length)
    {
        int val, i, local[MAX] = {0}; //declared inside the block, so private to each thread

        #pragma omp for //split the loop iterations among the threads
        for (i = 0; i < length; i++) {
            val = array[i];
            local[val]++;
        }

        #pragma omp critical //one thread at a time adds its local counts to the shared array
        {
            for (i = 0; i < MAX; i++) {
                counts[i] += local[i];
            }
        }
    }
}
In this version of the code, three pragmas are employed:

- The #pragma omp parallel pragma indicates that a team of threads should be created. The omp_set_num_threads(nthreads) line in main (a sketch of which appears after this list) sets the default size of the thread team to nthreads. If the omp_set_num_threads function is not used, the number of threads created will equal the number of cores in the system. As a reminder, the omp parallel pragma implicitly creates threads at the beginning of the block and joins them at the end of the block. Braces ({}) are used to specify scope. The shared clause declares that the variables counts, array, and length are shared (global) among all the threads. The variables val, i, and local, which are declared inside the scope of the pragma, are private to each thread.

- The next pragma is #pragma omp for, which parallelizes the for loop, splitting its iterations among the threads. OpenMP calculates how best to split up the iterations of the loop. As previously mentioned, the default strategy is usually a chunking method, wherein each thread gets roughly the same number of iterations to compute. Thus, each thread reads a portion of the shared array array and accumulates its counts in its local array local.

- The #pragma omp critical pragma indicates that the code in the scope of the critical section should be executed by exactly one thread at a time. This is equivalent to the mutex that was employed in the Pthreads version of this program. Here, each thread adds its local counts to the shared counts array one at a time.
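The main function of countSort_mp.c is not shown here, but a rough sketch of how it might set the team size before running the two phases could look like the following (this is our own approximation, assuming the thread count arrives as a second command line argument and omitting argument checking and random array generation):

#include <stdlib.h>
#include <omp.h>

#define MAX 10 //assumed upper bound on array values, matching the serial version

void countElems(int *counts, int *array, long length); //defined above
void writeArray(int *counts, int *array); //defined below

int main( int argc, char **argv ) {
    long length = strtol( argv[1], NULL, 10 ); //number of elements to sort
    long nthreads = strtol( argv[2], NULL, 10 ); //assumed: thread count as second argument

    int *array = malloc( length * sizeof(int) );
    //fill array with random values in [0, MAX) (omitted for brevity)

    int counts[MAX] = {0};

    omp_set_num_threads(nthreads); //default team size for later parallel regions

    countElems(counts, array, length); //step 1 (uses omp parallel)
    writeArray(counts, array); //step 2 (uses omp parallel for)

    free(array);
    return 0;
}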
Let’s get a sense of the performance of this function by running it with 100 million elements:
$ ./countElems_mp 100000000 1
Run Time for Phase 1 is 0.249893

$ ./countElems_mp 100000000 2
Run Time for Phase 1 is 0.124462

$ ./countElems_mp 100000000 4
Run Time for Phase 1 is 0.068749
This is excellent performance: the function achieves a speedup of 2.0 on two threads (0.249893 / 0.124462) and a speedup of 3.63 on four threads (0.249893 / 0.068749). We get even better performance than the Pthreads implementation!
The writeArray Function in OpenMP
Parallelizing the writeArray
function is much harder. The following code
shows one possible solution:
void writeArray(int *counts, int *array) {
    int i;

    //assumes the number of threads is no more than MAX
    #pragma omp parallel for schedule(dynamic)
    for (i = 0; i < MAX; i++) {
        int j = 0, amt, start = 0;

        for (j = 0; j < i; j++) { //calculate the "true" start position
            start += counts[j];
        }

        amt = counts[i]; //the number of array positions to fill

        //overwrite amt elements with value i, starting at position start
        for (j = start; j < start + amt; j++) {
            array[j] = i;
        }
    }
}
Prior to parallelizing, we made a change to this function, because the old version of writeArray caused j to have a dependency on the previous iterations of the loop. In this version, each thread calculates its unique start value based on the sum of all the previous elements in counts.
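For illustration, a version of writeArray with that kind of loop-carried dependency might have looked something like the following (a hypothetical reconstruction, not the authors' original code): the write position j carries over from one iteration of the outer loop to the next, so the iterations cannot safely be split among threads.

void writeArray_dependent(int *counts, int *array) {
    int i, amt, j = 0; //j persists across iterations of the outer loop

    for (i = 0; i < MAX; i++) {
        amt = counts[i]; //the number of array positions to fill with value i
        while (amt > 0) {
            array[j] = i; //where this write lands depends on every earlier iteration
            j++;
            amt--;
        }
    }
}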
When this dependency is removed, the parallelization is pretty straightforward.
The #pragma omp parallel for pragma generates a team of threads and parallelizes the for loop by assigning each thread a subset of the iterations of the loop. As a reminder, this pragma is a combination of the omp parallel and the omp for pragmas (which were used in the parallelization of countElems).
A chunking approach to scheduling threads (as shown in the earlier countElems function) is not appropriate here, because each element in counts may have a radically different frequency. The threads therefore do not have equal work, with some threads being assigned much more than others. For this reason, the schedule(dynamic) clause is employed, so that each thread completes its assigned iteration before requesting a new one from the thread manager.
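For reference, the schedule clause also accepts an optional chunk size, and static scheduling can be requested explicitly. The following small sketch (our own example, not part of countSort; the function scaleData is hypothetical) contrasts the two forms:

#include <omp.h>

void scaleData(double *data, long n) {
    long i;

    //static: iterations are divided into fixed chunks up front (the usual default)
    #pragma omp parallel for schedule(static)
    for (i = 0; i < n; i++) {
        data[i] = data[i] * 2.0;
    }

    //dynamic with a chunk size of 4: each thread takes 4 iterations at a time and
    //requests more as it finishes, which helps when iterations do unequal amounts of work
    #pragma omp parallel for schedule(dynamic, 4)
    for (i = 0; i < n; i++) {
        data[i] = data[i] + 1.0;
    }
}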
Since each thread is writing to distinct array locations, mutual exclusion is not needed for this function.
Notice how much cleaner the OpenMP code is than the POSIX thread implementation. The code is very readable and required very little modification. This is one of the powers of abstraction, in which the implementation details are hidden from the programmer.
However, a necessary trade-off for abstraction is control. The programmer assumes that the compiler is "smart" enough to take care of the particulars of parallelization and thus has an easier time parallelizing their application, but they no longer make detailed decisions about those particulars. Without a clear idea of how OpenMP pragmas execute under the hood, it can be difficult to debug an OpenMP application or to know which pragma is the most appropriate to use at a given time.
14.7.4. Learning More about OpenMP
A deeper discussion of OpenMP is beyond the scope of this book, but there are useful free resources [1, 2] for learning OpenMP.
References:

1. Blaise Barney. "OpenMP". https://hpc.llnl.gov/tuts/openMP/

2. Richard Brown and Libby Shoop. "Multicore Programming with OpenMP". CSinParallel: Parallel Computing in the Computer Science curriculum. http://selkie.macalester.edu/csinparallel/modules/MulticoreProgramming/build/html/index.html