14.2. Hello Threading! Writing Your First Multithreaded Program
In this section, we examine the ubiquitous POSIX thread library Pthreads. POSIX is an acronym for Portable Operating System Interface. It is an IEEE standard that specifies how UNIX systems look, act, and feel. The POSIX threads API is available on almost all UNIX-like operating systems, each of which meets the standard in its entirety or to some great degree. So, if you write parallel code using POSIX threads on a Linux machine, it will certainly work on other Linux machines, and it will likely work on machines running macOS or other UNIX variants.
Let’s begin by analyzing an example "Hello World" Pthreads program (hellothreads.c). For brevity, we have excluded error handling in the listing, though the downloadable version contains sample error handling.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
/* The "thread function" passed to pthread_create. Each thread executes this
* function and terminates when it returns from this function. */
void *HelloWorld(void *id) {
/* We know the argument is a pointer to a long, so we cast it from a
* generic (void *) to a (long *). */
long *myid = (long *) id;
printf("Hello world! I am thread %ld\n", *myid);
return NULL; // We don't need our threads to return anything.
}
int main(int argc, char **argv) {
int i;
int nthreads; //number of threads
pthread_t *thread_array; //pointer to future thread array
long *thread_ids;
// Read the number of threads to create from the command line.
if (argc !=2) {
fprintf(stderr, "usage: %s <n>\n", argv[0]);
fprintf(stderr, "where <n> is the number of threads\n");
return 1;
}
nthreads = strtol(argv[1], NULL, 10);
// Allocate space for thread structs and identifiers.
thread_array = malloc(nthreads * sizeof(pthread_t));
thread_ids = malloc(nthreads * sizeof(long));
// Assign each thread an ID and create all the threads.
for (i = 0; i < nthreads; i++) {
thread_ids[i] = i;
pthread_create(&thread_array[i], NULL, HelloWorld, &thread_ids[i]);
}
/* Join all the threads. Main will pause in this loop until all threads
* have returned from the thread function. */
for (i = 0; i < nthreads; i++) {
pthread_join(thread_array[i], NULL);
}
free(thread_array);
free(thread_ids);
return 0;
}
Let’s examine this program in smaller components.
-
Notice the inclusion of the
pthread.h
header file, which declarespthread
types and functions. -
Next, the
HelloWorld
function defines the thread function that we later pass topthread_create
. A thread function is analogous to amain
function for a worker (created) thread — a thread begins execution at the start of its thread function and terminates when it reaches the end. Each thread executes the thread function using its private execution state (i.e., its own stack memory and register values). Note also that the thread function is of typevoid*
. Specifying an anonymous pointer in this context allows programmers to write thread functions that deal with arguments and return values of different types. -
Lastly, in the
main
function, the main thread initializes the program state before creating and joining the worker threads.
14.2.1. Creating and Joining Threads
The program first starts as a single-threaded process. As it executes the
main
function, it reads the number of threads to create, and it allocates
memory for two arrays: thread_array
and thread_ids
. The thread_array
array contains the set of addresses for each thread created. The thread_ids
array stores the set of arguments that each thread is passed. In this example,
each thread is passed the address of its rank (or ID, represented by thread_ids[i]
).
After all the preliminary variables are allocated and initialized, the main
thread executes the two major steps of multithreading:
-
The creation step, in which the main thread spawns one or more worker threads. After being spawned, each worker thread runs within its own execution context concurrently with the other threads and processes on the system.
-
The join step, in which the main thread waits for all the workers to complete before proceeding as a single-thread process. Joining a thread that has terminated frees the thread’s execution context and resources. Attempting to join a thread that hasn’t terminated blocks the caller until the thread terminates, similar to the semantics of the wait function for processes.
The Pthreads library offers a pthread_create
function for creating threads
and a pthread_join
function for joining them. The pthread_create
function has the following signature:
pthread_create(pthread_t *thread, const pthread_attr_t *attr,
void *thread_function, void *thread_args)
The function takes a pointer to a thread struct (of type pthread_t
), a pointer to an
attribute struct (normally set to NULL
), the name of the function the thread
should execute, and the array of arguments to pass to the thread function
when it starts.
The Hello World program calls pthread_create
in the main
function using:
pthread_create(&thread_array[i], NULL, HelloWorld, &thread_ids[i]);
Here:
-
&thread_array[i]
contains the address of thread i. Thepthread_create
function allocates apthread_t
thread object and stores its address at this location, enabling the programmer to reference the thread later (e.g., when joining it). -
NULL
specifies that the thread should be created with default attributes. In most programs, it is safe to leave this second parameter asNULL
. -
HelloWorld
names the thread function that the created thread should execute. This function behaves like the "main" function for the thread. For an arbitrary thread function (e.g.,function
), its prototype must match the formvoid * function(void *)
. -
&thread_ids[i]
specifies the address of the arguments to be passed to thread i. In this case,thread_ids[i]
contains a singlelong
representing the thread’s ID. Since the last argument topthread_create
must be a pointer, we pass the address of the thread’s ID.
To generate several threads that execute the HelloWorld
thread function, the
program assigns each thread a unique ID and creates
each thread within a for
loop:
for (i = 0; i < nthreads; i++) {
thread_ids[i] = i;
pthread_create(&thread_array[i], NULL, HelloWorld, &thread_ids[i]);
}
The OS schedules the execution of each created thread; the user cannot make any assumption on the order in which the threads will execute.
The pthread_join
function suspends the execution of its caller until
the thread it references terminates. Its signature is:
pthread_join(pthread_t thread, void **return_val)
The pthread_join
takes as input a pthread_t
struct,
indicating which thread to wait on, and an optional pointer
argument that specifies where the thread’s return value should be stored.
The Hello World program calls pthread_join
in main
using:
pthread_join(thread_array[t], NULL);
This line indicates that the main thread must wait on the termination of
thread t
. Passing NULL
as the second argument indicates that
the program does not use the thread’s return value.
In the previous program, main
calls pthread_join
in
a loop because all of the worker threads need to terminate before the main
function proceeds to clean up memory and terminate the process:
for (i = 0; i < nthreads; i++) {
pthread_join(thread_array[i], NULL);
}
14.2.2. The Thread Function
In the previous program, each spawned thread prints out Hello world! I am thread
n
, where n
is the thread’s unique id. After the thread prints out its
message, it terminates. Let’s take a closer look at the HelloWorld
function:
void *HelloWorld(void *id) {
long *myid = (long*)id;
printf("Hello world! I am thread %ld\n", *myid);
return NULL;
}
Recall that pthread_create
passes the arguments to the thread function
using the thread_args
parameter. In the pthread_create
function in
main
, the Hello World program specified that this parameter is in fact
the thread’s ID. Note that the parameter to HelloWorld
must be declared as a generic or
anonymous pointer (void *
). The
Pthreads library uses void *
to make pthread_create
more general purpose by not
prescribing a parameter type. As a programmer, the void *
is mildly
inconvenient given that it must be recast before use. Here, we know the
parameter is of type long *
because that’s what we passed to pthread_create
in main
. Thus, we can safely cast the value as a long *
and dereference
the pointer to access the long
value. Many parallel programs follow this
structure.
Similar to the thread function’s parameter, the Pthreads library avoids
prescribing the thread function’s return type by specifying another void *
— the programmer is free to return any pointer from the thread function. If the
program needs to access the thread’s return value, it can retrieve it via the
second argument to pthread_join
. In our example, the thread has no need to
return a value, so it simply returns a NULL
pointer.
14.2.3. Running the Code
The command that follows shows how to use GCC to compile hellothreads.c.
Building a Pthreads application requires that the -pthread
linker
flag be passed to GCC to ensure that the Pthreads functions and types are
accessible:
$ gcc -o hellothreads hellothreads.c -pthread
Running the program without a command line argument results in a usage message:
$ ./hellothreads usage: ./hellothreads <n> where <n> is the number of threads
Running the program with four threads yields the following output:
$ ./hellothreads 4 Hello world! I am thread 1 Hello world! I am thread 2 Hello world! I am thread 3 Hello world! I am thread 0
Notice that each thread prints its unique ID number. In this run, thread 1’s output displays first, followed by threads 2, 3, and 0. If we run the program again, we may see the output displayed in a different order:
$ ./hellothreads 4 Hello world! I am thread 0 Hello world! I am thread 1 Hello world! I am thread 2 Hello world! I am thread 3
Recall that the operating system’s scheduler determines the thread execution
order. From a user’s perspective, the order is effectively random due to
being influenced by many factors that vary outside the user’s control
(e.g., available system resources, the system receiving input, or OS
scheduling). Since all threads are running concurrently with one another and each thread
executes a call to printf
(which prints to stdout
), the first thread that
prints to stdout
will have its output show up first. Subsequent executions may
(or may not) result in different output.
Thread Execution Order
You should never make any assumptions about the order in which threads will execute. If the correctness of your program requires that threads run in a particular order, you must add synchronization to your program to prevent threads from running when they shouldn’t. |
14.2.4. Revisiting Scalar Multiplication
Let’s explore how to create a multithreaded implementation of the
scalar multiplication program
from the previous section. Recall that our general strategy for
parallelizing scalar_multiply
is to:
-
Create multiple threads,
-
Assign each thread a subset of the input array,
-
Instruct each thread to multiply the elements in its array subset by
s
.
The following is a thread function that accomplishes this task. Notice that we have
moved array
, length
, and s
to the global scope of the program.
long *array; //allocated in main
long length; //set in main (1 billion)
long nthreads; //number of threads
long s; //scalar
void *scalar_multiply(void *id) {
long *myid = (long *) id;
int i;
//assign each thread its own chunk of elements to process
long chunk = length / nthreads;
long start = *myid * chunk;
long end = start + chunk;
if (*myid == nthreads - 1) {
end = length;
}
//perform scalar multiplication on assigned chunk
for (i = start; i < end; i++) {
array[i] *= s;
}
return NULL;
}
Let’s break this down into parts. Recall that the first step is to assign each thread a component of the array. The following lines accomplish this task:
long chunk = length / nthreads;
long start = *myid * chunk;
long end = start + chunk;
The variable chunk
stores the number of elements that each thread is
assigned. To ensure that each thread gets roughly the same amount of work, we first
set the chunk size to the number of elements divided by the number of threads,
or length / nthreads
.
Next, we assign each thread a distinct range of elements to process. Each
thread computes its range’s start
and end
index using the chunk
size and
its unique thread ID.
For example, with four threads (with IDs 0-3) operating over
an array with 100 million elements, each thread is responsible for processing a
25 million element chunk
. Incorporating the thread ID assigns each thread a
unique subset of the input.
The next two lines account for the case in which length
is not evenly divisible by
the number of threads:
if (*myid == nthreads - 1) {
end = length;
}
Suppose that we specified three rather than four threads. The nominal chunk size would be 33,333,333 elements, leaving one element unaccounted for. The code in the previous example would assign the remaining element to the last thread.
Creating balanced input
The chunking code just shown is imperfect. In the case where the number of threads does not evenly divide the input, the remainder is assigned to the last thread. Consider a sample run in which the array has 100 elements, and 12 threads are specified. The nominal chunk size would be 8, and the remainder would be 4. With the example code, the first 11 threads will each have 8 assigned elements, whereas the last thread will be assigned 12 elements. Consequently, the last thread performs 50% more work than the other threads. A potentially better way to chunk this example is to have the first 4 threads process 9 elements each, while the last 8 threads process 8 elements each. This will result in better load balancing of the input across the threads. |
With an appropriate local start
and end
index computed, each thread is now ready to
perform scalar multiplication on its component of the array. The last portion of the scalar_multiply
function
accomplishes this:
for (i = start; i < end; i++) {
array[i] *= s;
}
14.2.5. Improving Scalar Multiplication: Multiple Arguments
A key weakness of the previous implementation is the wide use of global
variables. Our original discussion of global variables
showed that although useful, global variables should generally be avoided in C. To reduce
the number of global variables in the program, one solution is
to declare a t_arg
struct as follows in the global scope:
struct t_arg {
int *array; // pointer to shared array
long length; // num elements in array
long s; //scaling factor
long numthreads; // total number of threads
long id; // logical thread id
};
Our main function would, in addition to allocating array
and setting
local variables length
, nthreads
, and s
(our scaling factor),
allocate an array of t_arg
records:
long nthreads = strtol(argv[1], NULL, 10); //get number of threads
long length = strtol(argv[2], NULL, 10); //get length of array
long s = strtol( argv[3], NULL, 10 ); //get scaling factor
int *array = malloc(length*sizeof(int));
//allocate space for thread structs and identifiers
pthread_t *thread_array = malloc(nthreads * sizeof(pthread_t));
struct t_arg *thread_args = malloc(nthreads * sizeof(struct t_arg));
//Populate thread arguments for all the threads
for (i = 0; i < nthreads; i++){
thread_args[i].array = array;
thread_args[i].length = length;
thread_args[i].s = s;
thread_args[i].numthreads = nthreads;
thread_args[i].id = i;
}
Later in main
, when pthread_create
is called, the thread’s associated
t_args
struct is passed as an argument:
for (i = 0; i < nthreads; i++){
pthread_create(&thread_array[i], NULL, scalar_multiply, &thread_args[i]);
}
Lastly, our scalar_multiply
function would look like the following:
void * scalar_multiply(void* args) {
//cast to a struct t_arg from void*
struct t_arg * myargs = (struct t_arg *) args;
//extract all variables from struct
long myid = myargs->id;
long length = myargs->length;
long s = myargs->s;
long nthreads = myargs->numthreads;
int * ap = myargs->array; //pointer to array in main
//code as before
long chunk = length/nthreads;
long start = myid * chunk;
long end = start + chunk;
if (myid == nthreads-1) {
end = length;
}
int i;
for (i = start; i < end; i++) {
ap[i] *= s;
}
return NULL;
}
Implementing this program fully is an exercise we leave to the reader. Please note that error handling has been omitted for the sake of brevity.