Dive Into Systems

14.2. Hello Threading! Writing Your First Multithreaded Program

In this section, we examine the ubiquitous POSIX thread library Pthreads. POSIX is an acronym for Portable Operating System Interface. It is an IEEE standard that specifies how UNIX systems look, act, and feel. The POSIX threads API is available on almost all UNIX-like operating systems, each of which meets the standard in its entirety or to some great degree. So, if you write parallel code using POSIX threads on a Linux machine, it will certainly work on other Linux machines, and it will likely work on machines running macOS or other UNIX variants.

Let’s begin by analyzing an example "Hello World" Pthreads program (hellothreads.c). For brevity, we have excluded error handling in the listing, though the downloadable version contains sample error handling.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* The "thread function" passed to pthread_create.  Each thread executes this
 * function and terminates when it returns from this function. */
void *HelloWorld(void *id) {

    /* We know the argument is a pointer to a long, so we cast it from a
     * generic (void *) to a (long *). */
    long *myid = (long *) id;

    printf("Hello world! I am thread %ld\n", *myid);

    return NULL; // We don't need our threads to return anything.
}

int main(int argc, char **argv) {
    int i;
    int nthreads; //number of threads
    pthread_t *thread_array; //pointer to future thread array
    long *thread_ids;

    // Read the number of threads to create from the command line.
    if (argc !=2) {
        fprintf(stderr, "usage: %s <n>\n", argv[0]);
        fprintf(stderr, "where <n> is the number of threads\n");
        return 1;
    }
    nthreads = strtol(argv[1], NULL, 10);

    // Allocate space for thread structs and identifiers.
    thread_array = malloc(nthreads * sizeof(pthread_t));
    thread_ids = malloc(nthreads * sizeof(long));

    // Assign each thread an ID and create all the threads.
    for (i = 0; i < nthreads; i++) {
        thread_ids[i] = i;
        pthread_create(&thread_array[i], NULL, HelloWorld, &thread_ids[i]);
    }

    /* Join all the threads. Main will pause in this loop until all threads
     * have returned from the thread function. */
    for (i = 0; i < nthreads; i++) {
        pthread_join(thread_array[i], NULL);
    }

    free(thread_array);
    free(thread_ids);

    return 0;
}

Let’s examine this program in smaller components.

Notice the inclusion of the pthread.h header file, which declares pthread types and functions.
Next, the HelloWorld function defines the thread function that we later pass to pthread_create. A thread function is analogous to a main function for a worker (created) thread — a thread begins execution at the start of its thread function and terminates when it reaches the end. Each thread executes the thread function using its private execution state (i.e., its own stack memory and register values). Note also that the thread function is of type void*. Specifying an anonymous pointer in this context allows programmers to write thread functions that deal with arguments and return values of different types.
Lastly, in the main function, the main thread initializes the program state before creating and joining the worker threads.

14.2.1. Creating and Joining Threads

The program first starts as a single-threaded process. As it executes the main function, it reads the number of threads to create, and it allocates memory for two arrays: thread_array and thread_ids. The thread_array array contains the set of addresses for each thread created. The thread_ids array stores the set of arguments that each thread is passed. In this example, each thread is passed the address of its rank (or ID, represented by thread_ids[i]).

After all the preliminary variables are allocated and initialized, the main thread executes the two major steps of multithreading:

The creation step, in which the main thread spawns one or more worker threads. After being spawned, each worker thread runs within its own execution context concurrently with the other threads and processes on the system.
The join step, in which the main thread waits for all the workers to complete before proceeding as a single-thread process. Joining a thread that has terminated frees the thread’s execution context and resources. Attempting to join a thread that hasn’t terminated blocks the caller until the thread terminates, similar to the semantics of the wait function for processes.

The Pthreads library offers a pthread_create function for creating threads and a pthread_join function for joining them. The pthread_create function has the following signature:

pthread_create(pthread_t *thread, const pthread_attr_t *attr,
               void *thread_function, void *thread_args)

The function takes a pointer to a thread struct (of type pthread_t), a pointer to an attribute struct (normally set to NULL), the name of the function the thread should execute, and the array of arguments to pass to the thread function when it starts.

The Hello World program calls pthread_create in the main function using:

pthread_create(&thread_array[i], NULL, HelloWorld, &thread_ids[i]);

Here:

&thread_array[i] contains the address of thread i. The pthread_create function allocates a pthread_t thread object and stores its address at this location, enabling the programmer to reference the thread later (e.g., when joining it).
NULL specifies that the thread should be created with default attributes. In most programs, it is safe to leave this second parameter as NULL.
HelloWorld names the thread function that the created thread should execute. This function behaves like the "main" function for the thread. For an arbitrary thread function (e.g., function), its prototype must match the form void * function(void *).
&thread_ids[i] specifies the address of the arguments to be passed to thread i. In this case, thread_ids[i] contains a single long representing the thread’s ID. Since the last argument to pthread_create must be a pointer, we pass the address of the thread’s ID.

To generate several threads that execute the HelloWorld thread function, the program assigns each thread a unique ID and creates each thread within a for loop:

for (i = 0; i < nthreads; i++) {
    thread_ids[i] = i;
    pthread_create(&thread_array[i], NULL, HelloWorld, &thread_ids[i]);
}

The OS schedules the execution of each created thread; the user cannot make any assumption on the order in which the threads will execute.

The pthread_join function suspends the execution of its caller until the thread it references terminates. Its signature is:

pthread_join(pthread_t thread, void **return_val)

The pthread_join takes as input a pthread_t struct, indicating which thread to wait on, and an optional pointer argument that specifies where the thread’s return value should be stored.

The Hello World program calls pthread_join in main using:

pthread_join(thread_array[t], NULL);

This line indicates that the main thread must wait on the termination of thread t. Passing NULL as the second argument indicates that the program does not use the thread’s return value.

In the previous program, main calls pthread_join in a loop because all of the worker threads need to terminate before the main function proceeds to clean up memory and terminate the process:

for (i = 0; i < nthreads; i++) {
    pthread_join(thread_array[i], NULL);
}

14.2.2. The Thread Function

In the previous program, each spawned thread prints out Hello world! I am thread n, where n is the thread’s unique id. After the thread prints out its message, it terminates. Let’s take a closer look at the HelloWorld function:

void *HelloWorld(void *id) {
    long *myid = (long*)id;

    printf("Hello world! I am thread %ld\n", *myid);

    return NULL;
}

Recall that pthread_create passes the arguments to the thread function using the thread_args parameter. In the pthread_create function in main, the Hello World program specified that this parameter is in fact the thread’s ID. Note that the parameter to HelloWorld must be declared as a generic or anonymous pointer (void *). The Pthreads library uses void * to make pthread_create more general purpose by not prescribing a parameter type. As a programmer, the void * is mildly inconvenient given that it must be recast before use. Here, we know the parameter is of type long * because that’s what we passed to pthread_create in main. Thus, we can safely cast the value as a long * and dereference the pointer to access the long value. Many parallel programs follow this structure.

Similar to the thread function’s parameter, the Pthreads library avoids prescribing the thread function’s return type by specifying another void * — the programmer is free to return any pointer from the thread function. If the program needs to access the thread’s return value, it can retrieve it via the second argument to pthread_join. In our example, the thread has no need to return a value, so it simply returns a NULL pointer.

14.2.3. Running the Code

The command that follows shows how to use GCC to compile hellothreads.c. Building a Pthreads application requires that the -pthread linker flag be passed to GCC to ensure that the Pthreads functions and types are accessible:

$ gcc -o hellothreads hellothreads.c -pthread

Running the program without a command line argument results in a usage message:

$ ./hellothreads
usage: ./hellothreads <n>
where <n> is the number of threads

Running the program with four threads yields the following output:

$ ./hellothreads 4
Hello world! I am thread 1
Hello world! I am thread 2
Hello world! I am thread 3
Hello world! I am thread 0

Notice that each thread prints its unique ID number. In this run, thread 1’s output displays first, followed by threads 2, 3, and 0. If we run the program again, we may see the output displayed in a different order:

$ ./hellothreads 4
Hello world! I am thread 0
Hello world! I am thread 1
Hello world! I am thread 2
Hello world! I am thread 3

Recall that the operating system’s scheduler determines the thread execution order. From a user’s perspective, the order is effectively random due to being influenced by many factors that vary outside the user’s control (e.g., available system resources, the system receiving input, or OS scheduling). Since all threads are running concurrently with one another and each thread executes a call to printf (which prints to stdout), the first thread that prints to stdout will have its output show up first. Subsequent executions may (or may not) result in different output.

Thread Execution Order

You should never make any assumptions about the order in which threads will execute. If the correctness of your program requires that threads run in a particular order, you must add synchronization to your program to prevent threads from running when they shouldn’t.

14.2.4. Revisiting Scalar Multiplication

Let’s explore how to create a multithreaded implementation of the scalar multiplication program from the previous section. Recall that our general strategy for parallelizing scalar_multiply is to:

Create multiple threads,
Assign each thread a subset of the input array,
Instruct each thread to multiply the elements in its array subset by s.

The following is a thread function that accomplishes this task. Notice that we have moved array, length, and s to the global scope of the program.

long *array; //allocated in main
long length; //set in main (1 billion)
long nthreads; //number of threads
long s; //scalar

void *scalar_multiply(void *id) {
    long *myid = (long *) id;
    int i;

    //assign each thread its own chunk of elements to process
    long chunk = length / nthreads;
    long start = *myid * chunk;
    long end  = start + chunk;
    if (*myid == nthreads - 1) {
        end = length;
    }

    //perform scalar multiplication on assigned chunk
    for (i = start; i < end; i++) {
        array[i] *= s;
    }

    return NULL;
}

Let’s break this down into parts. Recall that the first step is to assign each thread a component of the array. The following lines accomplish this task:

long chunk = length / nthreads;
long start = *myid * chunk;
long end  = start + chunk;

The variable chunk stores the number of elements that each thread is assigned. To ensure that each thread gets roughly the same amount of work, we first set the chunk size to the number of elements divided by the number of threads, or length / nthreads.

Next, we assign each thread a distinct range of elements to process. Each thread computes its range’s start and end index using the chunk size and its unique thread ID.

For example, with four threads (with IDs 0-3) operating over an array with 100 million elements, each thread is responsible for processing a 25 million element chunk. Incorporating the thread ID assigns each thread a unique subset of the input.

The next two lines account for the case in which length is not evenly divisible by the number of threads:

if (*myid == nthreads - 1) {
    end = length;
}

Suppose that we specified three rather than four threads. The nominal chunk size would be 33,333,333 elements, leaving one element unaccounted for. The code in the previous example would assign the remaining element to the last thread.

Creating balanced input

The chunking code just shown is imperfect. In the case where the number of threads does not evenly divide the input, the remainder is assigned to the last thread. Consider a sample run in which the array has 100 elements, and 12 threads are specified. The nominal chunk size would be 8, and the remainder would be 4. With the example code, the first 11 threads will each have 8 assigned elements, whereas the last thread will be assigned 12 elements. Consequently, the last thread performs 50% more work than the other threads. A potentially better way to chunk this example is to have the first 4 threads process 9 elements each, while the last 8 threads process 8 elements each. This will result in better load balancing of the input across the threads.

With an appropriate local start and end index computed, each thread is now ready to perform scalar multiplication on its component of the array. The last portion of the scalar_multiply function accomplishes this:

for (i = start; i < end; i++) {
    array[i] *= s;
}

14.2.5. Improving Scalar Multiplication: Multiple Arguments

A key weakness of the previous implementation is the wide use of global variables. Our original discussion of global variables showed that although useful, global variables should generally be avoided in C. To reduce the number of global variables in the program, one solution is to declare a t_arg struct as follows in the global scope:

struct t_arg {
    int *array; // pointer to shared array
    long length; // num elements in array
    long s; //scaling factor
    long numthreads; // total number of threads
    long id; //  logical thread id
};

Our main function would, in addition to allocating array and setting local variables length, nthreads, and s (our scaling factor), allocate an array of t_arg records:

long nthreads = strtol(argv[1], NULL, 10); //get number of threads
long length = strtol(argv[2], NULL, 10); //get length of array
long s = strtol( argv[3], NULL, 10 ); //get scaling factor

int *array = malloc(length*sizeof(int));

//allocate space for thread structs and identifiers
pthread_t *thread_array = malloc(nthreads * sizeof(pthread_t));
struct t_arg *thread_args = malloc(nthreads * sizeof(struct t_arg));

//Populate thread arguments for all the threads
for (i = 0; i < nthreads; i++){
    thread_args[i].array = array;
    thread_args[i].length = length;
    thread_args[i].s = s;
    thread_args[i].numthreads = nthreads;
    thread_args[i].id = i;
}

Later in main, when pthread_create is called, the thread’s associated t_args struct is passed as an argument:

for (i = 0; i < nthreads; i++){
    pthread_create(&thread_array[i], NULL, scalar_multiply, &thread_args[i]);
}

Lastly, our scalar_multiply function would look like the following:

void * scalar_multiply(void* args) {
    //cast to a struct t_arg from void*
    struct t_arg * myargs = (struct t_arg *) args;

    //extract all variables from struct
    long myid =  myargs->id;
    long length = myargs->length;
    long s = myargs->s;
    long nthreads = myargs->numthreads;
    int * ap = myargs->array; //pointer to array in main

    //code as before
    long chunk = length/nthreads;
    long start = myid * chunk;
    long end  = start + chunk;
    if (myid == nthreads-1) {
        end = length;
    }

    int i;
    for (i = start; i < end; i++) {
        ap[i] *= s;
    }

    return NULL;
}

Implementing this program fully is an exercise we leave to the reader. Please note that error handling has been omitted for the sake of brevity.