14.2. Hello Threading! Writing your First Multi-threaded Program

In this section, we examine the ubiquitous POSIX thread library, or Pthreads. POSIX is an acronym for Portable Operating System Interface, and is an IEEE standard that sets how UNIX systems look, act, and feel. The POSIX threads API is available on almost all UNIX-like operating systems, each of which meets the standard in its entirety or to some great degree. So, if you write parallel code using POSIX threads on a Linux machine, it will certainly work on other Linux machines, and it will likely work on machines running OS X or other UNIX variants.

Let’s begin by analyzing an example "Hello World" Pthreads program (hellothreads.c). For brevity, we have excluded error handling in the listing, though the attached code contains sample error handling.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* The "thread function" passed to pthread_create.  Each thread executes this
 * function and terminates when it returns from this function. */
void *HelloWorld(void *id) {

    /* We know the argument is a pointer to a long, so we cast it from a
     * generic (void *) to a (long *). */
    long *myid = (long *) id;

    printf("Hello world! I am thread %ld\n", *myid);

    return NULL; // We don't need our threads to return anything.
}

int main(int argc, char **argv) {
    int i;
    int nthreads; //number of threads
    pthread_t *thread_array; //pointer to future thread array
    long *thread_ids;

    // Read the number of threads to create from the command line.
    if (argc !=2) {
        fprintf(stderr, "usage: %s <n>\n", argv[0]);
        fprintf(stderr, "where <n> is the number of threads\n");
        return 1;
    }
    nthreads = strtol(argv[1], NULL, 10);

    // Allocate space for thread structs and identifiers.
    thread_array = malloc(nthreads * sizeof(pthread_t));
    thread_ids = malloc(nthreads * sizeof(long));

    // Assign each thread an ID and create all the threads.
    for (i = 0; i < nthreads; i++) {
        thread_ids[i] = i;
        pthread_create(&thread_array[i], NULL, HelloWorld, &thread_ids[i]);
    }

    /* Join all the threads. Main will pause in this loop until all the threads
     * have returned from the thread function. */
    for (i = 0; i < nthreads; i++) {
        pthread_join(thread_array[i], NULL);
    }

    free(thread_array);
    free(thread_ids);

    return 0;
}

Let’s examine this program in smaller components:

  • Notice the inclusion of the pthread.h header file, which declares pthread types and functions.

  • Next, the HelloWorld() function defines the thread function that we later pass to pthread_create(). A thread function is analogous to a main() function for a worker (created) thread — a thread begins execution at the start of its thread function and terminates when it reaches the end. Each thread executes the thread function using its private execution state (i.e., its own stack memory and register values). Note also that the thread function is of type void*. Specifying an anonymous pointer in this context allows programmers to write thread functions that deal with arguments and return values of different types.

  • Lastly, in the main() function, the main thread initializes the program state before creating and joining the worker threads.

14.2.1. Creating and Joining Threads

The program first starts as a single-threaded processes. As it executes the main() function, it reads the number of threads to create and allocates memory for two arrays: thread_array and thread_ids. The thread_array array contains the set of addresses for each thread created. The thread_ids array stores the set of arguments that each thread is passed. In this example, each thread is passed the address of its rank (or ID, represented by thread_ids[i]).

After all the preliminary variables are allocated and initialized, the main thread executes the two major steps of multi-threading:

  • The creation step, in which the main thread spawns one or more worker threads. After being spawned, each worker thread runs within its own execution context concurrently with the other threads and processes on the system.

  • The join step, in which the main thread waits for all of the workers to complete before proceeding as a single-thread process. Joining a thread that has terminated frees the thread’s execution context and resources. Attempting to join a thread that hasn’t terminated blocks the caller until the thread terminates, similar to the semantics of the wait() function for processes.

The Pthreads library offers a pthread_create() function for creating threads and a pthread_join() function for joining them. The pthread_create() function has the following signature:

pthread_create(pthread_t *thread, const pthread_attr_t *attr,
               void *thread_function, void *thread_args)

The function takes a pointer to a thread struct (of type pthread_t), an attribute struct (normally set to NULL), the name of the function the thread should execute, and the array of arguments to pass to thread function when it starts.

The Hello World program calls pthread_create() in the main() function using:

pthread_create(&thread_array[i], NULL, HelloWorld, &thread_ids[i]);
  • &thread_array[i] contains the address of thread i. The pthread_create() function allocates a pthread_t thread object and stores its address at this location, enabling the programmer to reference the thread later (e.g., when joining it).

  • NULL specifies that the thread should be created with default attributes. In most programs, it is safe to leave this second parameter as NULL.

  • HelloWorld names the thread function that the created thread should execute. This function behaves like the "main" function for the thread. For an arbitrary thread function (function()), its prototype must match the form void * function(void *).

  • &thread_ids[i] specifies the address of the arguments to be passed to thread i. In this case, thread_ids[i] contains a single long representing the thread’s ID. Since the last argument to pthread_create() must be a pointer, we pass the address of the thread’s ID.

To generate several threads that execute the HelloWorld thread function, the program assigns each thread a unique ID and creates each thread within a for-loop:

for (i = 0; i < nthreads; i++) {
    thread_ids[i] = i;
    pthread_create(&thread_array[i], NULL, HelloWorld, &thread_ids[i]);
}

The operating system schedules the execution of each created thread; the user cannot make any assumption on the order that the threads will execute.

The pthread_join() function suspends the execution of its caller until the the thread its references terminates. Its signature is:

pthread_join(pthread_t thread, void **return_val)

The pthread_join() takes as input a pthread_t struct, indicating which thread to wait on, and an optional pointer argument that specifies where the thread’s return value should be stored.

The Hello World program calls pthread_join() in main() using:

pthread_join(thread_array[t], NULL);

This line indicates that the main thread must wait on the termination of thread t. Passing NULL as the second argument indicates that the program does not use the thread’s return value.

In the above program, main() calls pthread_join() in a loop because all of the worker threads need to terminate before the main() function proceeds to clean up memory and terminate the process:

for (i = 0; i < nthreads; i++) {
    pthread_join(thread_array[i], NULL);
}

14.2.2. The Thread Function

In the program above, each spawned thread prints out Hello World! I am thread n where n is the thread’s unique id. After the thread prints out its message, it terminates. Let’s take a closer look at the HelloWorld() function:

void *HelloWorld(void *id) {
    long *myid = (long*)id;

    printf("Hello world! I am thread %ld\n", *myid);

    return NULL;
}

Recall that pthread_create() passes the arguments to the thread function using the thread_args parameter. In the pthread_create() function in main(), the Hello World program specified that this parameter is in fact the thread’s ID. Note that the parameter to HelloWorld() must be declared as a generic or anonymous pointer (void *). The pthreads library uses void * to make pthread_create() more general-purpose by not prescribing a parameter type. As a programmer, the void * is mildly inconvenient, since it must be recast before use. Here, we know the parameter is of type long * because that’s what we passed to pthread_create() in main(). Thus, we can safely cast the value as a long * and dereference the pointer to access the long value. Many parallel programs follow the above structure.

Similar to the thread function’s parameter, the Pthreads library avoids prescribing the thread function’s return type by specifying another void * — the programmer is free to return any pointer from the thread function. If the program needs to access the thread’s return value, it can retrieve it via the second argument to pthread_join(). In our example, the thread has no need to return a value, so it simply returns a NULL pointer.

14.2.3. Running the Code

The command below shows how to use gcc to compile hellothreads.c. Building a Pthreads application requires that the -lpthread linker flag be passed to gcc to ensure that the Pthreads functions and types are accessible:

$ gcc -o hellothreads hellothreads.c -lpthread

Running the program without a command-line argument results in a usage message:

$ ./hellothreads
usage: ./hellothreads <n>
where <n> is the number of threads

Running the program with four threads yields the following output:

$ ./hellothreads 4
Hello world! I am thread 1
Hello world! I am thread 2
Hello world! I am thread 3
Hello world! I am thread 0

Notice that each thread prints its unique ID number. In this run, thread 1’s output displays first, followed by threads 2, 3 and 0. If we run the program again, we may see the output displayed in a different order:

$ ./hellothreads 4
Hello world! I am thread 0
Hello world! I am thread 1
Hello world! I am thread 2
Hello world! I am thread 3

Recall that the operating system’s scheduler determines the thread execution order. From a user’s perspective, the order is effectively random due to being influenced by many factors that vary outside of a user’s control (e.g., available system resources, the system receiving input, or O.S. scheduling). Since all threads are running concurrently with each other and each thread executes a call to printf() (which prints to stdout), the first thread that prints to stdout will have its output show up first. Subsequent executions may (or may not) result in different output.

Thread Execution Order

You should never make any assumptions about the order in which threads will execute. If the correctness of your program requires that threads run in a particular order, you must add synchronization to your program to prevent threads from running when they shouldn’t.

14.2.4. Revisiting Scalar Multiplication

Let’s explore how to create a multi-threaded implementation of the scalar multiplication program from the previous section. Recall that our general strategy for parallelizing scalar_multiply() is to:

  • Create multiple threads.

  • Assign each thread a subset of the input array.

  • Instruct each thread to multiply the elements in its array subset by s.

Below is a thread function that accomplishes this task. Notice that we have moved array, length and s to the global scope of the program:

long *array; //allocated in main
long length; //set in main (1 billion)
long nthreads; //number of threads
long s; //scalar

void *scalar_multiply(void *id) {
    long myid = (long) id;
    int i;

    //assign each thread its own chunk of elements to process
    long chunk = length / nthreads;
    long start = myid * chunk;
    long end  = start + chunk;
    if (myid == nthreads - 1) {
        end = length;
    }

    //perform scalar multiplication on assigned chunk
    for (i = start; i < end; i++) {
        array[i] *= s;
    }

    return NULL;
}

Let’s break this down into parts. Recall that the first step is to assign each thread a component of the array. The following lines accomplish this task:

long chunk = length / nthreads;
long start = myid * chunk;
long end  = start + chunk;

The variable chunk, stores the number of elements that each thread is assigned. To ensure each thread gets roughly the same amount of work, we first set the chunk size to the number of elements divided by the number of threads, or length / nthreads.

Next, we assign each thread a distinct range of elements to process. Each thread computes its range’s start and end index using the chunk size and its unique thread ID.

For example, with four threads (with IDs 0 .. 3) operating over an array with 100 million elements, each thread is responsible for processing a 25 million element chunk. Incorporating the thread ID assigns each thread a unique subset of the input.

The next two lines account for the case where length is not evenly divisible by the number of threads:

if (myid == nthreads - 1) {
    end = length;
}

Consider if instead of four threads, we specified three. The nominal chunk size would be 33,333,333 elements, leaving one element unaccounted for. The above code would assign the remaining element to the last thread.

Creating balanced input

The chunking code above is imperfect. In the case where the number of threads does not evenly divide the input, the remainder is assigned to the last thread. Consider a sample run where the array has 100 elements and 12 threads are specified. The nominal chunk size would be 8, and the remainder would be 4. With the above code, the first eleven threads will each have 8 assigned elements, while the last thread will be assigned 12 elements. Consequently, the last thread performs 50% more work than the other threads. A potentially better way to chunk this example is to have the first four threads process 9 elements each, while the last eight threads process 8 elements each. This will result in better load balancing of the input across the threads.

With an appropriate local start and end index computed, each thread is now ready to perform scalar multiplication on its component of the array. The last portion of the scalar_multiply() function accomplishes this:

for (i = start; i < end; i++) {
    array[i] *= s;
}

14.2.5. Improving Scalar Multiplication: Multiple Arguments

A key weakness of the above implementation is the wide use of global variables. Our original discussion of global variables showed that while useful, global variables should be generally avoided in C. To reduce the number of global variables in the program, one solution is to declare a t_arg struct as follows in the global scope:

struct t_arg {
    int *array; // pointer to shared array
    long length; // num elements in array
    long s; //scaling factor
    long numthreads; // total number of threads
    long id; //  logical thread id
};

Our main function would, in addition to allocating array and setting local variables length, nthreads, and s (our scaling factor), allocate an array of t_arg records:

long nthreads = strtol(argv[1], NULL, 10); //get number of threads
long length = strtol(argv[2], NULL, 10); //get length of array
long s = strtol( argv[3], NULL, 10 ); //get scaling factor

int *array = malloc(length*sizeof(int));

//allocate space for thread structs and identifiers
pthread_t *thread_array = malloc(nthreads * sizeof(pthread_t));
struct t_arg *thread_args = malloc(nthreads * sizeof(struct t_arg));

//Populate thread arguments for all the threads
for (i = 0; i < nthreads; i++){
    thread_args[i].array = array;
    thread_args[i].length = length;
    thread_args[i].s = s;
    thread_args[i].numthreads = nthreads;
    thread_args[i].id = i;
}

Later in main() when pthread_create() is called, the thread’s associated t_args struct is passed as an argument:

for (i = 0; i < nthreads; i++){
    pthread_create(&thread_array[i], NULL, scalar_multiply, &thread_args[i]);
}

Lastly, our scalar_multiply function would look like the following:

void * scalar_multiply(void* args) {
    //cast to a struct t_arg from void*
    struct t_arg * myargs = (struct t_arg *) args;

    //extract all variables from struct
    long myid =  myargs->id;
    long length = myargs->length;
    long s = myargs->s;
    long nthreads = myargs->numthreads;
    int * ap = myargs->array; //pointer to array in main

    //code as before
    long chunk = length/nthreads;
    long start = myid * chunk;
    long end  = start + chunk;
    if (myid == nthreads-1) {
        end = length;
    }

    int i;
    for (i = start; i < end; i++) {
        ap[i] *= s;
    }

    return NULL;
}

Implementing this program fully is an exercise we leave to the reader below. Please note that error handling is omitted for the sake of brevity.