Dive Into Systems

14.3. Synchronizing Threads

In the examples we’ve looked at thus far, each thread executes without sharing data with any other threads. In the scalar multiplication program, for instance, each element of the array is entirely independent of all the others, making it unnecessary for the threads to share data.

However, a thread’s ability to easily share data with other threads is one of its main features. Recall that all the threads of a multithreaded process share the heap common to the process. In this section, we study the data sharing and protection mechanisms available to threads in detail.

Thread synchronization refers to forcing threads to execute in a particular order. Even though synchronizing threads can add to the runtime of a program, it is often necessary to ensure program correctness. In this section, we primarily discuss how one synchronization construct (a mutex) helps ensure the correctness of a threaded program. We conclude the section with a discussion of some other common synchronization constructs: semaphores, barriers and condition variables.

CountSort

Let’s study a slightly more complicated example called CountSort. The CountSort algorithm is a simple linear (O(N)) sorting algorithm for sorting a known small range of R values, where R is much smaller than N. To illustrate how CountSort works, consider an array A of 15 elements, all of which contain random values between 0 and 9 (10 possible values):

A = [9, 0, 2, 7, 9, 0, 1, 4, 2, 2, 4, 5, 0, 9, 1]

For a particular array, CountSort works as follows:

It counts the frequency of each value in the array.
It overwrites the original array by enumerating each value by its frequency.

After step 1, the frequency of each value is placed in a counts array of length 10, where the value of counts[i] is the frequency of the value i in array A. For example, since there are three elements with value 2 in array A, counts[2] is 3.

The corresponding counts array for the previous example looks like the following:

counts = [3, 2, 3, 0, 2, 1, 0, 1, 0, 3]

Note that the sum of all the elements in the counts array is equal to the length of A, or 15.

Step 2 uses the counts array to overwrite A, using the frequency counts to determine the set of indices in A that store each consecutive value in sorted order. So, since the counts array indicates that there are three elements with value 0 and two elements with value 1 in array A, the first three elements of the final array will be 0, and the next two will be 1.

After running step 2, the final array looks like the following:

A = [0, 0, 0, 1, 1, 2, 2, 2, 4, 4, 5, 7, 9, 9, 9]

Below is a serial implementation of the CountSort algorithm, with the count (step 1) and overwrite (step 2) functions clearly delineated. For brevity, we do not reproduce the whole program here, though you can download the source (countSort.c).

#define MAX 10 //the maximum value of an element. (10 means 0-9)

/*step 1:
 * compute the frequency of all the elements in the input array and store
 * the associated counts of each element in array counts. The elements in the
 * counts array are initialized to zero prior to the call to this function.
*/
void countElems(int *counts, int *array_A, long length) {
    int val, i;
    for (i = 0; i < length; i++) {
      val = array_A[i]; //read the value at index i
      counts[val] = counts[val] + 1; //update corresponding location in counts
    }
}

/* step 2:
 * overwrite the input array (array_A) using the frequencies stored in the
 *  array counts
*/
void writeArray(int *counts, int *array_A) {
    int i, j = 0, amt;

    for (i = 0; i < MAX; i++) { //iterate over the counts array
        amt = counts[i]; //capture frequency of element i
        while (amt > 0) { //while all values aren't written
            array_A[j] = i; //replace value at index j of array_A with i
            j++; //go to next position in array_A
            amt--; //decrease the amount written by 1
        }
    }
}

/* main function:
 * gets array length from command line args, allocates a random array of that
 * size, allocates the counts array, the executes step 1 of the CountSort
 * algorithm (countsElem) followed by step 2 (writeArray).
*/
int main( int argc, char **argv ) {
    //code ommitted for brevity -- download source to view full file

    srand(10); //use of static seed ensures the output is the same every run

    long length = strtol( argv[1], NULL, 10 );
    int verbose = atoi(argv[2]);

    //generate random array of elements of specified length
    int *array = malloc(length * sizeof(int));
    genRandomArray(array, length);

    //print unsorted array (commented out)
    //printArray(array, length);

    //allocate counts array and initializes all elements to zero.
    int counts[MAX] = {0};

    countElems(counts, array, length); //calls step 1
    writeArray(counts, array); //calls step2

    //print sorted array (commented out)
    //printArray(array, length);

    free(array); //free memory

    return 0;
}

Running this program on an array of size 15 yields the following output:

$ ./countSort 15 1
array before sort:
5 8 8 5 8 7 5 1 7 7 3 3 8 3 4
result after sort:
1 3 3 3 4 5 5 5 7 7 7 8 8 8 8

The second parameter to this program is a verbose flag, which indicates whether the program prints output. This is a useful option for larger arrays for which we may want to run the program but not necessarily print out the output.

Parallelizing countElems: An Initial Attempt

CountSort consists of two primary steps, each of which benefits from being parallelized. In the remainder of the chapter, we primarily concentrate on the parallelization of step 1, or the countElems function. Parallelizing the writeArray function is left as an exercise for the reader.

The code block that follows depicts a first attempt at creating a threaded countElems function. Parts of the code (argument parsing, error handling) are omitted in this example for the sake of brevity, but the full source can be downloaded here (countElems_p.c). In the code that follows, each thread attempts to count the frequency of the array elements in its assigned component of the global array and updates a global count array with the discovered counts:

/*parallel version of step 1 (first cut) of CountSort algorithm:
 * extracts arguments from args value
 * calculates the portion of the array that thread is responsible for counting
 * computes the frequency of all the elements in assigned component and stores
 * the associated counts of each element in counts array
*/
void *countElems( void *args ) {
    struct t_arg * myargs = (struct t_arg *)args;
    //extract arguments (omitted for brevity)
    int *array = myargs->ap;
    long *counts = myargs->countp;
    //... (get nthreads, length, myid)

    //assign work to the thread
    long chunk = length / nthreads; //nominal chunk size
    long start = myid * chunk;
    long end = (myid + 1) * chunk;
    long val;
    if (myid == nthreads-1) {
        end = length;
    }

    long i;
    //heart of the program
    for (i = start; i < end; i++) {
        val = array[i];
        counts[val] = counts[val] + 1;
    }

    return NULL;
}

The main function looks nearly identical to our earlier sample programs:

int main(int argc, char **argv) {

    if (argc != 4) {
        //print out usage info (ommitted for brevity)
        return 1;
    }

    srand(10); //static seed to assist in correctness check

    //parse command line arguments
    long t;
    long length = strtol(argv[1], NULL, 10);
    int verbose = atoi(argv[2]);
    long nthreads = strtol(argv[3], NULL, 10);

    //generate random array of elements of specified length
    int *array = malloc(length * sizeof(int));
    genRandomArray(array, length);

    //specify counts array and initialize all elements to zero
    long counts[MAX] = {0};

    //allocate threads and args array
    pthread_t *thread_array; //pointer to future thread array
    thread_array = malloc(nthreads * sizeof(pthread_t)); //allocate the array
    struct t_arg *thread_args = malloc( nthreads * sizeof(struct t_arg) );

    //fill thread array with parameters
    for (t = 0; t < nthreads; t++) {
        //ommitted for brevity...
    }

    for (t = 0; t < nthreads; t++) {
        pthread_create(&thread_array[t], NULL, countElems, &thread_args[t]);
    }

    for (t = 0; t < nthreads; t++) {
        pthread_join(thread_array[t], NULL);
    }

    free(thread_array);
    free(array);

    if (verbose) {
        printf("Counts array:\n");
        printCounts(counts);
    }
    return 0;
}

For reproducibility purposes, the random number generator is seeded with a static value (10) to ensure that array (and therefore counts) always contains the same set of numbers. An additional function (printCounts) prints out the contents of the global counts array. The expectation is that, regardless of the number of threads used, the contents of the counts array should always be the same. For brevity, error handling has been removed from the listing.

Compiling the program and running it with one, two, and four threads over 10 million elements produces the following:

$ gcc -o countElems_p countElems_p.c -pthread

$./countElems_p 10000000 1 1
Counts array:
999170 1001044 999908 1000431 999998 1001479 999709 997250 1000804 1000207

$./countElems_p 10000000 1 2
Counts array:
661756 661977 657828 658479 657913 659308 658561 656879 658070 657276

$./countElems_p 10000000 1 4
Counts array:
579846 580814 580122 579772 582509 582713 582518 580917 581963 581094

Note that the printed results change significantly on each run. In particular, they seem to change as we vary the number of threads! This should not happen, since our use of the static seed guarantees the same set of numbers every run. These results contradict one of the cardinal rules for threaded programs: the output of a program should be correct and consistent regardless of the number of threads used.

Since our first attempt at parallelizing countElems doesn’t seem to be working, let’s delve deeper into what this program is doing and examine how we might fix it.

Data Races

To understand what’s going on, let’s consider an example run with two threads on two separate cores of a multicore system. Recall that the execution of any thread can be preempted at any time by the OS, which means that each thread could be running different instructions of a particular function at any given time (or possibly the same instruction). Table 1 shows one possible path of execution through the countElems function. To better illustrate what is going on, we translated the line counts[val] = counts[val] + 1 into the following sequence of equivalent instructions:

Read counts[val] and place into a register.
Modify the register by incrementing it by one.
Write the contents of the register to counts[val].

This is known as the read-modify-write pattern. In the example shown in Table 1, each thread executes on a separate core (Thread 0 on Core 0, Thread 1 on Core 1). We start inspecting the execution of the process at time step i, where both threads have a val of 1.

Table 1. A Possible Execution Sequence of Two Threads Running countElems
Time	Thread 0	Thread 1
i	Read counts[1] and place into Core 0’s register	…
i+1	Increment register by 1	Read counts[1] and place into Core 1’s register
i+2	Overwrite counts[1] with contents of register	Increment register by 1
i+3	…	Overwrite counts[1] with contents of register

Suppose that, prior to the execution sequence in Table 1, counts[1] contains the value 60. In time step i, Thread 0 reads counts[1] and places the value 60 in Core 0’s register. In time step i+1, while Thread 0 increments Core 0’s register by one, the current value in counts[1] (60) is read into Core 1’s register by Thread 1. In time step i+2, Thread 0 updates counts[1] with the value 61 while Thread 1 increments the value stored in its local register (60) by one. The end result is that during time step i+3, the value counts[1] is overwritten by Thread 1 with the value 61, not 62 as we would expect! This causes counts[1] to essentially "lose" an increment!

We refer to the scenario in which two threads attempt to write to the same location in memory as a data race condition. More generally, a race condition refers to any scenario in which the simultaneous execution of two operations gives an incorrect result. Note that a simultaneous read of the counts[1] location would not in and of itself constitute a race condition, because values can generally read alone from memory without issue. It was the combination of this step with the writes to counts[1] that caused the incorrect result. This read-modify-write pattern is a common source of a particular type of race condition, called a data race, in most threaded programs. In our discussion of race conditions and how to fix them, we focus on data races.

Atomic operations

An operation is defined as being atomic if a thread perceives it as executing without interruption (in other words, as an "all or nothing" action). In some libraries, a keyword or type is used to specify that a block of computation should be treated as being atomic. In the previous example, the line counts[val] = counts[val] + 1 (even if written as counts[val]++) is not atomic, because this line actually corresponds to several instructions at the machine level. A synchronization construct like mutual exclusion is needed to ensure that there are no data races. In general, all operations should be assumed to be nonatomic unless mutual exclusion is explicitly enforced.

Keep in mind that not all execution sequences of the two threads cause a race condition. Consider the sample execution sequence of Threads 0 and 1 in Table 2.

Table 2. Another Possible Execution Sequence of Two Threads Running countElems
Time	Thread 0	Thread 1
i	Read counts[1] and place into Core 0’s register	…
i+1	Increment register by 1	…
i+2	Overwrite counts[1] with contents of register	…
i+3	…	Read counts[1] and place into Core 1’s register
i+4	…	Increment register by 1
i+5	…	Overwrite counts[1] with contents of register

In this execution sequence, Thread 1 does not read from counts[1] until after Thread 0 updates it with its new value (61). The end result is that Thread 1 reads the value 61 from counts[1] and places it into Core 1’s register during time step i+3, and writes the value 62 to counts[1] in time step i+5.

To fix a data race, we must first isolate the critical section, or the subset of code that must execute atomically (in isolation) to ensure correct behavior. In threaded programs, blocks of code that update a shared resource are typically identified to be critical sections.

In the countElems function, updates to the counts array should be put in a critical section to ensure that values are not lost due to multiple threads updating the same location in memory:

long i;
for (i = start; i < end; i++) {
    val = array[i];
    counts[val] = counts[val] + 1; //this line needs to be protected
}

Since the fundamental problem in countElems is the simultaneous access of counts by multiple threads, a mechanism is needed to ensure that only one thread executes within the critical section at a time. Using a synchronization construct (like a mutex, which is covered in the next section) will force the threads to enter the critical section sequentially.