14.3.1. Mutual Exclusion

What is the mutex? The answer is out there, and it’s looking for you, and it will find you if you want it to.

~Trinity, explaining mutexes to Neo (Apologies to The Matrix (The Mutex?))

To fix the data race, let’s use a synchronization construct known as a mutual exclusion lock, or mutex. A mutex is a synchronization primitive that ensures that only one thread enters and executes the code in the critical section at any given time.

Before using a mutex, a program must first:

  1. Declare the mutex in memory that’s shared by threads (often as a global variable).

  2. Initialize the mutex before the threads need to use it (typically in the main function).

The pthreads library defines a pthread_mutex_t type for mutexes. To declare a mutex variable, add the line:

pthread_mutex_t mutex;

To initialize the mutex, use the pthread_mutex_init() function, which takes the address of a mutex and an attribute structure, typically set to NULL:

pthread_mutex_init(&mutex, NULL);
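As an aside, a mutex with static storage duration (such as a global variable) can also be initialized with the PTHREAD_MUTEX_INITIALIZER macro, which gives the same default behavior as calling pthread_mutex_init() with a NULL attribute argument. A one-line sketch, reusing the same variable name as above:

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; //static initialization with default attributes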

When the mutex is no longer needed (typically at the end of the main() function, after pthread_join()), a program should release the mutex structure by invoking the pthread_mutex_destroy() function:

pthread_mutex_destroy(&mutex);

The Mutex: Locked and Loaded

The initial state of a mutex is unlocked, meaning it’s immediately usable by any thread. To enter a critical section, a thread must first acquire a lock. This is accomplished with a call to the pthread_mutex_lock() function. Once a thread has the lock, no other thread can enter the critical section until the thread with the lock releases it. If another thread calls pthread_mutex_lock() and the mutex is already locked, the thread will block (or wait) until the mutex becomes available. Recall that blocking implies that the thread will not be scheduled to use the CPU until the condition it’s waiting for (the mutex being available) becomes true.

When a thread exits the critical section, it must call the pthread_mutex_unlock() function to release the mutex, making it available for another thread. Thus, at most one thread may acquire the lock and enter the critical section at a time, which prevents multiple threads from racing to read and update shared variables.
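Putting the two calls together, the general usage pattern looks like the following minimal sketch (not yet tied to countElems; mutex refers to the variable declared earlier):

pthread_mutex_lock(&mutex);   //acquire the lock (blocks if another thread holds it)

//critical section: code that reads and updates shared variables goes here

pthread_mutex_unlock(&mutex); //release the lock so that another thread may enter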

Having declared and initialized a mutex, the next question is where the lock and unlock functions should be placed to best enforce the critical section. Here is an initial attempt at augmenting the countElems() function with a mutex (the full source can be downloaded as countElems_p_v2.c):

pthread_mutex_t mutex; //global declaration of mutex, initialized in main()

/*parallel version of step 1 of CountSort algorithm (attempt 1 with mutexes):
 * extracts arguments from args value
 * calculates the component of the array that thread is responsible for counting
 * computes the frequency of all the elements in assigned component and stores
 * the associated counts of each element in counts array
*/
void *countElems( void *args ) {
    //extract arguments
    //omitted for brevity
    int *array = myargs->ap;
    long *counts = myargs->countp;

    //assign work to the thread
    long chunk = length / nthreads; //nominal chunk size
    long start = myid * chunk;
    long end = (myid + 1) * chunk;
    long val;
    if (myid == nthreads - 1) {
        end = length;
    }
    long i;

    //heart of the program
    pthread_mutex_lock(&mutex); //acquire the mutex lock
    for (i = start; i < end; i++) {
        val = array[i];
        counts[val] = counts[val] + 1;
    }
    pthread_mutex_unlock(&mutex); //release the mutex lock

    return NULL;
}

The mutex initialization and destruction functions are placed in main() around the thread creation and join loops:

//code snippet from main():

pthread_mutex_init(&mutex, NULL); //initialize the mutex

for (t = 0; t < nthreads; t++) {
    pthread_create( &thread_array[t], NULL, countElems, &thread_args[t] );
}

for (t = 0; t < nthreads; t++) {
    pthread_join(thread_array[t], NULL);
}
pthread_mutex_destroy(&mutex); //destroy (free) the mutex

Let’s recompile and run this new program while varying the number of threads:

$ ./countElems_p_v2 10000000 1 1
Counts array:
999170 1001044 999908 1000431 999998 1001479 999709 997250 1000804 1000207

$ ./countElems_p_v2 10000000 1 2
Counts array:
999170 1001044 999908 1000431 999998 1001479 999709 997250 1000804 1000207

$ ./countElems_p_v2 10000000 1 4
Counts array:
999170 1001044 999908 1000431 999998 1001479 999709 997250 1000804 1000207

Excellent, the output is finally consistent regardless of the number of threads used!

Recall that another primary goal of threading is to reduce the run time of a program as the number of threads increases (i.e., to speed up program execution). Let’s benchmark the performance of the countElems() function. While it may be tempting to use a command line utility like time -p, recall that invoking time -p measures the wall time of the entire program (including the generation of random elements) and not just the running of the countElems() function. In this case, it is better to use a system call like gettimeofday(), which allows a user to accurately measure the wall time of a particular section of code. Benchmarking countElems() on 100 million elements yields the following run times:

$ ./countElems_p_v2 100000000 0 1
Time for Step 1 is 0.368126 s

$ ./countElems_p_v2 100000000 0 2
Time for Step 1 is 0.438357 s

$ ./countElems_p_v2 100000000 0 4
Time for Step 1 is 0.519913 s

Adding more threads causes the program to get slower! This goes against the goal of making programs faster with threads.
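The per-section times shown above can be collected with gettimeofday(). The following is a minimal sketch of such a timing harness, not the actual benchmarking code from countElems_p_v2.c; the variable names and the exact placement around the thread creation and join loops are assumptions:

#include <stdio.h>
#include <sys/time.h>

//snippet that would live inside main():
struct timeval tstart, tend;

gettimeofday(&tstart, NULL); //record the wall time before Step 1 begins
//... create and join the countElems() threads here ...
gettimeofday(&tend, NULL);   //record the wall time after Step 1 completes

//convert the two timestamps (seconds + microseconds) into elapsed seconds
double elapsed = (tend.tv_sec - tstart.tv_sec) +
                 (tend.tv_usec - tstart.tv_usec) / 1000000.0;
printf("Time for Step 1 is %g s\n", elapsed);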

To understand what is going on, consider where the lock and unlock calls are placed in the countElems() function:

//code snippet from the countElems function from earlier
//the heart of the program
pthread_mutex_lock(&mutex); //acquire the mutex lock
for (i = start; i < end; i++){
    val = array[i];
    counts[val] = counts[val] + 1;
}
pthread_mutex_unlock(&mutex); //release the mutex lock

In this example, we placed the lock around the entirety of the for loop. While this placement solves the correctness problems, it’s an extremely poor decision from a performance perspective — the critical section now encompasses the entire loop body. Placing locks in this manner guarantees that only one thread can execute the loop at a time, effectively serializing the program!

The Mutex: Reloaded

Let’s try another approach and place the mutex locking and unlocking functions within every iteration of the loop:

/*modified code snippet of countElems function:
 *locks are now placed INSIDE the for loop!
*/
//the heart of the program
for (i = start; i < end; i++) {
    val = array[i];
    pthread_mutex_lock(&mutex); //acquire the mutex lock
    counts[val] = counts[val] + 1;
    pthread_mutex_unlock(&mutex); //release the mutex lock
}

This may initially look like a better solution because each thread can enter the loop in parallel, serializing only when reaching the lock. The critical section is very small, encompassing only the line counts[val] = counts[val] + 1.

Let’s first perform a correctness check on this version of the program:

$ ./countElems_p_v3 10000000 1 1
Counts array:
999170 1001044 999908 1000431 999998 1001479 999709 997250 1000804 1000207

$ ./countElems_p_v3 10000000 1 2
Counts array:
999170 1001044 999908 1000431 999998 1001479 999709 997250 1000804 1000207

$ ./countElems_p_v3 10000000 1 4
Counts array:
999170 1001044 999908 1000431 999998 1001479 999709 997250 1000804 1000207

So far so good. This version of the program also produces consistent output regardless of the number of threads employed.

Now, let’s look at performance:

$ ./countElems_p_v3 100000000 0 1
Time for Step 1 is 1.92225 s

$ ./countElems_p_v3 100000000 0 2
Time for Step 1 is 10.9704 s

$ ./countElems_p_v3 100000000 0 4
Time for Step 1 is 9.13662 s

Running this version of the code yields (amazingly enough) significantly slower run times!

As it turns out, locking and unlocking a mutex are expensive operations. Recall what was covered in the discussion on function call optimizations: calling a function repeatedly (and needlessly) in a loop can be a major cause of slowdown in a program. In our prior use of mutexes, each thread locks and unlocks the mutex exactly once. In the current solution, each thread locks and unlocks the mutex n/t times, where n is the size of the array, t is the number of threads, and n/t is the size of the array component assigned to each particular thread. As a result, the cost of the additional mutex operations slows down the loop’s execution considerably.
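For example, in the 100-million-element benchmark above with four threads, each thread performs n/t = 100,000,000 / 4 = 25,000,000 lock/unlock pairs, rather than the single pair per thread in the previous version.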

The Mutex: Revisited

In addition to protecting the critical section to achieve correct behavior, an ideal solution would:

  • Use the lock and unlock functions as little as possible.

  • Reduce the critical section to the smallest possible size.

The first attempt (locking around the entire loop) satisfies the first bullet, whereas the second attempt (locking on every iteration) accomplishes the second. At first glance, it appears that the two bullets are incompatible with each other. Is there a way to actually accomplish both (and, while we are at it, speed up the execution of our program)?

For the next attempt, each thread maintains a private, local array of counts on its stack. Because the array is local to each thread, a thread can access it without locking — there’s no risk of a race condition on data that isn’t shared between threads. Each thread processes its assigned subset of the shared array and populates its local counts array. After counting up all the values within its subset, each thread:

  1. Locks the shared mutex (entering a critical section).

  2. Adds the values from its local counts array to the shared counts array.

  3. Unlocks the shared mutex (exiting the critical section).

Restricting each thread to update the shared counts array only once significantly reduces the contention for shared variables and minimizes expensive mutex operations.

Below is our revised countElems() function (the full source code for this final program is available as countElems_p_v3.c):

/*parallel version of step 1 of CountSort algorithm (final attempt with mutexes):
 * extracts arguments from args value
 * calculates the component of the array that thread is responsible for counting
 * computes the frequency of all the elements in assigned component and stores
 * the associated counts of each element in counts array
*/
void *countElems( void *args ) {
    //extract arguments
    //omitted for brevity
    int *array = myargs->ap;
    long *counts = myargs->countp;

    //local declaration of counts array, which initializes every element to zero.
    long local_counts[MAX] = {0};

    //assign work to the thread
    long chunk = length / nthreads; //nominal chunk size
    long start = myid * chunk;
    long end = (myid + 1) * chunk;
    long val;
    if (myid == nthreads-1)
        end = length;

    long i;

    //heart of the program
    for (i = start; i < end; i++) {
        val = array[i];
        local_counts[val] = local_counts[val] + 1; //updates local counts array
    }

    //update to global counts array
    pthread_mutex_lock(&mutex); //acquire the mutex lock
    for (i = 0; i < MAX; i++) {
        counts[i] += local_counts[i];
    }
    pthread_mutex_unlock(&mutex); //release the mutex lock

    return NULL;
}

This version has a few additional features:

  • The presence of local_counts, an array that is private to the scope of each thread (i.e., allocated in the thread’s stack). Like counts, local_counts contains MAX elements, since MAX is the maximum value of elements in our input array.

  • Each thread makes updates to local_counts at its own pace, without any contention for shared variables.

  • A single call to pthread_mutex_lock() protects each thread’s update to the global counts array, which happens only once at the end of each thread’s execution.

In this manner, we reduce the time each thread spends in a critical section to just updating the shared counts array. Even though only one thread can enter the critical section at a time, the time each thread spends there is proportional to MAX, not n, the length of the global array. Since MAX is much less than n, we should see an improvement in performance.
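To make this concrete with the runs shown earlier, the counts array has just 10 buckets (the printed counts arrays contain 10 entries), while n is 100 million, so each thread’s critical section updates only 10 shared elements rather than looping over tens of millions.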

Let’s now benchmark this version of our code:

$ ./countElems_p_v3 100000000 0 1
Time for Step 1 is 0.334574 s

$ ./countElems_p_v3 100000000 0 2
Time for Step 1 is 0.209347 s

$ ./countElems_p_v3 100000000 0 4
Time for Step 1 is 0.130745 s

Wow, what a difference! Our program not only computes the correct answers, but also executes faster as we increase the number of threads.

The lesson to take away here is this: to efficiently minimize a critical section, use local variables to collect intermediate values. After the hard work requiring parallelization is over, use a mutex to safely update any shared variable(s).

Deadlock

In some programs, waiting threads have dependencies on each other. A situation called deadlock can arise when multiple synchronization constructs like mutexes are incorrectly applied. A deadlocked thread is blocked from execution by another thread, which is itself blocked on a blocked thread. Gridlock (where cars in all directions cannot move forward because they are blocked by other cars) is a common real-world example of deadlock that occurs at busy city intersections.

To illustrate a deadlock scenario in code, let’s consider an example where multi-threading is used to implement a banking application. Each user’s account is defined by a balance and its own mutex (ensuring that no race conditions can occur when updating the balance):

struct account {
    pthread_mutex_t lock;
    int balance;
};

Consider the following naive implementation of a Transfer() function, where money moves from one bank account to another:

void *Transfer(void *args){
    //argument passing removed to increase readability
    //...

    pthread_mutex_lock(&fromAcct->lock);
    pthread_mutex_lock(&toAcct->lock);

    fromAcct->balance -= amt;
    toAcct->balance += amt;

    pthread_mutex_unlock(&fromAcct->lock);
    pthread_mutex_unlock(&toAcct->lock);

    return NULL;
}

Suppose Threads 0 and 1 are executing concurrently and represent users A and B respectively. Now consider the situation where A and B want to transfer money to each other: A wants to transfer 20 bucks to B, while B wants to transfer 40 bucks to A.

Figure 1. An example of deadlock: two threads deadlocked with each other

In the path of execution highlighted by Figure 1, both threads concurrently execute the Transfer() function. Thread 0 acquires the lock of acctA while Thread 1 acquires the lock of acctB. Now consider what happens. To continue executing, Thread 0 needs to acquire the lock on acctB, which Thread 1 holds. Likewise, Thread 1 needs to acquire the lock on acctA to continue executing, which Thread 0 holds. Since both threads are blocked on each other, they are in deadlock.

While the OS provides some protection against deadlock, programmers should be careful to avoid writing code that increases the likelihood of deadlock. For example, the above scenario could have been avoided by rearranging the locks so that each lock/unlock pair surrounds only the balance update statement associated with it:

void *Transfer(void *args){
    //argument passing removed to increase readability
    //...

    pthread_mutex_lock(&fromAcct->lock);
    fromAcct->balance -= amt;
    pthread_mutex_unlock(&fromAcct->lock);

    pthread_mutex_lock(&toAcct->lock);
    toAcct->balance += amt;
    pthread_mutex_unlock(&toAcct->lock);

    return NULL;
}
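Another common way to avoid this particular deadlock, offered here only as a sketch rather than as the fix shown above, is to have every thread acquire the two account locks in a consistent global order (for example, by comparing the accounts' addresses). This keeps the whole transfer inside one critical section while guaranteeing that two opposing transfers can never each hold one lock while waiting for the other. The function name TransferOrdered and the address comparison below are assumptions for illustration, not part of the original code:

void *TransferOrdered(void *args){
    //argument passing removed to increase readability
    //...

    //pick a fixed locking order: lower address first
    //(a real implementation would also handle the case fromAcct == toAcct)
    struct account *first = fromAcct;
    struct account *second = toAcct;
    if (second < first) {
        first = toAcct;
        second = fromAcct;
    }

    pthread_mutex_lock(&first->lock);  //both threads contend for the same lock first
    pthread_mutex_lock(&second->lock);

    fromAcct->balance -= amt;
    toAcct->balance += amt;

    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);

    return NULL;
}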

Deadlock is not a situation that is unique to threads. Processes (especially those that communicate with one another) can also deadlock. Programmers should be mindful of the synchronization primitives they use, and the consequences of using them incorrectly.