14.5. Cache Coherence and False Sharing

Multicore caches can have profound implications for a multithreaded program’s performance. First, however, let’s quickly review some of the basic concepts related to cache design:

  • Data/instructions are not transported individually to the cache. Instead, data is transferred in blocks, and block sizes tend to get larger at lower levels of the memory hierarchy.

  • Each cache is organized into a series of sets, with each set having a number of lines. Each line holds a single block of data.

  • The individual bits of a memory address determine the set, tag, and block offset at which a block of data is placed in the cache (the sketch following this list shows one way an address splits into these fields).

  • A cache hit occurs when the desired data block exists in the cache. Otherwise, a cache miss occurs, and a lookup is performed on the next lower level of the memory hierarchy (which can be cache or main memory).

  • The valid bit indicates whether the block at a particular line in the cache is safe to use. If the valid bit is set to 0, the data block at that line cannot be used (e.g., it could contain data from an exited process).

  • Information is written to cache/memory based on two main strategies. In the write-through strategy, the data is written to cache and main memory simultaneously. In the write-back strategy, data is only written to cache and gets written to lower levels in the hierarchy once the block is evicted from the cache.
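
As promised above, here is a small sketch of how an address might be split into its tag, set index, and block offset fields. The cache geometry used (64-byte blocks, 64 sets) is an assumed example, not a description of any particular machine; because both values are powers of two, the modulo and division operations isolate the corresponding bit fields.

/* Sketch: splitting a memory address into tag, set index, and block offset.
 * The geometry below (64-byte blocks, 64 sets) is a made-up example. */
#include <stdio.h>
#include <stdint.h>

#define BLOCK_SIZE 64   /* bytes per block -> 6 offset bits */
#define NUM_SETS   64   /* sets in the cache -> 6 index bits */

int main(void) {
    uintptr_t addr = 0x7f3a1c94;   /* an arbitrary address */

    uintptr_t offset = addr % BLOCK_SIZE;              /* lowest-order bits  */
    uintptr_t set    = (addr / BLOCK_SIZE) % NUM_SETS; /* middle bits        */
    uintptr_t tag    = addr / (BLOCK_SIZE * NUM_SETS); /* remaining high bits */

    printf("address %#lx -> tag %#lx, set %lu, offset %lu\n",
           (unsigned long)addr, (unsigned long)tag,
           (unsigned long)set, (unsigned long)offset);
    return 0;
}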

14.5.1. Caches on multicore systems

Recall that in shared memory architectures, each core can have its own cache, and that multiple cores can share a common cache. Figure 1 depicts an example dual-core CPU. While each core has its own local L1 cache, the cores share a common (or unified) L2 cache.

Figure 1. An example dual-core CPU with separate L1 caches and a shared L2 cache.

Multiple threads in a single executable may execute separate functions. Let’s consider one such example on the dual-core processor in Figure 1, where each core executes a separate thread concurrently. The thread assigned to Core 0 has a local variable x, the thread executing on Core 1 has a local variable y, and both threads have shared access to a global variable g. Table 1 shows one possible path of execution.

Table 1. Problematic data sharing due to caching

Time   Core 0         Core 1
0      g = 5          (other work)
1      (other work)   y = g*4
2      x += g         y += g*2

Suppose the initial value of g is 10, and the initial values of x and y are both 0. What is the final value of y at the end of this sequence of operations? This is a difficult question to answer, since at this point there are at least three stored values of g: one in Core 0’s L1 cache, one in Core 1’s L1 cache, and a separate copy of g stored in the shared L2 cache.

Figure 2. A problematic update to caches.

Figure 2 shows one possible result after the sequence of operations in Table 1 completes. Suppose the L1 caches implement a write-back policy. When the thread executing on Core 0 writes the value 5 to g, it updates only the value of g in Core 0’s L1 cache. The value of g in Core 1’s L1 cache still remains 10, as does the copy in the shared L2 cache. Even if a write-through policy is implemented, there is no guarantee that the copy of g stored in Core 1’s L1 cache gets updated! In this case, Core 1 computes y = 10 × 4 = 40 during time step 1 and adds 10 × 2 = 20 during time step 2, so y ends up with the final value of 60.

Cache coherence is a strategy for ensuring that every cache maintains a consistent view of shared memory. When a write to a shared data value is made in one cache, cache coherence policies either invalidate or update the cached copies of that value in the other caches. For example, snoopy caches "snoop" on the memory bus for write signals. If a write to a shared variable is detected, the line containing that variable is then invalidated in the other caches.

Employing a snoopy cache for the scenario above would yield the following sequence of operations:

  • During time step 0, the value 5 is written to the shared variable g in Core 0’s L1 cache. The snooping mechanism sees the write and invalidates the line containing g in Core 1’s L1 cache and in the shared L2 cache. The write to g then updates Core 0’s L1 cache and the shared L2 cache; once the value 5 is written to the shared L2 cache, the valid bit associated with the line containing g in L2 is set back to 1.

  • During time step 1, an access to g is attempted in Core 1’s L1 cache. However, since the line that g is on has a valid bit set to 0 (indicating that it was invalidated), the L2 cache is checked. The line in L2 associated with g has a valid bit set to 1, so the block containing the new value of g is swapped into Core 1’s L1 cache. As a result, the block associated with g in Core 1 now contains the value 5. The valid bit associated with the line containing g is then set to 1. When the computation y = g * 4 is performed, y is set to 20.

  • When y is updated again during time step 2 by Core 1, a lookup on g shows that the cache line is still valid (recall that only writes invalidate the cache line!). Therefore the value 10 is added to y, yielding the final desired answer of 30.
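
The effect of invalidate-on-write snooping can also be mimicked in software. The toy model below is purely illustrative: the cache_copy structure and the read_g/write_g helpers are invented for this sketch, and real hardware tracks whole cache lines rather than single variables. Still, it walks through the same three time steps and prints the expected value of 30.

/* Toy software model of invalidate-on-write snooping for a single variable g.
 * Not how real hardware is implemented; for intuition only. */
#include <stdio.h>

struct cache_copy {
    int value;
    int valid;
};

static int l2_g = 10;                                 /* shared L2 copy of g */
static struct cache_copy l1[2] = {{10, 1}, {10, 1}};  /* per-core L1 copies  */

/* write g through one core's L1: update that L1 and the shared L2,
 * then "snoop" and invalidate the copy in the other core's L1 */
void write_g(int core, int value) {
    l1[core].value = value;
    l1[core].valid = 1;
    l2_g = value;
    l1[1 - core].valid = 0;
}

/* read g from one core's L1, refilling from L2 if the copy was invalidated */
int read_g(int core) {
    if (!l1[core].valid) {
        l1[core].value = l2_g;
        l1[core].valid = 1;
    }
    return l1[core].value;
}

int main(void) {
    int y = 0;
    write_g(0, 5);              /* time step 0: Core 0 writes g = 5      */
    y = read_g(1) * 4;          /* time step 1: Core 1 computes y = g*4  */
    y += read_g(1) * 2;         /* time step 2: Core 1 computes y += g*2 */
    printf("y = %d\n", y);      /* prints y = 30                         */
    return 0;
}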

14.5.2. False Sharing

While cache coherence guarantees correctness, it can potentially harm performance. Recall that when the thread on Core 0 updates g, the snooping mechanism invalidates not just g, but the entire cache line that g is a part of, in the other caches.

Consider the initial attempt at parallelizing the countElems() function of the CountSort algorithm. For convenience, the function is reproduced below:

/*parallel version of step 1 (first cut) of CountSort algorithm:
 * extracts arguments from args value
 * calculates the component of the array that thread is responsible for counting
 * computes the frequency of all the elements in assigned component and stores
 * the associated counts of each element in counts array
*/
void *countElems( void *args ){
    //extract arguments
    //omitted for brevity
    int *array = myargs->ap;
    long *counts = myargs->countp;

    //assign work to the thread
    //compute chunk, start, and end
    //omitted for brevity

    long i;
    int val;

    //heart of the program
    for (i = start; i < end; i++){
        val = array[i];
        counts[val] = counts[val] + 1;
    }

    return NULL;
}

In our previous discussion of this function, we pointed out how data races can cause the counts array to not populate with the correct set of counts. Let’s see what happens if we attempt to time this function. We add timing code to main() using gettimeofday() in the exact manner shown in countElems_p_v3.c. Benchmarking the initial version of countElems shown above on 100 million elements yields the following times:

$ ./countElems_p 100000000 0 1
Time for Step 1 is 0.336239 s

$ ./countElems_p 100000000 0 2
Time for Step 1 is 0.799464 s

$ ./countElems_p 100000000 0 4
Time for Step 1 is 0.767003 s

Even without any synchronization constructs, this version of the program still gets slower as the number of threads increases!
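
For reference, the elapsed times above come from the usual gettimeofday() pattern. The sketch below shows that pattern in isolation; run_step1() is a placeholder standing in for the thread creation and joining code, not a function from countElems_p_v3.c.

#include <stdio.h>
#include <sys/time.h>

void run_step1(void) {
    /* placeholder: in the real program, the counting threads are
       created and joined here */
}

int main(void) {
    struct timeval tstart, tend;

    gettimeofday(&tstart, NULL);   /* record the start time */
    run_step1();                   /* run step 1 of CountSort */
    gettimeofday(&tend, NULL);     /* record the end time */

    double elapsed = (tend.tv_sec - tstart.tv_sec)
                   + (tend.tv_usec - tstart.tv_usec) / 1000000.0;
    printf("Time for Step 1 is %g s\n", elapsed);
    return 0;
}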

To understand what is going on, let’s revisit the counts array. The counts array holds the frequency of each number’s occurrence in our input array. The maximum value is determined by the variable MAX, which is set to 10 in our example program, so counts has 10 elements. Since each element is a long (8 bytes on a typical 64-bit system), the counts array takes up only 80 bytes of space.
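
A quick way to double-check that size is to ask the compiler; this standalone snippet assumes MAX is 10, as in the example program:

#include <stdio.h>

#define MAX 10   /* matches the example program */

int main(void) {
    long counts[MAX];
    /* prints 80 on a typical 64-bit system, where sizeof(long) is 8 */
    printf("counts occupies %zu bytes\n", sizeof(counts));
    return 0;
}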

Now, let’s learn a bit more about the cache on our system. Type the following command to display cache information for a core (cpu0).

$ ls /sys/devices/system/cpu/cpu0/cache
index0/  index1/  index2/  index3/  power/  uevent

The output indicates that the core referred to as cpu0 has four caches (index0 through index3). To get the details of each cache (including the type of cache it is and what it is used for), examine each index directory’s type and level files:

$ cat /sys/devices/system/cpu/cpu0/cache/index*/type
Data
Instruction
Unified
Unified
$ cat /sys/devices/system/cpu/cpu0/cache/index*/level
1
1
2
3

The type output tells us that the core referred to by cpu0 has separate data and instruction caches, and two unified (shared) caches. Correlating the level output with the type output reveals that the data and instruction caches are both L1 caches, while the unified caches are L2 and L3 caches respectively.

Looking more closely at the level 1 data cache (index0) and its corresponding line size shows:

$ ls /sys/devices/system/cpu/cpu0/cache/index0
coherency_line_size      power            type
level                    shared_cpu_list  uevent
number_of_sets           shared_cpu_map   ways_of_associativity
physical_line_partition  size

$ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64

From this information, we can glean that the L1 cache line size for this machine is 64 bytes. In other words, the 80-byte counts array spans at most two cache lines.
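
The line size can also be queried from inside a program. On Linux with glibc, sysconf() exposes it through the _SC_LEVEL1_DCACHE_LINESIZE extension (the sketch below may print 0 or -1 on systems that do not report the value):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension; the call returns the
       L1 data cache line size in bytes, or 0/-1 if it is unavailable. */
    long line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1 data cache line size: %ld bytes\n", line_size);
    return 0;
}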

Recall that with cache coherence, every time a program updates a shared variable, the entire cache line in other caches storing the variable is invalidated. Let’s consider what happens when two threads execute the above function. Here’s one possible path of execution (assume that each thread is assigned to a separate core, and the variable x is local to each thread):

Table 2. A possible execution sequence of two threads running countElems

Time   Thread 0                                        Thread 1
i      reads array[x] (1)                              ...
i+1    increments counts[1] (invalidates cache line)   reads array[x] (4)
i+2    reads array[x] (6)                              increments counts[4] (invalidates cache line)
i+3    increments counts[6] (invalidates cache line)   reads array[x] (2)
i+4    reads array[x] (3)                              increments counts[2] (invalidates cache line)
i+5    increments counts[3] (invalidates cache line)   ...

  • During time step i, thread 0 reads the value at array[x] in its part of the array, which is a 1 in this example.

  • During time steps i+1 to i+5, each thread reads a value from array[x]. Note that each thread is looking at different components of the array. Not only that, each read of array in our sample execution yields unique values (so no race conditions in this sample execution sequence!). After reading the value from array[x], each thread increments the associated value in counts.

  • Recall that the counts array spans only one or two cache lines in our L1 cache. As a result, every write to counts invalidates the line holding that part of counts in every other L1 cache.

  • The end result is that, even though the threads update different memory locations in counts, the cache lines containing counts are invalidated with every update!

Each invalidation forces the other L1 caches to re-fetch the line with a "valid" version from L2. The repeated invalidation and overwriting of lines in the L1 caches is an example of thrashing, where repeated conflicts in the cache cause a series of misses.

The addition of more cores makes the problem worse, since now more L1 caches are invalidating the line. As a result, adding additional threads slows down the run time, despite the fact that each thread is accessing different elements of the counts array! This is an example of false sharing, or an illusion that individual elements are being shared by multiple cores. In the example above, it appears all the cores are accessing the same elements of counts, even though this is not the case.

14.5.3. Fixing False Sharing

One way to fix an instance of false sharing is to pad the array (in our case counts) with additional elements so that the elements updated by different threads do not fall on the same cache line. However, padding can waste memory and may not eliminate the problem on all architectures (consider the scenario in which two different machines have different L1 cache line sizes). In most cases, writing code to support different cache line sizes is not worth the gain in performance.
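
As an illustration, one common padding scheme gives each counter its own cache line by spacing the elements a full line apart. The sketch below assumes a 64-byte line size and is not the approach this section ultimately recommends:

#include <stdio.h>

#define MAX 10                                   /* as in the example program */
#define CACHE_LINE_SIZE 64                       /* assumed L1 line size      */
#define PAD (CACHE_LINE_SIZE / sizeof(long))     /* longs per cache line (8)  */

/* counter i lives at padded_counts[i * PAD], so no two counters share a line */
static long padded_counts[MAX * PAD];

int main(void) {
    int val = 3;                          /* a sample value read from the array */
    padded_counts[val * PAD] += 1;        /* the update a thread would perform  */
    printf("count for %d is %ld\n", val, padded_counts[val * PAD]);
    return 0;
}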

A better solution is to have threads write to local storage whenever possible. Local storage in this context refers to memory that is local to a thread. The following solution reduces false sharing by performing updates on a locally declared version of counts called local_counts.

Let’s revisit the final version of our countElems() function (reproduced from countElems_p_v3.c):

/*parallel version of step 1 of CountSort algorithm (final attempt with mutexes):
 * extracts arguments from args value
 * calculates the component of the array that thread is responsible for counting
 * computes the frequency of all the elements in assigned component and stores
 * the associated counts of each element in counts array
*/
void *countElems( void *args ){
    //extract arguments
    //omitted for brevity
    int *array = myargs->ap;
    long *counts = myargs->countp;

    long local_counts[MAX] = {0}; //local declaration of counts array

    //assign work to the thread
    //compute chunk, start, and end values (omitted for brevity)

    long i;
    int val;

    //heart of the program
    for (i = start; i < end; i++){
        val = array[i];
        local_counts[val] = local_counts[val] + 1; //updates local counts array
    }

    //update to global counts array
    pthread_mutex_lock(&mutex); //acquire the mutex lock
    for (i = 0; i < MAX; i++){
        counts[i] += local_counts[i];
    }
    pthread_mutex_unlock(&mutex); //release the mutex lock

    return NULL;
}

The use of local_counts to accumulate frequencies in lieu of counts is the main source of the reduction in false sharing in this example:

for (i = start; i < end; i++){
    val = array[i];
    local_counts[val] = local_counts[val] + 1; //updates local counts array
}

Since cache coherence is meant to maintain a consistent view of shared memory, invalidations trigger only on writes to shared values in memory. Since each thread’s local_counts array is not shared with the other threads, a write to it does not invalidate cache lines in the other cores’ caches.

In the last component of the code, the mutex enforces correctness by ensuring that only one thread updates the shared counts array at a time:

//update to global counts array
pthread_mutex_lock(&mutex); //acquire the mutex lock
for (i = 0; i < MAX; i++){
    counts[i] += local_counts[i];
}
pthread_mutex_unlock(&mutex); //release the mutex lock

Since the counts array still occupies only one or two cache lines, it will still get invalidated with every write. The difference is that the penalty here is at most MAX × t writes versus n writes, where n is the length of our input array and t is the number of threads employed (with MAX = 10 and t = 4 threads, that is at most 40 writes to the shared array instead of 100 million).