14.5. Cache Coherence and False Sharing
Multicore caches can have a profound impact on a multithreaded program’s performance. First, however, let’s quickly review some of the basic concepts related to cache design:

- Data and instructions are not transported individually to the cache. Instead, data is transferred in blocks, and block sizes tend to get larger at lower levels of the memory hierarchy.
- Each cache is organized into a series of sets, with each set having a number of lines. Each line holds a single block of data.
- The individual bits of a memory address determine the set, tag, and block offset used when placing a block of data in the cache.
- A cache hit occurs when the desired data block exists in the cache. Otherwise, a cache miss occurs, and a lookup is performed on the next lower level of the memory hierarchy (which can be another cache or main memory).
- The valid bit indicates whether the block at a particular line in the cache is safe to use. If the valid bit is set to 0, the data block at that line cannot be used (e.g., it could contain data from an exited process).
- Information is written to cache/memory based on two main strategies. In the write-through strategy, the data is written to cache and main memory simultaneously. In the write-back strategy, data is written only to cache and gets written to lower levels in the hierarchy once the block is evicted from the cache.
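As a quick illustration of how the address bits are used, the following sketch splits an address into its tag, set index, and block offset. The specific geometry (64-byte blocks and 128 sets, giving 6 offset bits and 7 index bits) is an assumption chosen for the example, not a value from this section:

#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 6   /* assumed: 64-byte blocks -> 6 offset bits */
#define INDEX_BITS  7   /* assumed: 128 sets       -> 7 index bits  */

int main(void) {
    uintptr_t addr = 0x7f3a92c4;   /* an arbitrary example address */

    uintptr_t offset = addr & ((1UL << OFFSET_BITS) - 1);
    uintptr_t index  = (addr >> OFFSET_BITS) & ((1UL << INDEX_BITS) - 1);
    uintptr_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("tag: %#lx  set index: %lu  block offset: %lu\n",
           (unsigned long) tag, (unsigned long) index, (unsigned long) offset);
    return 0;
}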
14.5.1. Caches on multicore systems
Recall that in shared memory architectures, each core can have its own cache, and that multiple cores can share a common cache. Figure 1 depicts an example dual-core CPU. While each core has its own local L1 cache, the cores share a common (or unified) L2 cache.
Multiple threads in a single executable may execute separate functions. Let’s consider one such example. Assume the dual-core processor in Figure 1, where each core is busy executing a separate thread concurrently. The thread assigned to Core 0 has a local variable x, while the thread executing on Core 1 has a local variable y, and both threads have shared access to a global variable g. Table 1 shows one possible path of execution.
| Time | Core 0       | Core 1       |
|------|--------------|--------------|
| 0    | g = 5        | (other work) |
| 1    | (other work) | y = g*4      |
| 2    | x += g       | y += g*2     |
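In code, this scenario might look like the sketch below. The thread-creation boilerplate is our own illustration (not code from the book’s examples), and the actual interleaving of the two threads is up to the scheduler; the time steps in Table 1 are just one possibility:

#include <pthread.h>
#include <stdio.h>

long g = 10;   /* shared global variable */

/* the thread assigned to Core 0 in Table 1 */
void *thread0(void *arg) {
    long x = 0;        /* local to this thread */
    g = 5;             /* time step 0 */
    /* (other work) during time step 1 */
    x += g;            /* time step 2 */
    printf("x = %ld\n", x);
    return NULL;
}

/* the thread assigned to Core 1 in Table 1 */
void *thread1(void *arg) {
    long y = 0;        /* local to this thread */
    /* (other work) during time step 0 */
    y = g * 4;         /* time step 1 */
    y += g * 2;        /* time step 2 */
    printf("y = %ld\n", y);
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, thread0, NULL);
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}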
Suppose the initial value of g is 10, and the initial values of x and y are both 0. What is the final value of y at the end of this sequence of operations? This is a very difficult question to answer, since there are at least three stored values of g: one in Core 0’s L1 cache, one in Core 1’s L1 cache, and a separate copy of g stored in the shared L2 cache.

Figure 2 shows one possible result after the sequence of operations in Table 1 completes. Suppose the L1 caches implement a write-back policy. When the thread executing on Core 0 writes the value 5 to g, it only updates the value of g in Core 0’s L1 cache. The value of g in Core 1’s L1 cache still remains 10, as does the copy in the shared L2 cache. Even if a write-through policy is implemented, there is no guarantee that the copy of g stored in Core 1’s L1 cache gets updated! In this case, y will have the final value of 60 (40 from computing y = g*4 with the stale value 10, plus another 20 from y += g*2).
Cache coherence is a strategy of ensuring that each cache maintains a consistent view of shared memory. Cache coherence policies either invalidate or update cached copies of shared values in other caches when a write to the shared data value is made in one cache. For example, snoopy caches "snoop" on the bus for possible write signals. If a write to a shared variable is detected, the line containing that variable is then invalidated in all caches.
Employing a snoopy cache for the scenario above would yield the following sequence of operations:
- During time step 0, 5 is written to shared variable g in Core 0’s L1 cache. The snoopy cache sees the write to g and invalidates the line containing g in Core 1’s L1 cache and in the shared L2 cache. As a result, the write to g results in an update to Core 0’s L1 cache and the shared L2 cache. Once the value 5 is written to the shared L2 cache, the valid bit associated with the line containing g in the L2 cache is set to 1.
- During time step 1, an access to g is attempted in Core 1’s L1 cache. However, since the line that g is on has a valid bit set to 0 (indicating that it was invalidated), the L2 cache is checked. The line in L2 associated with g has a valid bit set to 1, so the block containing the new value of g is swapped into Core 1’s L1 cache. As a result, the block associated with g in Core 1 now contains the value 5. The valid bit associated with the line containing g is then set to 1. When the computation y = g * 4 is performed, y is set to 20.
- When y is updated again during time step 2 by Core 1, a lookup on g shows that the cache line is still valid (recall that only writes invalidate the cache line!). Therefore, the value 10 (g * 2) is added to y, yielding the final desired answer of 30.
14.5.2. False Sharing
While cache coherence guarantees correctness, it can potentially harm performance. Recall that when the thread updates g on Core 0, the snoopy cache does not invalidate only g, but the entire cache line that g is a part of.
Consider the initial attempt at parallelizing the countElems() function of the CountSort algorithm. For convenience, the function is reproduced below:
/* parallel version of step 1 (first cut) of CountSort algorithm:
 * extracts arguments from args value
 * calculates the component of the array that thread is responsible for counting
 * computes the frequency of all the elements in assigned component and stores
 * the associated counts of each element in counts array
 */
void *countElems(void *args) {
    //extract arguments
    //omitted for brevity
    int *array = myargs->ap;
    long *counts = myargs->countp;

    //assign work to the thread
    //compute chunk, start, and end
    //omitted for brevity

    long i;
    int val;

    //heart of the program
    for (i = start; i < end; i++) {
        val = array[i];
        counts[val] = counts[val] + 1;
    }

    return NULL;
}
In our previous discussion of this function, we pointed out how data races can cause the counts array to not populate with the correct set of counts. Let’s see what happens if we attempt to time this function. We add timing code to main() using gettimeofday() in the exact manner shown in countElems_p_v3.c. Benchmarking the initial version of countElems as shown above on 100 million elements yields the following times:
$ ./countElems_p 100000000 0 1
Time for Step 1 is 0.336239 s

$ ./countElems_p 100000000 0 2
Time for Step 1 is 0.799464 s

$ ./countElems_p 100000000 0 4
Time for Step 1 is 0.767003 s
Even without any synchronization constructs, this version of the program still gets slower as the number of threads increases!
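The timing pattern itself is straightforward; here is a minimal sketch of how gettimeofday() brackets the region being measured (the surrounding variable names are placeholders rather than the exact code in countElems_p_v3.c):

#include <stdio.h>
#include <sys/time.h>

int main(void) {
    struct timeval tstart, tend;

    gettimeofday(&tstart, NULL);   /* record the start time */

    /* ... create the worker threads, run step 1, join the threads ... */

    gettimeofday(&tend, NULL);     /* record the end time */

    /* convert the two timestamps into elapsed seconds */
    double elapsed = (tend.tv_sec - tstart.tv_sec)
                     + (tend.tv_usec - tstart.tv_usec) / 1000000.0;
    printf("Time for Step 1 is %g s\n", elapsed);
    return 0;
}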
To understand what is going on, let’s revisit the counts array. The counts array holds the frequency of the occurrence of each number in our input array. The maximum value is determined by the variable MAX. In our example program, MAX is set to 10. In other words, the counts array takes up 40 bytes of space.
Now, let’s learn a bit more about the cache on our system. Type the following command to display cache information for a core (cpu0):
$ ls /sys/devices/system/cpu/cpu0/cache
index0/  index1/  index2/  index3/  power/  uevent
The output indicates that the core indicated by cpu0 has 4 caches (index0 - index3). To get the details of each cache (including the type of cache it is and what it is used for), examine each index directory’s type and level files:
$ cat /sys/devices/system/cpu/cpu0/cache/index*/type
Data
Instruction
Unified
Unified

$ cat /sys/devices/system/cpu/cpu0/cache/index*/level
1
1
2
3
The type output tells us that the core referred to by cpu0 has separate data and instruction caches, and two unified (shared) caches. Correlating the level output with the type output reveals that the data and instruction caches are both L1 caches, while the unified caches are the L2 and L3 caches, respectively.
Looking closer at the level 1 data cache and its corresponding line size shows:
$ ls /sys/devices/system/cpu/cpu0/cache/index0
coherency_line_size      power            type
level                    shared_cpu_list  uevent
number_of_sets           shared_cpu_map   ways_of_associativity
physical_line_partition  size

$ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64
From this information, it can be gleaned that the L1 cache line size for this machine is 64 bytes. In other words, the 40-byte counts array fits within one cache line.
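The line size can also be queried from within a program. A small sketch using sysconf() follows; the _SC_LEVEL1_DCACHE_LINESIZE name is a glibc extension rather than a POSIX guarantee, so this is an assumption about the platform:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* glibc extension: the L1 data cache line size in bytes;
     * may report 0 or -1 if the value is unknown on this system */
    long line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1 data cache line size: %ld bytes\n", line_size);
    return 0;
}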
Recall that with cache coherence, every time a program updates a shared variable, the entire cache line in other caches storing the variable is invalidated. Let’s consider what happens when two threads execute the above function. Here’s one possible path of execution (assume that each thread is assigned to a separate core, and the variable x is local to each thread):
| Time | Thread 0                                      | Thread 1                                      |
|------|-----------------------------------------------|-----------------------------------------------|
| i    | reads array[x] (1)                            | …                                             |
| i+1  | increments counts[1] (invalidates cache line) | reads array[x] (4)                            |
| i+2  | reads array[x] (6)                            | increments counts[4] (invalidates cache line) |
| i+3  | increments counts[6] (invalidates cache line) | reads array[x] (2)                            |
| i+4  | reads array[x] (3)                            | increments counts[2] (invalidates cache line) |
| i+5  | increments counts[3] (invalidates cache line) | …                                             |
- During time step i, thread 0 reads the value at array[x] in its part of the array, which is 1 in this example.
- During time steps i+1 to i+5, each thread reads a value from array[x]. Note that each thread is looking at different components of the array. Not only that, each read of array in our sample execution yields unique values (so there are no race conditions in this sample execution sequence!). After reading the value from array[x], each thread increments the associated value in counts.
- Recall that the counts array fits on a single cache line in our L1 cache. As a result, every write to counts invalidates the entire line in every other L1 cache.
- The end result is that, despite updating different memory locations in counts, any cache line containing counts is invalidated with every update to counts!
The invalidation forces all L1 caches to update the line with a "valid" version from L2. The repeated invalidation and overwriting of lines from the L1 cache is an example of thrashing, where repeated conflicts in the cache cause a series of misses.
The addition of more cores makes the problem worse, since now more L1 caches are invalidating the line. As a result, adding additional threads slows down the run time, despite the fact that each thread is accessing different elements of the counts array! This is an example of false sharing, or the illusion that individual elements are being shared by multiple cores. In the example above, it appears that all the cores are accessing the same elements of counts, even though this is not the case.
14.5.3. Fixing False Sharing
One way to fix an instance of false sharing is to pad the array (in our case counts) with additional elements so that it doesn’t fit in a single cache line. However, padding can waste memory, and may not eliminate the problem on all architectures (consider the scenario in which two different machines have different L1 cache line sizes). In most cases, writing code to support different cache line sizes is generally not worth the gain in performance.
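To make the trade-off concrete, here is one common padding pattern (a sketch of the general idea, not code from countElems_p_v3.c): each counter is wrapped in a struct padded out to an assumed 64-byte line, so updates to different counters touch different lines, at the cost of several times more memory:

#define MAX 10
#define LINE_SIZE 64   /* assumed L1 cache line size -- not portable */

/* each counter occupies its own (assumed) cache line */
struct padded_count {
    long value;
    char pad[LINE_SIZE - sizeof(long)];
} __attribute__((aligned(LINE_SIZE)));   /* GCC/Clang-specific alignment */

/* the counts array now spans MAX cache lines instead of fitting in one */
struct padded_count counts[MAX];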
A better solution is to have threads write to local storage whenever possible. Local storage in this context refers to memory that is local to a thread. The following solution reduces false sharing by performing updates to a locally declared version of counts called local_counts.
Let’s revisit the final version of our countElems() function (reproduced from countElems_p_v3.c):
/* parallel version of step 1 of CountSort algorithm (final attempt with mutexes):
 * extracts arguments from args value
 * calculates the component of the array that thread is responsible for counting
 * computes the frequency of all the elements in assigned component and stores
 * the associated counts of each element in counts array
 */
void *countElems(void *args) {
    //extract arguments
    //omitted for brevity
    int *array = myargs->ap;
    long *counts = myargs->countp;
    long local_counts[MAX] = {0}; //local declaration of counts array

    //assign work to the thread
    //compute chunk, start, and end values (omitted for brevity)

    long i;
    int val;

    //heart of the program
    for (i = start; i < end; i++) {
        val = array[i];
        local_counts[val] = local_counts[val] + 1; //updates local counts array
    }

    //update to global counts array
    pthread_mutex_lock(&mutex); //acquire the mutex lock
    for (i = 0; i < MAX; i++) {
        counts[i] += local_counts[i];
    }
    pthread_mutex_unlock(&mutex); //release the mutex lock

    return NULL;
}
The use of local_counts to accumulate frequencies in lieu of counts is the major source of the reduction in false sharing in this example:
for (i = start; i < end; i++) {
    val = array[i];
    local_counts[val] = local_counts[val] + 1; //updates local counts array
}
Since cache coherence is meant to maintain a consistent view of shared memory, the invalidations trigger only on writes to shared values in memory. Since local_counts is not shared amongst the different threads, a write to it will not invalidate its associated cache line.
In the last component of the code, the mutex enforces correctness by ensuring that only one thread updates the shared counts array at a time:
//update to global counts array
pthread_mutex_lock(&mutex); //acquire the mutex lock
for (i = 0; i < MAX; i++) {
    counts[i] += local_counts[i];
}
pthread_mutex_unlock(&mutex); //release the mutex lock
Since counts is located on a single cache line, it will still get invalidated with every write. The difference is that the penalty here is at most MAX × t writes to the shared array rather than n writes, where n is the length of our input array and t is the number of threads employed. For our run above (n = 100 million, MAX = 10, t = 4), that is at most 40 updates to the shared counts array instead of 100 million.