10. Leveraging Shared Memory in the Multicore Era

The world is changed.

I feel it in the silica.

I feel it in the transistor.

I see it in the core.

~ With apologies to Galadriel (Lord of the Rings: Fellowship of the Ring)

Until now, our discussion of architecture has focused on a purely single-CPU world. But the world has changed. Today’s CPUs have multiple cores, or compute units. In this chapter, we discuss multicore architectures, and how to leverage them to speed up the execution of programs.

CPUs, Processors, and Cores

In many instances in this chapter, the terms processor and CPU are used interchangeably. At a fundamental level, a processor is any circuit that performs some computation on external data. Based on this definition, the central processing unit (CPU) is an example of a processor. A processor or a CPU with multiple compute cores is referred to as a multicore processor or a multicore CPU. A core is a compute unit that contains many of the components that make up the classical CPU: an ALU, registers, and a bit of cache. Although a core is different from a processor, it is not unusual to see these terms used interchangeably in the literature (especially if the literature originated at a time when multicore processors were still considered novel).

In 1965, Intel cofounder Gordon Moore estimated that the number of transistors in an integrated circuit would double every year. His prediction, now known as Moore’s Law, was later revised to transistor counts doubling every two years. Despite the evolution of electronic switches from Bardeen’s transistor to the tiny chip transistors used in modern computers, Moore’s Law has held true for the past 50 years. However, the turn of the millennium saw processor design hit several critical performance walls:

  • The memory wall: Improvements in memory technology did not keep pace with improvements in clock speed, resulting in memory becoming a bottleneck to performance. As a result, continuing to speed up a CPU’s execution no longer improves overall system performance.

  • The power wall: Increasing the number of transistors on a processor necessarily increases that processor’s temperature and power consumption, which in turn increases the required cost to power and cool the system. With the proliferation of multicore systems, power is now the dominant concern in computer system design.

The power and memory walls caused computer architects to change the way they designed processors. Instead of adding more transistors to increase the speed at which a CPU executes a single stream of instructions, architects began adding multiple compute cores to a CPU. Compute cores are simplified processing units that contain fewer transistors than traditional CPUs and are generally easier to create. Combining multiple cores on one CPU allows the CPU to execute multiple independent streams of instructions at once.
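
To make the idea of independent instruction streams concrete, here is a minimal sketch in C, assuming a POSIX system with the pthreads library (the function name countUp and the loop bound are our own illustrative choices). Each thread executes the same function on private data, and on a multicore CPU the operating system can schedule the two threads on different cores simultaneously:

/* streams.c: a minimal sketch of two independent instruction streams.
 * Assumes a POSIX system; compile with: gcc -o streams streams.c -pthread */
#include <pthread.h>
#include <stdio.h>

/* Each thread executes this function independently of the other. */
void *countUp(void *arg) {
    long id = (long) arg;       /* thread identifier passed in by main */
    unsigned long sum = 0;
    long i;
    for (i = 0; i < 100000000; i++) {
        sum += i;               /* private work; no data shared between threads */
    }
    printf("thread %ld finished (sum = %lu)\n", id, sum);
    return NULL;
}

int main(void) {
    pthread_t thread1, thread2;

    /* On a multicore CPU, the OS may run these threads on different cores at once. */
    pthread_create(&thread1, NULL, countUp, (void *) 1L);
    pthread_create(&thread2, NULL, countUp, (void *) 2L);

    /* Wait for both instruction streams to complete. */
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    return 0;
}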

More cores != better

It may be tempting to assume that all cores are equal and that the more cores a computer has, the better it performs. This is not necessarily the case! For example, graphics processing unit (GPU) cores have even fewer transistors than CPU cores and are specialized for particular tasks involving vectors. A typical GPU can have 5,000 or more GPU cores. However, GPU cores are limited in the types of operations that they can perform and are not always suitable for general-purpose computing in the way that CPU cores are. Computing with GPUs is known as manycore computing. In this chapter, we concentrate on multicore computing. See Chapter 15 for a discussion of manycore computing.

Taking a Closer Look: How Many Cores?

Almost all modern computer systems have multiple cores, including small devices like the Raspberry Pi. Identifying the number of cores on a system is critical for accurately measuring the performance of multicore programs. On Linux systems, the lscpu command provides a summary of a system’s architecture. In the following example, we show the output of the lscpu command when run on a sample machine (some output is omitted to emphasize the key features):

$ lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
Model name:            Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
CPU MHz:               1607.562
CPU max MHz:           3900.0000
CPU min MHz:           1600.0000
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
...

The lscpu command gives a lot of useful information, including the processor type, the core speed, and the number of cores. To calculate the number of physical (or actual) cores on a system, multiply the number of sockets by the number of cores per socket. The sample lscpu output above shows that the system has one socket with four cores per socket, or four physical cores in total.
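
A program can also query the core count at runtime. Here is a minimal sketch, assuming a POSIX-style system; note that _SC_NPROCESSORS_ONLN is a widely supported extension (e.g., on Linux) rather than part of the core POSIX standard, and that it reports logical cores, matching the "CPU(s)" field of lscpu rather than the physical core count:

/* ncores.c: query the number of logical cores available at runtime.
 * Compile with: gcc -o ncores ncores.c */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* _SC_NPROCESSORS_ONLN reports logical (hyperthreaded) cores,
     * like lscpu's CPU(s) field, not the number of physical cores. */
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical cores online: %ld\n", ncores);
    return 0;
}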

Hyperthreading

At first glance, it may appear that the system in the previous example has eight cores in total. After all, this is what the "CPU(s)" field seems to imply. However, that field actually indicates the number of hyperthreaded (logical) cores, not the number of physical cores. Hyperthreading, or simultaneous multithreading (SMT), enables the efficient processing of multiple threads on a single core. Although hyperthreading can decrease the overall runtime of a program, performance on hyperthreaded cores does not scale at the same rate as on physical cores, because the threads share the core's execution hardware. The advantage is that if one task idles (e.g., due to a control hazard), another task can still utilize the core. In short, hyperthreading was introduced to improve process throughput (which measures the number of processes that complete in a given unit of time) rather than process speedup (which measures the runtime improvement of an individual process). Much of our discussion of performance in the coming sections will focus on speedup.
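
For the sample system, the arithmetic works out as follows: 1 socket × 4 cores per socket × 2 threads per core = 8 logical cores, which is exactly what the "CPU(s)" field reports.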

Performance Cores and Efficiency Cores

On some newer architectures (such as Intel’s 12th generation processors and newer), multiplying the number of sockets by the number of cores per socket and the number of hardware threads per core yields a different (usually smaller) number than what is shown in the "CPU(s)" field. What’s going on here? The answer lies in the new heterogeneous architectures being developed by chip makers. For example, starting with its 12th generation processors, Intel has introduced an architecture that consists of a mixture of "Performance" cores (P-cores) and "Efficiency" cores (E-cores). The goal of this hybrid design is to delegate smaller background tasks to the smaller power-sipping E-cores, freeing the larger power-hungry P-cores for compute-intensive tasks. A similar principle drives the design of the earlier big.LITTLE mobile architecture introduced by Arm. On heterogeneous architectures, the default output of lscpu displays only the available P-cores; the number of E-cores can typically be computed by subtracting the number of logical P-cores (sockets × cores per socket × threads per core) from the total shown in the "CPU(s)" field. Invoking the lscpu command with its --all and --extended flags displays a full mapping of the P-cores and E-cores on a system, with the E-cores identifiable by their lower processor speeds.
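
As a concrete illustration (using published specifications for Intel’s Core i9-12900K, a 12th generation chip): the chip has 8 hyperthreaded P-cores and 8 non-hyperthreaded E-cores, so the multiplication accounts for 1 socket × 8 cores per socket × 2 threads per core = 16 logical P-cores, while the "CPU(s)" field reports 24, leaving 24 − 16 = 8 E-cores. The extended per-core listing can be requested as follows (the exact columns shown vary with the util-linux version installed):

$ lscpu --all --extended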