10. Leveraging Shared Memory in the Multicore Era
The world is changed.
I feel it in the silica.
I feel it in the transistor.
I see it in the core.
~ With apologies to Galadriel (Lord of the Rings: Fellowship of the Ring)
Until now, our discussion of architecture has focused on a purely single-CPU world. But the world has changed. Today’s CPUs have multiple cores, or compute units. In this chapter, we discuss multicore architectures, and how to leverage them to speed up the execution of programs.
CPUs, Processors, and Cores
In many instances in this chapter, the terms processor and CPU are used interchangeably. At a fundamental level, a processor is any circuit that performs some computation on external data. Based on this definition, the central processing unit (CPU) is an example of a processor. A processor or a CPU with multiple compute cores is referred to as a multicore processor or a multicore CPU. A core is a compute unit that contains many of the components that make up the classical CPU: an ALU, registers, and a bit of cache. Although a core is different from a processor, it is not unusual to see these terms used interchangeably in the literature (especially if the literature originated at a time when multicore processors were still considered novel).
In 1965, Gordon Moore (who later co-founded Intel) estimated that the number of transistors in an integrated circuit would double every year. His prediction, now known as Moore’s Law, was later revised to transistor counts doubling every two years. Despite the evolution of electronic switches from Bardeen’s transistor to the tiny chip transistors that are currently used in modern computers, Moore’s Law has held true for the past 50 years. However, the turn of the millennium saw processor design hit several critical performance walls:
- The memory wall: Improvements in memory technology did not keep pace with improvements in clock speed, making memory a bottleneck to performance. As a result, continuing to speed up a CPU’s execution no longer improves overall system performance.
- The power wall: Increasing the number of transistors on a processor necessarily increases that processor’s temperature and power consumption, which in turn increases the cost required to power and cool the system. With the proliferation of multicore systems, power is now the dominant concern in computer system design.
The power and memory walls caused computer architects to change the way they designed processors. Instead of adding more transistors to increase the speed at which a CPU executes a single stream of instructions, architects began adding multiple compute cores to a CPU. Compute cores are simplified processing units that contain fewer transistors than traditional CPUs and are generally easier to create. Combining multiple cores on one CPU allows the CPU to execute multiple independent streams of instructions at once.
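To make the idea of multiple independent instruction streams concrete, here is a minimal sketch in C using POSIX threads. The worker function and its busywork loop are illustrative only, not taken from any particular program. On a multicore CPU, the operating system is free to schedule each thread on a different core, so the two loops can truly execute at the same time.

/* sketch: two independent instruction streams via POSIX threads
   compile with: gcc -o streams streams.c -lpthread (illustrative) */
#include <pthread.h>
#include <stdio.h>

/* each thread runs its own copy of this loop, independently of the other */
static void *worker(void *arg) {
    long id = (long) arg;
    unsigned long count = 0;
    for (unsigned long i = 0; i < 100000000UL; i++) {
        count += i % 7;   /* busywork standing in for real computation */
    }
    printf("thread %ld finished (count = %lu)\n", id, count);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    /* on a multicore CPU, the OS may place each thread on its own core */
    pthread_create(&t1, NULL, worker, (void *) 1L);
    pthread_create(&t2, NULL, worker, (void *) 2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Timing this program on a single core versus several cores is an easy way to observe the speedup that independent instruction streams make possible.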
More cores != better
It may be tempting to assume that all cores are equal and that the more cores a computer has, the better it is. This is not necessarily the case! For example, graphics processing unit (GPU) cores have even fewer transistors than CPU cores and are specialized for particular tasks involving vectors. A typical GPU can have 5,000 or more GPU cores. However, GPU cores are limited in the types of operations that they can perform and are not always suitable for general-purpose computing in the way that CPU cores are. Computing with GPUs is known as manycore computing. In this chapter, we concentrate on multicore computing. See Chapter 15 for a discussion of manycore computing.
Taking a Closer Look: How Many Cores?
Almost all modern computer systems have multiple cores, including small devices like the Raspberry Pi. Identifying the number of cores on a system is critical for accurately measuring the performance of multicore programs. On Linux systems, the lscpu command provides a summary of a system’s architecture. In the following example, we show the output of the lscpu command when run on a sample machine (some output is omitted to emphasize the key features):
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
Model name:            Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
CPU MHz:               1607.562
CPU max MHz:           3900.0000
CPU min MHz:           1600.0000
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
...
The lscpu command gives a lot of useful information, including the processor type, the core speed, and the number of cores. To calculate the number of physical (or actual) cores on a system, multiply the number of sockets by the number of cores per socket. The sample lscpu output above shows that the system has one socket with four cores per socket, or four physical cores in total.
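A program can also query this information at run time. The sketch below assumes a POSIX system and uses sysconf with _SC_NPROCESSORS_ONLN, which reports the number of online logical CPUs (hardware threads) rather than physical cores; on the sample machine above it would print 8, and dividing by the Thread(s) per core value of 2 recovers the 4 physical cores.

/* sketch: query the number of online logical CPUs on a POSIX system */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* _SC_NPROCESSORS_ONLN counts logical CPUs (hardware threads),
       not physical cores: 8 on the sample machine above */
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncpus < 0) {
        perror("sysconf");
        return 1;
    }
    printf("online logical CPUs: %ld\n", ncpus);
    return 0;
}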