14. Leveraging Shared Memory in the Multicore Era

The world is changed.

I feel it in the silica.

I feel it in the transistor.

I see it in the core.

~ With apologies to Galadriel (LoTR: Fellowship of the Ring)

Until now, our discussion of architecture has focused on a purely single-CPU world. But the world has changed. Today’s CPUs have multiple cores, or compute units. In this chapter, we discuss multicore architectures and how to leverage them to speed up the execution of programs.

CPUs, Processors and Cores

In many instances in this chapter, the terms processor and CPU are used interchangeably. At a fundamental level, a processor is any circuit that performs some computation on external data. Based on this definition, the Central Processing Unit (CPU) is an example of a processor. A processor or a CPU with multiple compute cores is referred to as a multicore processor or a multicore CPU. A core is a compute unit that contains many of the components that make up the classical CPU: an ALU, registers and a bit of cache. While a core is different from a processor, it is not unusual to see these terms used interchangeably in the literature (especially if the literature originated at a time when multicore processors were still considered novel).

In 1965, Gordon Moore, who would go on to co-found Intel, estimated that the number of transistors in an integrated circuit would double every year. His prediction, now known as Moore’s Law, was later revised to transistor counts doubling every two years. Despite the evolution of electronic switches from Bardeen’s transistor to the tiny chip transistors that are currently used in modern computers, Moore’s Law has held true for the last 50 years. However, the turn of the millennium saw processor design hit several critical performance walls:

  • The Memory Wall — Improvements in memory technology did not keep pace with improvements in clock speed, resulting in memory becoming a bottleneck to performance. As a result, continuously speeding up the execution of a CPU no longer improves its overall system performance.

  • The Power Wall — Increasing the number of transistors on a processor necessarily increases that processor’s temperature and power consumption, which in turn increases the required cost to power and cool the system. With the proliferation of multicore systems, power is now the dominant concern in computer system design.

The power and memory walls caused computer architects to change the way they designed processors. Instead of adding more transistors to increase the speed at which a CPU executes a single stream of instructions, architects began adding multiple compute cores to a CPU. Compute cores are simplified processing units that contain fewer transistors than traditional CPUs and are generally easier to create. Combining multiple cores on one CPU allows the CPU to execute multiple independent streams of instructions at once.

More cores != better

It may be tempting to assume that all cores are equal and that the more cores a computer has, the better it is. This is not necessarily the case! For example, graphics processing unit (GPU) cores have even fewer transistors than CPU cores, and are specialized for particular tasks involving vectors. A typical GPU can have 5,000 or more GPU cores. However, GPU cores are limited in the types of operations they can perform, and unlike CPU cores, they are not always suitable for general-purpose computing. Computing with GPUs is known as manycore computing. In this chapter, we concentrate on multicore computing. See chapter 15 for a discussion of manycore computing.

Taking a Closer Look: How Many Cores?

Almost all modern computer systems have multiple cores, including small devices like the Raspberry Pi. Identifying the number of cores on a system is critical for accurately measuring the performance of multicore programs. On Linux computers, the lscpu command provides a summary of a system’s architecture. (lscpu is part of the util-linux package and is not included with macOS; there, sysctl hw.physicalcpu hw.logicalcpu reports the core counts.) Below we show the output of the lscpu command when run on a sample machine (some output is omitted to emphasize the key features):

$ lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
Model name:            Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
CPU MHz:               1607.562
CPU max MHz:           3900.0000
CPU min MHz:           1600.0000
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
...

The lscpu command gives a lot of useful information, including the type of processors, the core speed, and the number of cores. To calculate the number of physical (or actual) cores on a system, multiply the number of sockets by the number of cores per socket. The sample lscpu output shown above shows that the system has 1 socket with 4 cores per socket, or 4 total physical cores.
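The calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names parse_lscpu and physical_cores are hypothetical (they are not part of lscpu or any library), and the SAMPLE string simply reuses four fields from the sample output shown earlier. Field labels may differ across lscpu versions and locales.

```python
# Fields copied from the sample lscpu output shown earlier.
SAMPLE = """\
CPU(s):                8
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
"""


def parse_lscpu(text):
    """Turn 'Key:   value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon, such as "..."
            fields[key.strip()] = value.strip()
    return fields


def physical_cores(fields):
    """Physical cores = number of sockets x cores per socket."""
    return int(fields["Socket(s)"]) * int(fields["Core(s) per socket"])


fields = parse_lscpu(SAMPLE)
print(physical_cores(fields))  # 1 socket x 4 cores per socket = 4
```

On a live Linux system, the same helpers could instead be given the text returned by subprocess.run(["lscpu"], capture_output=True, text=True).stdout.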

Hyper-threading

At first glance, it may appear that the above system has 8 total cores. After all, this is what the CPU(s) field seems to imply. However, the CPU(s) field actually indicates the number of hyper-threaded (logical) cores, not the number of physical cores: 1 socket × 4 cores per socket × 2 threads per core = 8. Hyper-threading, or Simultaneous Multi-threading (SMT), enables the efficient processing of multiple threads on a single core: if one task idles (e.g., due to a control hazard), another task can still utilize the core. While hyper-threading can decrease the overall run-time of a program, performance on hyper-threaded cores does not scale at the same rate as on physical cores. In short, hyper-threading was introduced to improve process throughput (which measures the number of processes that complete in a given unit of time) rather than process speedup (which measures the amount of run-time improvement of an individual process). Much of our discussion of performance in the coming chapter will focus on speedup.
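To make the distinction between throughput and speedup concrete, here is a small worked example in Python. The timings are invented purely for illustration and are not measurements of any real processor: we assume a process takes 10 seconds with a core to itself, and that two such processes each take 16 seconds when they share one hyper-threaded core. The function names are likewise hypothetical.

```python
def throughput(processes_completed, elapsed_seconds):
    """Processes completed per second."""
    return processes_completed / elapsed_seconds


def speedup(time_before, time_after):
    """Run-time improvement of one process; values below 1 mean it got slower."""
    return time_before / time_after


# Hypothetical timings, chosen only for illustration.
ALONE = 10.0   # seconds for one process with a core to itself
SHARED = 16.0  # seconds per process when two share a hyper-threaded core

back_to_back = throughput(2, 2 * ALONE)  # one after the other: 2 / 20 = 0.1
side_by_side = throughput(2, SHARED)     # both at once:        2 / 16 = 0.125
per_process = speedup(ALONE, SHARED)     # 10 / 16 = 0.625

print(back_to_back, side_by_side, per_process)
```

Under these assumed numbers, the core completes more processes per second when the two run side by side, even though each individual process takes longer than it would alone. That is the sense in which hyper-threading targets throughput rather than speedup.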