14.1. Programming Multicore Systems

Most of the common languages that programmers know today were created prior to the multicore age. As a result, many languages cannot implicitly (or automatically) employ multicore processors to speed up the execution of a program. Instead, programmers must specifically write software to leverage the multiple cores on a system.

14.1.1. The impact of multicore systems on process execution

Recall that a process can be thought of as an abstraction of a running program. Each process executes in its own virtual address space. The operating system schedules processes for execution on the CPU; a context switch occurs when the CPU changes which process it currently executes.

Figure 1 illustrates how five example processes may execute on a single-core CPU:

concurrency example with 5 processes
Figure 1. An execution time sequence for five processes as they share a single CPU core.

The horizontal axis is time, with each time slice taking one unit of time. A box indicates when a process is using the single core CPU. Assume that each process executes for one full time slice before a context switch occurs. So, Process 1 uses the CPU during time steps T1 and T3.

In the above example, the order of process execution is P1, P2, P1, P2, P4, P2, P3, P4, P5, P3, P5. We take a moment here to distinguish between two measures of time. The CPU time measures the amount of time a process takes to execute on a CPU. In contrast, the wall clock time measures the amount of time a human perceives a process takes to complete. The wall clock time is often significantly longer than the CPU time, due to context switches. For example, process 1’s CPU time requires two time units, while its wall clock time is three time units.

When the total execution time of one process overlaps with another, the processes are running concurrently with each other. Operating systems employed concurrency in the single-core era to give the illusion that a computer can execute many things at once (e.g., you can have a calculator program, a web browser, and a word processing document all open at the same time). In truth, each process executes serially and the operating system determines the order in which processes execute and complete (which often differs in subsequent runs).

In the example above, observe that Process 1 and Process 2 run concurrently with each other, since their executions overlap at time points T2-T4. Likewise, Process 2 runs concurrently with Process 4, as their executions overlap at time points T4-T6. In contrast, Process 2 does not run concurrently with Process 3, as they share no overlap in their execution — Process 3 only starts running at time T7, while Process 2 completes at time T6.

A multicore CPU enables the operating system to schedule a different process to each available core, allowing processes to execute simultaneously. The simultaneous execution of instructions from processes running on multiple cores is referred to as parallel execution. Figure 2 shows how our example processes might execute on a dual-core system:

parallel example with 2 cores
Figure 2. An execution time sequence for five processes, extended to include two CPU cores (one in dark blue, another in light green).

In this example, the two CPU cores are colored differently. Suppose the process execution order is again P1, P2, P1, P2, P4, P2, P3, P4, P5, P3, P5. The presence of multiple cores enables certain processes to execute sooner. For example, during time unit T1, the first core executes Process 1 while the second core executes Process 2. At time T2, the first core executes Process 2 while the second executes Process 1. Thus, Process 1 finishes executing after time T2, while Process 2 finishes executing at time T3.

Note that the parallel execution of multiple processes increases just the number of processes that execute at any one time. In Figure 2, all the processes complete execution by time unit T7. However, each individual process still requires the same amount of CPU time to complete as shown in Figure 1. For example, Process 2 requires three time units regardless of execution on a single or multicore system (i.e., its CPU time remains the same). A multicore processor increases the throughput of process execution, or the number of processes that can complete in a given period of time. Thus, while the CPU time of an individual process remains unchanged, its wall time may decrease.

14.1.2. Expediting Process Execution with Threads

One way to speed up the execution of a single process is to decompose it into lightweight, independent execution flows called threads. Figure 3 shows how a process' virtual address space changes when it is multi-threaded with two threads. While each thread has its own private allocation of call stack memory, all threads share the program data, instructions, and the heap allocated to the multi-threaded process.

multithread process with 4 threads
Figure 3. Comparing the virtual address space of a single threaded and a multi-threaded process with 2 threads.

The operating system schedules threads in the same manner as it schedules processes. On a multicore processor, the O.S. can speed up the execution of a multi-threaded program by scheduling the different threads to run on separate cores. The maximum number of threads that can execute in parallel is equal to the number of physical cores on the system. If the number of threads exceed the number of physical cores, the remaining threads must wait their turn to execute (similar to how processes execute on a single core).

An Example: Scalar Multiplication

As an initial example of how to use multi-threading to speed up an application, consider the problem of performing scalar multiplication of an array array and some integer s. In scalar multiplication, each element in the array is scaled by multiplying the element with s.

A serial implementation of a scalar multiplication function is shown below:

void scalar_multiply(int * array, long length, int s) {
    for (i = 0; i < length; i++) {
      array[i] = array[i] * s;
    }
}

Suppose array has N total elements. To create a multi-threaded version of this application with t threads, it is necessary to:

  1. Create t threads.

  2. Assign each thread a subset of the input array (i.e. N/t elements).

  3. Instruct each thread to multiply the elements in its array subset by s.

Suppose the serial implementation of scalar_multiply spends 60 seconds multiplying an input array of 100 million elements. To build a version that executes with t=4 threads, we assign each thread one fourth of the total input array (25 million elements).

multi-threaded process on one core
Figure 4. Running four threads on a single core CPU.

Figure 4 shows what happens when we run 4 threads on a single core. As before, the execution order is left to the operating system. In this scenario, assume that the thread execution order is (Thread 1, Thread 3, Thread 2, Thread 4). On a single core processor (represented by the squares), each thread executes sequentially. Thus, the multi-threaded process running on one core will still take 60 seconds to run (perhaps a little longer, given the overhead of creating threads).

multi-threaded process on two cores
Figure 5. Running four threads on a dual-core CPU.

Now suppose we run our multi-threaded process on a dual-core system. Figure 5 shows the result. Again, assume t=4 threads, and that the thread execution order is (Thread 1, Thread 3, Thread 2, Thread 4). Our two cores are represented by shaded squares. Since the system is dual-core, Thread 1 and Thread 3 execute in parallel during time step T1. Threads 2 and 3 then execute in parallel during time step T2. Thus, the multi-threaded process that originally took 60 seconds to run now runs in 30 seconds.

multi-threaded process on four cores
Figure 6. Running four threads on a quad-core CPU.

Finally, suppose that the multi-threaded process (t=4) is run on a quad-core CPU. Figure 6 shows one such execution sequence. Each of the four cores in Figure 6 is shaded differently. On the quad-core system, each thread executes in parallel during time slice T1. Thus, on a quad-core CPU, the multi-threaded process that originally took 60 seconds now runs in 15 seconds.

In general, if the number of threads match the number of cores (c) and the operating system schedules each thread to run on a separate core in parallel, then the multi-threaded process should run in approximately 1/c of the time. Such linear speedup is ideal, but not frequently observed in practice. For example, if there are many other processes (or multi-threaded processes) waiting to use the CPU, they will all compete for the limited number of cores, resulting in resource contention amongst the processes. If the number of specified threads exceeds the number of CPU cores, each thread must wait its turn to run. We explore other factors that often prevent linear speedup later in this chapter.