5.7. Pipelining: Making the CPU faster

Our four-stage CPU takes 4 cycles to execute one instruction: the first cycle is used to fetch the instruction from memory; the second to decode the instruction and read operands from the register file; the third for the ALU to execute the operation; and the fourth to write back the ALU result to a register in the register file. Executing a sequence of N instructions takes 4 × N clock cycles, as each instruction is executed one at a time, in order, by the CPU.
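The 4 × N cycle count can be checked with a tiny calculation (a sketch; the constant and function names here are ours, not from the text):

```python
# Each instruction passes through 4 stages, one stage per cycle, and the
# next instruction cannot start until the current one fully completes.
STAGES = 4

def unpipelined_cycles(n_instructions):
    """Total cycles to execute n instructions one at a time, in order."""
    return STAGES * n_instructions

print(unpipelined_cycles(3))  # prints 12
```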

Figure 1. Executing 3 instructions takes 12 total cycles.

Figure 1 shows three instructions taking a total of 12 cycles to execute, 4 cycles per instruction, resulting in a CPI of 4 (CPI is the average number of cycles to execute an instruction). However, the control circuitry of the CPU can be improved to achieve a better (lower) CPI value.

Consider the pattern of execution in which each instruction takes 4 cycles, followed by the next instruction taking 4 cycles, and so on: the CPU circuitry that implements each stage is actively involved in instruction execution only once every 4 cycles. For example, after the fetch stage, the fetch circuitry performs no useful action related to executing an instruction for the next 3 clock cycles. If, however, the fetch circuitry could continue to fetch subsequent instructions during those 3 cycles, the CPU could complete the execution of more than a single instruction every 4 cycles.

CPU pipelining is the idea of starting the execution of the next instruction before the current instruction has fully completed its execution. CPU pipelining executes instructions in order, but it allows the execution of a sequence of instructions to overlap. For example, in the first cycle the first instruction enters its fetch stage of execution. In the second cycle, the first instruction moves to its decode stage, and the second instruction simultaneously enters its fetch stage. In the third cycle, the first instruction moves to its execution stage, the second instruction to its decode stage, and the third instruction is fetched from memory. In the fourth cycle, the first instruction moves to its write-back stage and completes, the second instruction moves to its execution stage, the third to its decode stage, and the fourth instruction enters its fetch stage. At this point, the CPU pipeline of instructions is full—every CPU stage is actively executing program instructions, where each subsequent instruction is one stage behind its predecessor. When the pipeline is full, the CPU completes the execution of one instruction every clock cycle!
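The cycle-by-cycle overlap described above can be sketched as a small simulation (a sketch; the stage names and function name are ours):

```python
STAGES = ["Fetch", "Decode", "Execute", "WriteBack"]

def pipeline_schedule(n_instructions):
    """Return, for each cycle, which instruction (numbered from 1) occupies
    each pipeline stage, or None if that stage is idle."""
    total_cycles = n_instructions + len(STAGES) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = []
        for stage in range(len(STAGES)):
            # Instruction i enters Fetch at cycle i-1 and reaches stage s
            # at cycle (i-1)+s, so the occupant of stage s is cycle-s+1.
            instr = cycle - stage + 1
            row.append(instr if 1 <= instr <= n_instructions else None)
        schedule.append(row)
    return schedule

for cycle, row in enumerate(pipeline_schedule(4), start=1):
    print(f"cycle {cycle}: " + ", ".join(
        f"{s}=i{i}" if i else f"{s}=idle" for s, i in zip(STAGES, row)))
```

Running this shows the pipeline filling: by cycle 4 every stage is busy, with instruction 4 in Fetch, 3 in Decode, 2 in Execute, and 1 in WriteBack.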

Figure 2. Pipelining: overlapping instruction execution to achieve 1 instruction completed per cycle. The circle indicates that the CPU has reached the steady state of completing one instruction every cycle.

Figure 2 shows an example of pipelined instruction execution through our CPU. Starting with the 4th clock cycle, the pipeline fills, meaning that the CPU completes the execution of one instruction every cycle, achieving a CPI of 1 (shown in the circle in Figure 2). Notice that the total number of cycles required to execute a single instruction (the instruction latency) has not decreased in pipelined execution—it still takes 4 cycles for each instruction to execute. Instead, pipelining increases instruction throughput, or the number of instructions that the CPU can execute in a given period of time, by overlapping the execution of sequential instructions in a staggered manner through the different stages of the pipeline.
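The latency-versus-throughput distinction comes down to a bit of arithmetic: with S pipeline stages, N instructions finish in S + (N - 1) cycles rather than S × N, so the average cycles per instruction approaches 1 as N grows (a sketch; function names are ours):

```python
STAGES = 4

def unpipelined_cycles(n):
    """One instruction at a time: every instruction costs all 4 stages."""
    return STAGES * n

def pipelined_cycles(n):
    """The first instruction takes STAGES cycles to fill the pipeline;
    each later instruction completes one cycle after its predecessor."""
    return STAGES + (n - 1)

for n in (3, 100, 1_000_000):
    cpi = pipelined_cycles(n) / n
    print(n, unpipelined_cycles(n), pipelined_cycles(n), round(cpi, 3))
```

For 3 instructions the pipelined total is 6 cycles (matching Figure 2's steady state), and for a million instructions the average CPI is essentially 1, even though each individual instruction still has a 4-cycle latency.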

Since the 1970s, computer architects have used pipelining as a way to drastically improve the performance of microprocessors. However, pipelining comes at the cost of a more complicated CPU design than one that does not support pipelined execution. Extra storage and control circuitry are needed to support pipelining. For example, multiple instruction registers may be needed to store the multiple instructions currently in the pipeline. This added complexity is almost always worth the large improvement in CPI that pipelining provides, and as a result, most modern microprocessors implement pipelined execution.

The idea of pipelining is also used in other contexts in computer science to speed up execution, and it applies to many non-CS tasks as well. Consider, for example, the task of doing multiple loads of laundry using a single washing machine. If completing one load of laundry consists of four steps (washing, drying, folding, and putting away clothes), then after washing the first load, the second load can go into the washing machine while the first load is in the dryer, overlapping the steps of individual loads to speed up the total time it takes to wash four loads. Factory assembly lines are another example of pipelining.

In our discussion of how a CPU executes program instructions and of CPU pipelining, we used a simple 4-stage pipeline and an example ADD instruction. To execute instructions that load and store values between memory and registers, a 5-stage pipeline is used: it adds a Memory stage for memory access, giving Fetch-Decode-Execute-Memory-WriteBack. Different processors may have fewer or more pipeline stages than the typical 5-stage pipeline. For example, the initial ARM architecture had 3 stages (Fetch, Decode, and Execute, where the Execute stage performed both the ALU execution and the register file write-back). More recent ARM architectures have pipelines with more than 5 stages. The initial Intel Pentium architectures had a 5-stage pipeline, but later architectures had significantly more stages; the Intel Core i7, for example, has a 14-stage pipeline.