Dive Into Systems

5.8. Advanced Pipelined Instruction Considerations

Recall that pipelining improves the performance of a processor by overlapping the execution of multiple instructions. In our earlier discussion on pipelining, we described a simple four-stage pipeline with the basic stages of Fetch (F), Decode (D), Execute (E) and WriteBack (W). In our discussion that follows, we also consider a fifth stage, Memory (M), which represents an access to data memory. Our five-stage pipeline therefore comprises the following stages:

Fetch (F): reads an instruction from memory (pointed to by the program counter).
Decode (D): reads source registers and sets control logic.
Execute (E): executes the instruction.
Memory (M): reads from or writes to data memory.
WriteBack (W): stores a result in a destination register.

Recall that the compiler transforms lines of code into a series of machine code instructions for the CPU to execute. Assembly code is a human-readable version of machine code. The snippet below displays a series of made-up assembly instructions:

MOV M[0x84], Reg1     # move value at memory address 0x84 to register Reg1
ADD 2, Reg1, Reg1     # add 2 to value in Reg1 and store result in Reg1
MOV 4, Reg2           # copy the value 4 to register Reg2
ADD Reg2, Reg2, Reg2  # compute Reg2 + Reg2, store result in Reg2
JMP L1<0x14>          # jump to executing code at L1 (code address 0x14)

Don’t worry if you are having trouble parsing the snippet — we cover assembly in greater detail in upcoming chapters. For now, it suffices to focus on the following set of facts:

Every ISA defines a set of instructions.
Each instruction operates on one or more operands (i.e. registers, memory, or constant values).
Not all instructions require the same number of pipeline stages to execute.

In our previous discussion, it was assumed that every instruction takes the same number of cycles to execute; however, this is usually not the case. For example, the first MOV instruction requires all five stages, as it requires the movement of data from memory to a register. In contrast, the next three instructions require only four stages (F, D, E, W) to execute given that the operations involve only registers, and not memory. The last instruction (JMP) is a type of branch or conditional instruction. Its purpose is to transfer the flow of control to another part of the code. Specifically, addresses in the code region of memory reference different instructions in an executable. Since the JMP instruction does not update a general-purpose register, the WriteBack stage is omitted, resulting in only three stages (F, D, E) being required. We cover conditional instructions in greater detail in the upcoming chapters on assembly.

A pipeline stall results when any instruction is forced to wait for another to finish executing before it can continue. Compilers and processors do whatever they can to avoid pipeline stalls in order to maximize performance.

5.8.1. Pipelining Consideration: Data Hazards

A data hazard occurs when two instructions attempt to access common data in an instruction pipeline. As an example, consider the first pair of instructions from the code snippet above:

MOV M[0x84], Reg1     # move value at memory address 0x84 to register Reg1
ADD 2, Reg1, Reg1     # add 2 to value in Reg1 and store result in Reg1

Figure 1. An example of a pipeline hazard arising from two instructions simultaneously reaching the same pipeline stage.

Recall that this MOV instruction requires five stages (as it involves an access to memory), whereas the ADD instruction requires only four. In this scenario, both instructions will attempt to write to register Reg1 at the same time (see Figure 1).

The processor prevents the aforementioned scenario by first forcing every instruction to take five pipeline stages to execute. For instructions that normally take fewer than five stages, the CPU adds a "no-operation" (NOP) instruction (also called a pipeline "bubble") to substitute for that phase.

However, the problem is still not fully resolved. Since the goal of the second instruction is to add 2 to the value stored in register Reg1, the MOV instruction needs to finish writing to register Reg1 before the ADD instruction can execute correctly. A similar problem exists in the next two instructions:

MOV 4, Reg2           # copy the value 4 to register Reg2
ADD Reg2, Reg2, Reg2  # compute Reg2 + Reg2, store result in Reg2

Figure 2. The processor can reduce the damage caused by pipeline hazards by forwarding operands between instructions.

These two instructions load the value 4 into register Reg2 and then multiply it by 2 (by adding to itself). Once again, bubbles are added to enforce that each instruction takes five pipeline stages. In this case, regardless of the bubbles, the second instruction’s execute phase occurs before the first instruction finishes writing the required value (4) to register Reg2.

Adding more bubbles is a suboptimal solution, because it stalls the pipeline. Instead, processors employ a technique called operand forwarding, in which the pipeline reads the result from the previous operation. Looking at Figure 2, while the instruction MOV 4, Reg2 executes, it forwards its results to the instruction ADD Reg2, Reg2, Reg2. So, while the MOV instruction is writing to register Reg2, the ADD instruction can use the updated value of Reg2 that it received from the MOV instruction.

5.8.2. Pipelining Hazards: Control Hazards

The pipeline is optimized for instructions that occur one after another. Control changes in a program arising from conditionals such as if statements or loops can seriously affect the pipeline performance. Let’s take a look at a different example code snippet, first in C:

int result = *x; // x holds an int
int temp = *y;   // y holds another int

if (result <= temp) {
	result = result - temp;
}
else {
	result = result + temp;
}
return result;

This snippet simply reads integer data from two different pointers, compares the values, and then does different arithmetic based on the result. Here is how the above code snippet may translate into assembly instructions:

  MOV M[0x84], Reg1     # move value at memory address 0x84 to register Reg1
  MOV M[0x88], Reg2     # move value at memory address 0x88 to register Reg2
  CMP Reg1, Reg2        # compare value in Reg1 to value in Reg2
  JLE L1<0x14>          # switch code execution to L1 if Reg1 less than Reg2
  ADD Reg1, Reg2, Reg1  # compute Reg1 + Reg2, store result in Reg1
  JMP L2<0x20>          # switch code execution to L2 (code address 0x20)
L1:
  SUB Reg1, Reg2, Reg1  # compute Reg1 - Reg2, store in Reg1
L2:
  RET                   # return from function

This sequence of instructions loads data from memory into two separate registers, compares the values, and then does different arithmetic based on whether the value in the first register is less than the value in the second. The if statement is represented in the above example with two instructions: the compare (CMP) instruction and a conditional jump less than (JLE) instruction. We cover conditional instructions in greater detail in the upcoming assembly chapters; for now it is sufficient to understand that the CMP instruction compares two registers, while the JLE instruction is a special type of branch instruction that switches code execution to another part of the program if and only if the condition (i.e. less than or equal, in this case) is true.

Don’t get overwhelmed by the details!

Looking at assembly for the first time can be understandably intimidating. If this is how you feel, try not to worry! We cover assembly in much greater detail in coming chapters. The key takeaway is that code containing conditional statements translates to a series of assembly instructions just like any other code snippet. However, unlike other code snippets, conditional statements are not guaranteed to execute in a particular way. The uncertainty surrounding how a conditional statement executes has large ramifications for the pipeline.

Figure 3. An example of a control hazard resulting from a conditional branch.

A control hazard occurs when the pipeline encounters a branch (or conditional) instruction. When this happens, the pipeline has to "guess" whether the branch will be taken. If the branch is not taken, the process continues to execute the next instructions in sequence. Consider the example in Figure 3. If the branch is taken, the next instruction that executes should be the SUB instruction. However, it is impossible to know whether the branch is taken until the JLE instruction finishes executing. At that point, the ADD and JMP instructions have already been loaded into the pipeline. If the branch is taken, these "junk" instructions in the pipeline need to be removed, or flushed, before the pipeline can be reloaded with new instructions. Flushing the pipeline is expensive.

There are a few options that hardware engineers can choose to implement to help the processor deal with control hazards:

Stall the pipeline: As a simple solution, whenever there is a branch, add lots of NOP bubbles and stall the pipeline until the processor is sure that the branch is taken. Although stalling the pipeline will fix the issue, it will also lead to a performance hit (see Figure 4).
Branch prediction: The most common solution is to use a branch predictor, which will predict which way a branch will go, based on previous executions. Modern branch predictors are really good and accurate. However, this approach has recently caused some security vulnerabilities (e.g. Spectre¹). Figure 4 depicts how a branch predictor may deal with the control hazard discussed.
Eager execution: In eager execution, the CPU executes both sides of the branch and performs a conditional transfer of data rather than control (implemented through the cmov and the csel instructions in x86 and ARMv8-A, respectively). A conditional transfer of data enables the processor to continue execution without disrupting the pipeline. However, not all code is capable of taking advantage of eager execution, which can be dangerous in the case of pointer dereferences and side effects.

Figure 4. Potential solutions for handling control hazards.

References

Peter Bright. Google: Software is never going to be able to fix Spectre-type bugs Ars Technica 2019.