5.6. The Processor’s Execution of Program Instructions

Instruction execution is performed in several stages. Different architectures implement different numbers of stages, but most implement the Fetch, Decode, Execute, and WriteBack phases of instruction execution in four or more discrete stages. In discussing instruction execution, we focus on these four stages of execution, and we use an ADD instruction as our example. Our ADD instruction example is encoded in the following way:

the instruction format used as an example
Figure 1. An example instruction format for a three register operation. The instruction is encoded in binary with subsets of its bits corresponding to encodings of different parts of the instruction: the operation (opcode), the two source registers (the operands), and the destination register for storing the result of the operation. The example shows the encoding of an ADD instruction in this format.
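
To make the encoding concrete, here is a small C sketch of how an instruction in this three-register format could be packed into a 32-bit value. The field widths (an 8-bit opcode in the high-order bits followed by three 8-bit register fields) and the opcode value for ADD are assumptions for illustration only; real instruction set architectures choose their own widths and values.

#include <stdint.h>

/* Hypothetical field layout for the format in Figure 1: the 8 high-order
 * bits hold the opcode, followed by 8-bit fields for the two source
 * registers and the destination register. */
#define OPCODE_ADD 0x01   /* made-up opcode value for ADD */

uint32_t encode_rrr(uint8_t opcode, uint8_t src0, uint8_t src1, uint8_t dst) {
    return ((uint32_t)opcode << 24) |   /* operation to perform */
           ((uint32_t)src0   << 16) |   /* first source register */
           ((uint32_t)src1   <<  8) |   /* second source register */
            (uint32_t)dst;              /* destination register */
}

/* Example: ADD with sources Reg1 and Reg3 and destination Reg0,
 * matching the example used in the figures that follow. */
uint32_t example_add_instruction(void) {
    return encode_rrr(OPCODE_ADD, 1, 3, 0);
}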

To execute an instruction, the CPU first Fetches the next instruction from memory into a special-purpose register called the Instruction Register (IR). The memory address of the instruction to fetch is stored in another special-purpose register called the Program Counter (PC). The PC keeps track of the memory address of the next instruction to fetch and is incremented as part of the Fetch stage, so that it stores the memory address of the very next instruction. For example, if all instructions are 32 bits long, then the PC’s value is incremented by 4 (each byte, 8 bits, has a unique address) to store the memory address of the instruction immediately following the one being fetched. Arithmetic circuits that are separate from the ALU increment the PC’s value. The PC’s value may also change during the WriteBack stage. For example, some instructions, such as those associated with the execution of loops, if-else statements, or function calls, jump to specific addresses. Figure 2 shows the Fetch stage of execution.

CPU Fetch stage of execution
Figure 2. The fetch stage of instruction execution: the instruction at the memory address stored in the Program Counter (PC) register is read from memory and stored into the Instruction Register (IR). The PC’s value is also incremented at the end of this stage (if instructions are 4 bytes, then the next address is 1238; the actual instruction size varies by architecture and instruction type).
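
The Fetch stage can be sketched in C as a few lines of a simple CPU simulator. The struct and function names here are made up for illustration, and the sketch assumes fixed-length 32-bit (4-byte) instructions in a byte-addressable memory, as in the example above.

#include <stdint.h>
#include <string.h>

#define MEM_SIZE 4096

struct cpu_state {
    uint32_t pc;                /* Program Counter: address of the next instruction */
    uint32_t ir;                /* Instruction Register: holds the fetched instruction */
    uint8_t  memory[MEM_SIZE];  /* byte-addressable memory */
};

/* Fetch: copy the 4-byte instruction at the address in the PC into the IR,
 * then increment the PC so it holds the address of the next instruction. */
void fetch(struct cpu_state *cpu) {
    memcpy(&cpu->ir, &cpu->memory[cpu->pc], sizeof(cpu->ir));
    cpu->pc += 4;   /* 32-bit instructions: the next instruction is 4 bytes later */
}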

After fetching the instruction, the CPU decodes the instruction bits stored in the IR into four parts: the high-order bits encode the opcode, which specifies the operation to perform (e.g., ADD, SUB, OR, …), and the remaining bits are divided into three subsets that specify the two operand sources and the result destination. In our example, we use registers for both sources and the result destination. The opcode bits are sent on wires that feed the ALU’s selection input, and the source bits are sent on wires that are inputs to the register file: they drive the two read-selection inputs (Sr0 and Sr1) that specify which register values are read from the register file. The decode stage is shown in Figure 3.

CPU Decode stage of execution
Figure 3. The decode stage of instruction execution: separate instruction bits in the IR into components and send them as input to the ALU and register file. The opcode bits in the IR are sent to the ALU selection input to choose which operation to perform. The two sets of operand bits in the IR are sent to the selection inputs of the register file to pick from which registers to read the operand values. The destination bits in the IR are sent to the register file in the WriteBack stage; they specify the register to which the ALU result is written.
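
In software, the Decode stage is just bit masking and shifting. The sketch below reverses the hypothetical 32-bit encoding shown after Figure 1 (8-bit opcode in the high-order bits, then the two source register fields and the destination field); as before, the field widths are assumptions, not the text’s exact encoding.

#include <stdint.h>

struct decoded_instruction {
    uint8_t opcode;   /* operation to perform; sent to the ALU's selection input */
    uint8_t src0;     /* register file read-selection input Sr0 */
    uint8_t src1;     /* register file read-selection input Sr1 */
    uint8_t dst;      /* destination register, used later in the WriteBack stage */
};

/* Decode: split the bits in the IR into opcode, source, and destination fields. */
struct decoded_instruction decode(uint32_t ir) {
    struct decoded_instruction d;
    d.opcode = (ir >> 24) & 0xff;
    d.src0   = (ir >> 16) & 0xff;
    d.src1   = (ir >>  8) & 0xff;
    d.dst    =  ir        & 0xff;
    return d;
}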

After the Decode stage determines the operation to perform and the operand sources, the ALU performs the operation in the next stage, the Execution stage. The ALU’s data inputs come from the two outputs of the register file, and its selection input comes from the opcode bits of the instruction. These inputs propagate through the ALU to produce a result value that is the chosen operation applied to the operand values. In our example, the ALU outputs the result of adding the value stored in Reg1 to the value stored in Reg3, along with the condition code values associated with the result. The execution stage is shown in Figure 4.

execution stage
Figure 4. The Execution Stage of Instruction Execution: the ALU performs the specified operation (from the instruction opcode bits) on its input values (from the Register File outputs).
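
A software analogue of the Execution stage is a switch on the opcode. The opcode values below are made up, the condition code outputs are omitted, and the four operations match the four-operation ALU (OR, ADD, AND, EQUALS) discussed later in this section. In hardware, every operation circuit computes its result in parallel and a multiplexer selects one; the switch statement is only a software stand-in for that selection.

#include <stdint.h>

/* Made-up opcode values for the four ALU operations used in this chapter. */
enum alu_op { OP_OR = 0, OP_ADD = 1, OP_AND = 2, OP_EQUALS = 3 };

/* Execute: apply the operation selected by the opcode to the two operand
 * values read from the register file, producing the ALU's result output. */
uint32_t execute(uint8_t opcode, uint32_t operand0, uint32_t operand1) {
    switch (opcode) {
        case OP_OR:     return operand0 | operand1;
        case OP_ADD:    return operand0 + operand1;   /* e.g., Reg1 + Reg3 */
        case OP_AND:    return operand0 & operand1;
        case OP_EQUALS: return operand0 == operand1;
        default:        return 0;   /* opcode not handled in this sketch */
    }
}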

In the WriteBack stage, the ALU result is stored in the destination register. The register file receives the ALU’s result output on its DataIn input, the destination register (from instruction bits in the IR) on its write-selection input, and 1 on its WE input. For example, if the destination register is Reg0, then the bits encoding Reg0 in the IR are sent as the Sw input to the register file to pick the destination register to which to write. The output from the ALU is sent as the DataIn input to the register file, and the WE bit is set to 1 to enable writing the ALU result into Reg0. The WriteBack stage is shown in Figure 5.

writeback stage
Figure 5. The write back stage of instruction execution: the result of the execution stage (the output from the ALU) is written to the destination register in the register file. The ALU output is the register file’s DataIn input, the destination bits of the instruction is the register file’s write-selection input (Sw), and the WE input is set to 1 to enable writing the DataIn value to the specified destination register.
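
The WriteBack stage, in the same sketch style, writes the ALU result into the register file only when the write-enable input is 1. The register file size and the function name are assumptions for illustration.

#include <stdint.h>

#define NUM_REGS 32   /* register file size is an assumption for this sketch */

/* WriteBack: if the write-enable (WE) input is 1, store the ALU result
 * (DataIn) into the register chosen by the write-selection input (Sw). */
void writeback(uint32_t regfile[NUM_REGS], uint8_t we, uint8_t sw, uint32_t data_in) {
    if (we && sw < NUM_REGS) {
        regfile[sw] = data_in;
    }
}

/* Example: writeback(regs, 1, 0, alu_result) writes the ADD result into Reg0. */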

5.6.1. Clock Driven Execution

A clock drives the CPU’s execution of instructions, triggering the start of each stage of an instruction’s execution. In other words, the CPU uses the clock to determine when the inputs to the circuits associated with each stage are ready for those circuits to use, and the clock controls when the outputs of those circuits represent valid results of one stage that can be used as inputs by the circuits executing the next stage.

A CPU clock measures discrete time as opposed to continuous time: there is a time 0, followed by a time 1, followed by a time 2, and so on for each subsequent clock tick. A processor’s clock cycle time measures the time between clock ticks. A processor’s clock speed (or clock rate) is 1/(clock cycle time). It is typically measured in megahertz (MHz) or gigahertz (GHz): a 1 MHz clock rate corresponds to one million clock ticks per second, and a 1 GHz clock rate to one billion clock ticks per second. The clock rate is a measure of how fast the CPU can run and is an estimate of the maximum number of instructions per second the CPU can execute. For example, on simple scalar processors like our example CPU, a 2 GHz clock might yield a maximum instruction execution rate of 2 billion instructions per second (or 2 instructions every nanosecond).
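
The relationship between clock rate and clock cycle time is simple arithmetic, shown here as a small C helper (the function name is ours, not from the text):

#include <stdio.h>

/* clock rate = 1 / cycle time, so cycle time = 1 / clock rate.
 * Passing the rate in GHz conveniently yields the cycle time in nanoseconds. */
double cycle_time_ns(double clock_rate_ghz) {
    return 1.0 / clock_rate_ghz;
}

int main(void) {
    printf("1 MHz: %8.1f ns per cycle\n", cycle_time_ns(0.001));  /* 1000.0 ns */
    printf("2 GHz: %8.1f ns per cycle\n", cycle_time_ns(2.0));    /*    0.5 ns */
    return 0;
}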

Although increasing the clock rate on a single machine improves its performance, clock rate alone is not a meaningful metric for comparing the performance of different processors. For example, some architectures (such as RISC) require fewer stages to execute instructions than others (such as CISC). An architecture with fewer execution stages and a slower clock may complete the same number of instructions per second as another architecture with a faster clock but more execution stages. For a specific microprocessor, however, doubling its clock speed roughly doubles its instruction execution speed.

Clock Rates and Processor Performance

Historically, increasing the clock rate (along with designing more complicated and powerful microarchitectures that a faster clock can drive) has been a very effective way for computer architects to improve processor performance. For example, in 1974 the Intel 8080 CPU ran at 2 MHz (a clock rate of 2 million cycles per second). The clock rate of the Intel Pentium Pro, introduced in 1995, was 150 MHz (150 million cycles per second), and the clock rate of the Intel Pentium 4, introduced in 2000, was 1.3 GHz (1.3 billion cycles per second). Clock rates peaked in the mid to late 2000s with processors like the IBM z10, which had a clock rate of 4.4 GHz.

Today, however, CPU clock rates have reached their limit due to the difficulty of dissipating the heat produced by faster clocks. This limit is known as the Power Wall. The Power Wall led to the development of multi-core processors starting in the mid-2000s. A multi-core processor has multiple "simple" CPU cores per chip, each core driven by a clock whose rate has not increased over that of the previous generation of cores; multi-core processor design is a way to improve CPU performance without having to increase the CPU clock rate.

Clock Circuit

A clock circuit uses an oscillator circuit to generate a very precise and regular pulse pattern. Typically, a crystal oscillator generates the base frequency of the oscillator circuit, and the pulse pattern of the oscillator is used by the clock circuit to output a pattern of alternating high and low voltages corresponding to an alternating pattern of 1 and 0 binary values. Figure 6 shows an example clock circuit generating a regular pattern of 1 and 0 output.

a clock circuit generates regular pattern of 1 and 0
Figure 6. The regular pattern of 1 and 0 output of a clock circuit. Each sequence of 1 and 0 makes up a clock cycle.

A clock cycle (or tick) is one 1-and-0 subsequence of the clock circuit’s pattern. The transition from a 1 to a 0 or from a 0 to a 1 is called a clock edge. Clock edges trigger state changes in CPU circuits, driving the execution of instructions. The rising clock edge (the transition from 0 to 1 at the beginning of a new clock cycle) indicates a state in which input values are ready for a stage of instruction execution. For example, the rising edge signals that the input values to the ALU circuit are ready. While the clock’s value is 1, these inputs propagate through the circuit until its output is ready; the time this takes is the propagation delay through the circuit. For example, while the clock signal is 1, the ALU’s input values propagate through its operation circuits and then through its multiplexer to produce the correct output for the operation combining the input values. On the falling edge (the transition from 1 to 0), the outputs of the stage are stable and ready to be propagated to the next location (shown as "output ready" in Figure 7). For example, the output from the ALU is ready on the falling edge. While the clock’s value is 0, the ALU’s output propagates to the register file’s inputs. On the next clock cycle, the rising edge indicates that the register file’s input value is ready to write into a register (shown as "new input" in Figure 7).

clock cycle
Figure 7. The rising edge of a new clock cycle triggers changes in the inputs to the circuits it controls. The falling edge triggers when the outputs are valid from the circuits it controls.

The length of the clock cycle (and thus the clock rate) is bounded by the longest propagation delay through any stage of instruction execution. The Execution stage, with propagation through the ALU, is usually the longest stage. Thus, half of the clock cycle time must be at least as long as the time it takes the ALU’s input values to propagate through its slowest operation circuit to the ALU’s outputs (i.e., for the outputs to reflect the result of the operation on the inputs). For example, in our 4-operation ALU (OR, ADD, AND, and EQUALS), the ripple carry adder circuit has the longest propagation delay, so it determines the minimum length of the clock cycle.
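
As a worked example with made-up numbers (the text does not give actual circuit delays): if the ripple carry adder’s propagation delay is 0.8 nanoseconds, then the half-cycle during which the clock is 1 must be at least 0.8 ns, so

\[
\text{cycle time} \geq 2 \times 0.8\ \text{ns} = 1.6\ \text{ns}
\qquad\Longrightarrow\qquad
\text{maximum clock rate} \leq \frac{1}{1.6\ \text{ns}} \approx 625\ \text{MHz}.
\]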

Because it takes one clock cycle to complete one stage of CPU instruction execution, a processor with a four-stage instruction execution sequence (Fetch, Decode, Execute, WriteBack) completes at most one instruction every four clock cycles.

4 clock cycles to complete 1 instruction
Figure 8. Four stage instruction execution takes 4 clock cycles to complete.

If, for example, the clock rate is 1 GHz, one instruction takes 4 nanoseconds to complete (each of the 4 stages taking 1 nanosecond). With a 2 GHz clock rate, one instruction takes only 2 nanoseconds to complete.
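
The same arithmetic can be written as a tiny C helper (a hypothetical function, not from the text), assuming the simple non-pipelined CPU described here, which spends one clock cycle per stage:

/* Time to complete one instruction on a simple CPU that spends
 * one clock cycle per execution stage. */
double instruction_time_ns(double clock_rate_ghz, int num_stages) {
    double cycle_ns = 1.0 / clock_rate_ghz;  /* 1 GHz -> 1 ns per cycle */
    return num_stages * cycle_ns;            /* 4 stages at 1 GHz -> 4 ns */
}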

Although clock rate is a factor in a processor’s performance, clock rate alone is not a meaningful measure of that performance. Instead, the average number of cycles per instruction (CPI) measured over a program’s full execution is a better measure of a CPU’s performance. A processor typically cannot sustain its ideal (minimum) CPI for an entire program’s execution. The higher average CPI is the result of many factors, including the execution of common program constructs that change control flow, such as loops, if-else branching, and function calls. The average CPI for running a set of standard benchmark programs is used to compare different architectures. CPI is a more accurate measure of the CPU’s performance because it measures the CPU’s speed executing a program rather than just one aspect of an individual instruction’s execution. See a computer architecture textbook1 for more details about processor performance and designing processors to improve their performance.
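
These factors are commonly combined into the standard CPU performance equation found in architecture textbooks, including the one cited in the footnote:

\[
\text{total execution time} = \text{number of instructions executed} \times \text{average CPI} \times \text{clock cycle time}
\]

By this measure, a processor with a faster clock but a higher average CPI can run a given program more slowly than a processor with a slower clock and a lower average CPI.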

5.6.2. Putting It All Together: the CPU in a Full Computer

The data path (ALU, register file, and buses that connect them) and the control path (instruction execution circuitry) make up the CPU. Together they implement the processing and control parts of the von Neumann architecture. Today’s processors are implemented as digital circuits etched into silicon chips. The processor chip also includes some fast on-chip cache memory (implemented with latch storage circuits), used to store copies of recently used program data and instructions close to the processor. See the Storage and Memory Hierarchy Chapter for more information about on-chip cache memory.

Figure 9 shows an example of a processor in the context of a complete modern computer, whose components together implement the von Neumann architecture.

a CPU in a modern computer
Figure 9. The CPU in a full modern computer. Buses connect the processor chip, main memory and input and output devices.

5.6.3. Footnotes

  1. One suggestion is "Computer Architecture: A Quantitative Approach", by John Hennessy and David Patterson.