Dive Into Systems

13.1. How the OS Works and How It Runs

Part of the job of the OS is to support programs running on the system. To start a program running on a computer, the OS allocates a portion of RAM for the running program, loads the program’s binary executable from disk into RAM, creates and initializes OS state for the process associated with this running program, and initializes the CPU to start executing the process’s instructions (e.g., the CPU registers need to be initialized by the OS to fetch and execute the process’s instructions). Figure 1 illustrates these steps.

Figure 1. Steps the OS takes to start a new program running on the underlying hardware

Like user programs, the OS is also software that runs on the computer hardware. The OS, however, is special system software that manages all system resources and implements the interface for users of the computer system; it is necessary for using the computer system. Because the OS is software, its binary executable code runs on the hardware just like any other program: its data and instructions are stored in RAM and its instructions are fetched and executed by the CPU just like user’s program instructions are. As a result, for the OS to run, its binary executable needs to be loaded into RAM and the CPU initialized to start running OS code. However, because the OS is responsible for the task of running code on the hardware, it needs some help to get started running.

13.1.1. OS Booting

The process of the OS loading and initializing itself on the computer is known as booting — the OS "pulls itself up by its bootstraps", or boots itself on the computer. The OS needs a little help to initially get loaded onto the computer and to begin running its boot code. To initiate the OS code to start running, code stored in computer firmware (nonvolatile memory in the hardware) runs when the computer first powers up; BIOS (Basic Input/Output System) and UEFI (Unified Extensible Firmware Interface) are two examples of this type of firmware. On power-up, BIOS or UEFI runs and does just enough hardware initialization to load the first chunk of the OS (its boot block) from disk into RAM and to start running boot block instructions on the CPU. Once the OS starts running, it loads the rest of itself from disk, discovers and initializes hardware resources, and initializes its data structures and abstractions to make the system ready for users.

13.1.2. Getting the OS to Do Something: Interrupts and Traps

After the OS finishes booting and initializing the system for use, it then just waits for something to do. Most operating systems are implemented as interrupt-driven systems, meaning that the OS doesn’t run until some entity needs it to do something — the OS is woken up (interrupted from its sleep) to handle a request.

Devices in the hardware layer may need the OS to do something for them. For example, a network interface card (NIC) is a hardware interface between a computer and a network. When the NIC receives data over its network connection, it interrupts (or wakes up) the OS to handle the received data (see Figure 2). For example, the OS may determine that the data received by the NIC is part of a web page that was requested by a web browser; it then delivers the data from the NIC to the waiting web browser process.

Requests to the OS also come from user applications when they need access to protected resources. For example, when an application wants to write to a file, it makes a system call to the OS, which wakes up the OS to perform the write on its behalf (see Figure 2). The OS handles the system call by writing the data to a file stored on disk.

Interrupts to the OS are from the hardware layer and Traps are from the user/program layer

Figure 2. In an interrupt-driven system, user-level programs make system calls, and hardware devices issue interrupts to initiate OS actions.

Interrupts that come from the hardware layer, such as when a NIC receives data from the network, are typically referred to as hardware interrupts, or just interrupts. Interrupts that come from the software layer as the result of instruction execution, such as when an application makes a system call, are typically referred to as traps. That is, a system call "traps into the OS", which handles the request on behalf of the user-level program. Exceptions from either layer may also interrupt the OS. For example, a hard disk drive may interrupt the OS if a read fails due to a bad disk block, and an application program may trigger a trap to the OS if it executes a divide instruction that divides by zero.

System calls are implemented using special trap instructions that are defined as part of the CPU’s instruction set architecture (ISA). The OS associates each of its system calls with a unique identification number. When an application wants to invoke a system call, it places the desired call’s number in a known location (the location varies according to the ISA) and issues a trap instruction to interrupt the OS. The trap instruction triggers the CPU to stop executing instructions from the application program and to start executing OS instructions that handle the trap (run the OS trap handler code). The trap handler reads the user-provided system call number and executes the corresponding system call implementation.

Here’s an example of what a write system call might look like on an IA32 Linux system:

/* C code */
ret = write(fd, buff, size);

# IA32 translation
write:

...            # set up state and parameters for OS to perform write
movl $4, %eax  # load 4 (unique ID for write) into register eax
int  $0x80     # trap instruction: interrupt the CPU and transition to the OS
addl $8, %ebx  # an example instruction after the trap instruction

The first instruction (movl $4, %eax) puts the system call number for write (4) into register eax. The second instruction (int $0x80) triggers the trap. When the OS trap handler code runs, it uses the value in register eax (4) to determine which system call is being invoked and runs the appropriate trap handler code (in this case it runs the write handler code). After the OS handler runs, the OS continues the program’s execution at the instruction right after the trap instruction (addl in this example).

Unlike system calls, which come from executing program instructions, hardware interrupts are delivered to the CPU on an interrupt bus. A device places a signal, typically a number indicating the type of interrupt, on the CPU’s interrupt bus (see Figure 3). When the CPU detects the signal on its interrupt bus, it stops executing the current process’s instructions and starts executing OS interrupt handler code. After the OS handler code runs, the OS continues the process’s execution at the application instruction that was being executed when the interrupt occurred.

Figure 3. A hardware device (disk) sends a signal to the CPU on the interrupt bus to trigger OS execution on its behalf.

If a user program is running on the CPU when an interrupt (or trap) occurs, the CPU runs the OS’s interrupt (or trap) handler code. When the OS is done handling an interrupt, it resumes executing the interrupted user program at the point it was interrupted.

Because the OS is software, and its code is loaded into RAM and run on the CPU just like user program code, the OS must protect its code and state from regular processes running in the system. The CPU helps by defining two execution modes:

In user mode a CPU executes only user-level instructions and accesses only the memory locations that the operating system makes available to it. The OS typically prevents a CPU in user mode from accessing the OS’s instructions and data. User mode also restricts which hardware components the CPU can directly access.
In kernel mode, a CPU executes any instructions and accesses any memory location (including those that store OS instructions and data). It can also directly access hardware components and execute special instructions.

When OS code is run on the CPU, the system runs in kernel mode, and when user-level programs run on the CPU, the system runs in user mode. If the CPU is in user mode and receives an interrupt, the CPU switches to kernel mode, fetches the interrupt handler routine, and starts executing the OS handler code. In kernel mode, the OS can access hardware and memory locations that are not allowed in user mode. When the OS is done handling the interrupt, it restores the CPU state to continue executing user-level code at the point at which the program left off when interrupted and returns the CPU back to user mode (see Figure 4).

Figure 4. The CPU and interrupts. User code running on the CPU is interrupted (at time X on the time line), and OS interrupt handler code runs. After the OS is done handling the interrupt, user code execution is resumed (at time Y on the time line).

In an interrupt-driven system, interrupts can happen at any time, meaning that the OS can switch from running user code to interrupt handler code at any machine cycle. One way to efficiently support this execution context switch from user mode to kernel mode, is to allow the kernel to run within the execution context of every process in the system. At boot time, the OS loads its code at a fixed location in RAM that is mapped into the top of the address space of every process (see Figure 5), and initializes a CPU register with the starting address of the OS handler function. On an interrupt, the CPU switches to kernel mode and executes OS interrupt handler code instructions that are accessible at the top addresses in every process’s address space. Because every process has the OS mapped to the same location at the top of its address space, the OS interrupt handler code is able to execute quickly in the context of any process that is running on the CPU when an interrupt occurs. This OS code can be accessed only in kernel mode, protecting the OS from user-mode accesses; during regular execution a process runs in user mode and cannot read or write to the OS addresses mapped into the top of its address space.

The OS is mapped into every process address space

Figure 5. Process address space: the OS kernel is mapped into the top of every process’s address space.

Although mapping the OS code into the address space of every process results in fast kernel code execution on an interrupt, many modern processors have features that expose vulnerabilities to kernel protections when the OS is mapped into every process like this. As of the January 2018 announcement of the Meltdown hardware exploit¹, operating systems are separating kernel memory and user-level program memory in ways that protect against this exploit, but that also result in less efficient switching to kernel mode to handle interrupts.

13.1.3. References

Meltdown and Spectre. https://meltdownattack.com/