8.1. Diving into Assembly: Basics
For a first look at assembly, we modify the adder()
function from the
beginning of the chapter to simplify its behavior. The modified function
(adder2()
) is shown below:
#include <stdio.h>
//adds two to an integer and returns the result
int adder2(int a) {
return a + 2;
}
int main(){
int x = 40;
x = adder2(x);
printf("x is: %d\n", x);
return 0;
}
To compile this code, use the command:
$ gcc -m32 -o modified modified.c
The -m32
flag tells gcc to compile the code to a 32-bit executable. Forgetting
to include this flag may result in assembly that is wildly different from the
examples shown in this chapter; by default, gcc compiles to x86-64 assembly, the 64-bit
variant of x86. However, virtually all 64-bit architectures have a 32-bit operating
mode for backwards-compatibility. This chapter covers IA32; other chapters
cover x86-64 and ARM. Despite its age, IA32 is still extremely useful for understanding
how programs work and how to optimize code.
Next, let’s view the corresponding assembly of this code by typing the following command:
$ objdump -d modified > output $ less output
Search for the code snippet associated with adder2()
by typing /adder2
while examining
the file output
using less
. The section associated with adder2()
should look
similar to the following:
adder2()
function0804840b <adder2>: 804840b: 55 push %ebp 804840c: 89 e5 mov %esp,%ebp 804840e: 8b 45 08 mov 0x8(%ebp),%eax 8048411: 83 c0 02 add $0x2,%eax 8048414: 5d pop %ebp 8048415: c3 ret
Don’t worry if you don’t understand what’s going on just yet. We will cover assembly in greater detail in future sections. For now, we will study the structure of these individual instructions.
Each line above contains an instruction’s address in
program memory, the bytes corresponding to the instruction, and the plain-text
representation of the instruction itself. For example 55
is the machine code
representation of the instruction push %ebp
and the instruction occurs at address
804840b
in program memory.
It is important to note that a single line of C code often translates to multiple
instructions in assembly. The operation a + 2
is represented by the two instructions
mov 0x8(%ebp), %eax
and add $0x2, %eax
.
Your assembly may look different!
If you are compiling your code along with us, you may notice that some of your assembly examples look different from what is shown in this book. The precise assembly instructions that are outputted by any compiler depends on that compiler’s version and the underlying operating system. Most of the assembly examples in this book were generated on systems running Ubuntu or Red Hat Enterprise Linux (RHEL). In the examples that follow, we do not use any optimization flags. For example,
we compile any example file ( |
8.1.1. Registers
Recall that a register is a word-sized storage unit located directly on the CPU. There may be separate registers for data, instructions and addresses. For example, the Intel CPU has 8 total registers for storing 32-bit data. They are:
%eax
, %ebx
, %ecx
, %edx
, %edi
, %esi
, %esp
, and %ebp
.
Programs can read from or write to all eight of the eight listed registers.
The first six registers all hold general-purpose data, while the last two
are typically reserved by the compiler to hold address data. While a program may interpret
a general-purpose register’s contents as integers or as addresses, the register itself makes no
distinction. The last two registers (%esp
and %ebp
) are known as the
stack pointer and the frame pointer respectively. The compiler reserves the
%esp
and %ebp
registers for operations that maintain the layout of the program
stack. Typically, %esp
points to the top of the program stack, while %ebp
points to the
base of the current stack frame. We discuss stack frames and these two registers
in greater detail in our discussion on
functions.
The last register worth mentioning is %eip
or the instruction pointer
(sometimes called the program counter (PC)). It points to the next
instruction to be executed by the CPU. Unlike the eight registers mentioned
previously, programs cannot write directly to register %eip
.
8.1.2. Advanced register notation
For the first six registers mentioned (%eax
, %ebx
, %ecx
, %edx
, %edi
, and
%esi
), the ISA provides a mechanism to access the lower 16 bits of each
register. The ISA also provides a separate mechanism to access the 8-bit components
of the lower 16-bits of the first four listed registers. Table 1 lists each
of the primary 8 registers and the ISA mechanisms (if available) of accessing their
component bytes:
32-bit register (bits 31-0) | Lower 16 bits (bits 15-0) | Bits 15-8 | Bits 7-0 |
---|---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||
|
|
The lower 16 bits for any of the aforementioned registers can be accessed
by referencing the last two letters in a register’s name. For example, use %ax
to
access the lower 16 bits of %eax
.
The higher and lower bytes within the lower 16 bits of the first four listed registers can be accessed
by taking the last two letters of the register name and replacing the last letter
with either an h
(for higher) or an l
(for lower) depending on which byte
is desired. For example, %al
references the lower 8-bits of register %ax
while %ah
references the higher 8-bits of register %ax
. These 8-bit registers
are commonly used by the compiler for storing single byte values for certain operations,
such as bitwise shifts (a 32-bit register cannot be shifted more than 32 places
and the number 32 requires only a single byte of storage). In general, the compiler will
use the smallest component register needed to complete an operation.
8.1.3. Instruction Structure
Each instruction consists of an operation code (or opcode) that specifies what it
does and one or more operands that tells the instruction how to do it. For example,
the instruction add $0x2, %eax
has the opcode add
and the operands $0x2
and %eax
.
Each operand corresponds to a source or destination location for a specific operation. There are multiple types of operands:
-
Constant (literal) values are preceded by the
$
sign. For example in the instructionadd $0x2, %eax
,$0x2
is a literal value that corresponds to the hexadecimal value0x2
. -
Register forms refer to individual registers. The instruction
add $0x2, %eax
specifies register%eax
as the destination location where the result of theadd
operation will be stored. -
Memory forms correspond to some value inside of main memory (RAM) and are commonly used for address lookups. Memory address forms can contain a combination of registers and constant values. For example, in the instruction
mov 0x8(%ebp),%eax
, the operand0x8(%ebp)
is an example of a memory form. It loosely translates to "add0x8
to the value in register%ebp
, and then perform a memory lookup." If this sounds like a pointer dereference, that is because it is!
8.1.4. An example with operands
The best way to explain operands in detail is to present a quick example. Suppose memory contains the following values:
Address | Value |
---|---|
0x804 |
0xCA |
0x808 |
0xFD |
0x80c |
0x12 |
0x810 |
0x1E |
Let’s also assume the following registers contain values:
Register | Value |
---|---|
%eax |
0x804 |
%ebx |
0x10 |
%ecx |
0x4 |
%edx |
0x1 |
Then the operands in Table 2 evaluate to the values shown below. Each row of the table matches an operand with its form (e.g. constant, register, memory), how it is translated, and its value:
Operand | Form | Translation | Value |
---|---|---|---|
%ecx |
Register |
%ecx |
0x4 |
(%eax) |
Memory |
M[%eax] or M[0x804] |
0xCA |
$0x808 |
Constant |
0x808 |
0x808 |
0x808 |
Memory |
M[0x808] |
0xFD |
0x8(%eax) |
Memory |
M[%eax + 8] or M[0x80c] |
0x12 |
(%eax, %ecx) |
Memory |
M[%eax + %ecx] or M[0x808] |
0xFD |
0x4(%eax, %ecx) |
Memory |
M[%eax + %ecx + 4] or M[0x80c] |
0x12 |
0x800(,%edx,4) |
Memory |
M[0x800 + %edx*4] or M[0x804] |
0xCA |
(%eax, %edx, 8) |
Memory |
M[%eax + %edx*8] or M[0x80c] |
0x12 |
In Table 2, the notation %ecx
indicates value stored in register %ecx. In contrast, M[%eax]
indicates that the
value inside %eax
should be treated as an address, and to dereference (look up) the value at that address. Therefore, the operand (%eax)
corresponds to
M[0x804]
which corresponds to the value 0xCA
.
A few important notes before continuing. While Table 2 shows many valid operand forms, not all forms can be used interchangeably in all circumstances.
Specifically:
-
Constant forms cannot serve as destination operands
-
Memory forms cannot both serve as the source and destination operand in a single instruction
-
In cases of scaling operations (see the last two operands), the scaling factor must be a multiple of 2.
Table 2 is provided as a reference; however, digesting key operand forms will help improve the reader’s speed in parsing assembly language.
8.1.5. Instruction Suffixes
In several cases in upcoming examples, common and arithmetic instructions have a suffix that indicates the size (associated with the type) of the data being operated on at the code level. The compiler automatically translates code to instructions with the appropriate suffix. Table 3 shows the common suffixes for x86 instructions.
Suffix | C Type | Size (bytes) |
---|---|---|
b |
|
1 |
w |
|
2 |
l |
|
4 |
Please note that instructions involved with conditional execution have different suffixes based on the evaluated condition. We cover instructions associated with conditional instructions in a later section.