## 7.8. Matrices

A matrix is a 2D array. A matrix in C can be statically allocated as a 2-dimensional array (`M[n][m]`), dynamically allocated with a single call to `malloc()`, or dynamically allocated as an array of arrays. Let’s consider the array of arrays implementation. The first array contains `n` elements (`M[n]`), and each element `M[i]` in our matrix contains an array of `m` elements. The following code snippets each declare matrices of size 4 x 3:

``````//statically-allocated matrix (allocated on stack)
int M1;

//dynamically-allocated matrix (programmer friendly, allocated on heap)
int **M2, i;
M2 = malloc(4 * sizeof(int*));
for (i = 0; i < 4; i++) {
M2[i] = malloc(3 * sizeof(int));
}``````

In the case of the dynamically allocated matrix, the main array contains a contiguous array of `int` pointers. Each integer pointer points to a different array in memory. Figure 1 illustrates how we would normally visualize each of these matrices: Figure 1. Illustration of a statically-allocated (M1) and dynamically-allocated (M2) 3x4 matrix.

For both of the above matrix declarations, element (i,j) can be accessed using the double indexing syntax `M[i][j]`, where `M` is either `M1` or `M2`. However, the organization of these matrices are different in memory. While both store the elements in their primary array contiguously in memory, our statically allocated matrix also stores all the rows contiguously in memory: Figure 2. Matrix M1’s memory layout in row-major order.

This contiguous ordering is not guaranteed for M2. Recall that to contiguously allocate an n x m matrix on the heap, we should use a single call to `malloc()` that allocates n x m elements:

``````//dynamic matrix (allocated on heap, memory efficient way)
#define ROWS 4
#define COLS 3
int *M3;
M3 = malloc(ROWS*COLS*sizeof(int));``````

Recall that with the declaration of `M3`, element (i,j) cannot be accessed using the `M[i][j]` notation. Instead, we must index the element using the format: `M3[i*COLS + j]`.

### 7.8.1. Contiguous Two Dimensional Arrays

Consider a function `sumMat()` that takes a pointer to a contiguously allocated (either statically allocated or a memory efficient dynamically allocated) matrix as its first parameter, along with a number of rows and columns and returns the sum of all the elements inside the matrix.

We use scaled indexing in the code snippet below, since it applies to both statically and dynamically allocated contiguous matrices. Recall that the syntax `m[i][j]` does not work with the memory efficient contiguous dynamic allocation previously discussed.

``````int sumMat(int *m, int rows, int cols) {
int i, j, total = 0;
for (i = 0; i < rows; i++){
for (j = 0; j < cols; j++){
total += m[i*cols + j];
}
}
}``````

Here is the corresponding assembly. Each line is annotated with its English translation:

```Dump of assembler code for function sumMat:
0x400686 <+0>:     push   %rbp                 # save rbp
0x400687 <+1>:     mov    %rsp,%rbp            # update rbp (new stack frame)
0x40068a <+4>:     mov    %rdi,-0x18(%rbp)     # copy m to %rbp-0x18
0x40068e <+8>:     mov    %esi,-0x1c(%rbp)     # copy rows to %rbp-0x1c
0x400691 <+11>:    mov    %edx,-0x20(%rbp)     # copy cols parameter to %rbp-0x20
0x400694 <+14>:    movl   \$0x0,-0x4(%rbp)      # copy 0 to %rbp-0x4 (total)
0x40069b <+21>:    movl   \$0x0,-0xc(%rbp)      # copy 0 to %rbp-0xc (i)
0x4006a2 <+28>:    jmp    0x4006e1 <sumMat+91> # goto <sumMat+91>
0x4006a4 <+30>:    movl   \$0x0,-0x8(%rbp)      # copy 0 to %rbp-0x8 (j)
0x4006ab <+37>:    jmp    0x4006d5 <sumMat+79> # goto <sumMat+79>
0x4006ad <+39>:    mov    -0xc(%rbp),%eax      # copy i to %eax
0x4006b0 <+42>:    imul   -0x20(%rbp),%eax     # multiply i with cols, place in %eax
0x4006b4 <+46>:    mov    %eax,%edx            # copy i*cols to %edx
0x4006b6 <+48>:    mov    -0x8(%rbp),%eax      # copy j to %eax
0x4006b9 <+51>:    add    %edx,%eax            # add i*cols with j, place in %eax
0x4006bb <+53>:    cltq                        # convert %eax to a 64-bit int
0x4006bd <+55>:    lea    0x0(,%rax,4),%rdx    # multiply (i*cols+j) by 4,put in %rdx
0x4006c5 <+63>:    mov    -0x18(%rbp),%rax     # copy m to %rax
0x4006c9 <+67>:    add    %rdx,%rax            # add m to (i*cols+j)*4, place in %rax
0x4006cc <+70>:    mov    (%rax),%eax          # copy m[i*cols+j] to %eax
0x4006d5 <+79>:    mov    -0x8(%rbp),%eax      # copy j to %eax
0x4006d8 <+82>:    cmp    -0x20(%rbp),%eax     # compare j with cols
0x4006db <+85>:    jl     0x4006ad <sumMat+39> # if (j < cols) goto <sumMat+39>
0x4006e1 <+91>:    mov    -0xc(%rbp),%eax      # copy i to %eax
0x4006e4 <+94>:    cmp    -0x1c(%rbp),%eax     # compare i with rows
0x4006e7 <+97>:    jl     0x4006a4 <sumMat+30> # if (i < rows) goto <sumMat+30>
0x4006e9 <+99>:    mov    -0x4(%rbp),%eax      # copy total to %eax
0x4006ec <+102>:   pop    %rbp                 # clean up stack

Local variables `i`, `j`, and `total` are loaded at addresses `%rbp-0xc`, `%rbp-0x8` and `%rbp-0x4` on the stack, respectively. Input parameters `m`, `row`, and `cols` are stored at locations `%rbp-0x8`, `%rbp-0x1c`, and `%rbp-0x20` respectively. Using this knowledge, let’s zoom in on the component that just deals with the access of element (i,j) in our matrix:

```0x4006ad <+39>:    mov    -0xc(%rbp),%eax      # copy i to %eax
0x4006b0 <+42>:    imul   -0x20(%rbp),%eax     # multiply i with cols, place in %eax
0x4006b4 <+46>:    mov    %eax,%edx            # copy i*cols to %edx```

The first set of instructions shown above calculates the value `i*cols` and places it in register `%edx`. Recall that for a matrix called `matrix`, `matrix+i*cols` is equivalent to `&matrix[i]`.

```0x4006b6 <+48>:    mov    -0x8(%rbp),%eax      # copy j to %eax
0x4006b9 <+51>:    add    %edx,%eax            # add i*cols with j, place in %eax
0x4006bb <+53>:    cltq                        # convert %eax to a 64-bit int
0x4006bd <+55>:    lea    0x0(,%rax,4),%rdx    # multiply (i*cols+j) by 4,put in %rdx```

The next set of instructions computes `(i*cols+j)*4`. The compiler multiplies the index `i*cols+j` by 4 since each element in the matrix is a 4-byte integer and this multiplication enables the compiler to compute the correct offset. The `cltq` instruction on line `<sumMat+53>` is needed to sign-extend the contents of `%eax` into a 64-bit integer, since that is about to be used for address calculation.

Next, the following set of instructions adds the calculated offset to the matrix pointer and dereferences it to yield the value of element (i,j):

```0x4006c5 <+63>:    mov    -0x18(%rbp),%rax     # copy m to %rax
0x4006c9 <+67>:    add    %rdx,%rax            # add m to (i*cols+j)*4, place in %rax
0x4006cc <+70>:    mov    (%rax),%eax          # copy m[i*cols+j] to %eax

The first instruction loads the address of matrix `m` into register `%rax`. The `add` instruction adds `(i*cols + j)*4` to the address of `m` to correctly calculate the offset of element (i,j). The third instruction dereferences the address in `%rax` and places the value in `%eax`. Notice the use of `%eax` as the destination component register; since our matrix contains integers, and integers take up 4 bytes of space, component register %eax is again used instead of `%rax`.

The last instruction adds the value in `%eax` to the accumulator `total`, which is located at stack address `%rbp-0x4`.

Let’s consider how element (1,2) is accessed in Figure 2. For convenience, the figure is reproduced below: Figure 3. Matrix M1’s memory layout in row-major order.

Element (1,2) is located at address `M1 + 1*COLS + 2`. Since `COLS` = `3`, element (1,2) corresponds to `M1+5`. To access the element at this location, the compiler must multiply `5` by the size of the `int` data type (`4` bytes), yielding the offset `M1+20`, which corresponds to byte x20 in the figure. Dereferencing this location yields element 5, which is indeed element (1,2) in the matrix.

### 7.8.2. Non-contiguous Matrix

The non-contiguous matrix implementation is a bit more complicated. Figure 4 visualizes how `M2` may be laid out in memory: Figure 4. Matrix M2’s non-contiguous layout in memory.

Notice that the array of pointers is contiguous, and that each array pointed to by an element of `M2` (e.g. `M2[i]`) is contiguous. However, the individual arrays are not contiguous with each other. Since `M2` is an array of pointers, each element of `M2` takes 8 bytes of space. In contrast, since `M2[i]` is an `int` array, each element of `M2[i]` is 4 bytes away.

The `sumMatrix()` function below takes an array of integer pointers (called `matrix`) as its first parameter, and a number of rows and columns as its second and third parameters:

``````int sumMatrix(int **matrix, int rows, int cols) {
int i, j, total=0;
for (i = 0; i < rows; i++) {
for (j = 0; j < cols; j++) {
total += matrix[i][j];
}
}
}``````

While this function looks nearly identical to the `sumMat()` function shown earlier, the matrix accepted by this function consists of a contiguous array of pointers. Each pointer contains the address of a separate contiguous array, which corresponds to a separate row in the matrix.

The corresponding assembly for `sumMatrix()` follows. Each line is annotated with its English translation.

```Dump of assembler code for function sumMatrix:
0x4006ee <+0>:     push   %rbp                     # save rbp
0x4006ef <+1>:     mov    %rsp,%rbp                # update rbp (new stack frame)
0x4006f2 <+4>:     mov    %rdi,-0x18(%rbp)         # copy matrix to %rbp-0x18
0x4006f6 <+8>:     mov    %esi,-0x1c(%rbp)         # copy rows to %rbp-0x1c
0x4006f9 <+11>:    mov    %edx,-0x20(%rbp)         # copy cols to %rbp-0x20
0x4006fc <+14>:    movl   \$0x0,-0x4(%rbp)          # copy 0 to %rbp-0x4 (total)
0x400703 <+21>:    movl   \$0x0,-0xc(%rbp)          # copy 0 to %rbp-0xc (i)
0x40070a <+28>:    jmp    0x40074e <sumMatrix+96>  # goto <sumMatrix+96>
0x40070c <+30>:    movl   \$0x0,-0x8(%rbp)          # copy 0 to %rbp-0x8 (j)
0x400713 <+37>:    jmp    0x400742 <sumMatrix+84>  # goto <sumMatrix+84>
0x400715 <+39>:    mov    -0xc(%rbp),%eax          # copy i to %eax
0x400718 <+42>:    cltq                            # convert i to 64-bit integer
0x40071a <+44>:    lea    0x0(,%rax,8),%rdx        # multiply i by 8, place in %rdx
0x400722 <+52>:    mov    -0x18(%rbp),%rax         # copy matrix to %rax
0x400726 <+56>:    add    %rdx,%rax                # add i*8 to matrix, place in %rax
0x400729 <+59>:    mov    (%rax),%rax              # copy matrix[i] to %rax (pointer)
0x40072c <+62>:    mov    -0x8(%rbp),%edx          # copy j to %edx
0x40072f <+65>:    movslq %edx,%rdx                # convert j to a 64-bit integer
0x400732 <+68>:    shl    \$0x2,%rdx                # multiply j by 4, place in %rdx
0x400736 <+72>:    add    %rdx,%rax                # add j*4 to matrix[i], put in %rax
0x400739 <+75>:    mov    (%rax),%eax              # copy matrix[i][j] to %eax
0x400742 <+84>:    mov    -0x8(%rbp),%eax          # copy j to %eax
0x400745 <+87>:    cmp    -0x20(%rbp),%eax         # compare j with cols
0x400748 <+90>:    jl     0x400715 <sumMatrix+39>  # if (j < cols) goto <sumMatrix+39>
0x40074e <+96>:    mov    -0xc(%rbp),%eax          # copy i to %eax
0x400751 <+99>:    cmp    -0x1c(%rbp),%eax         # compare i with rows
0x400754 <+102>:   jl     0x40070c <sumMatrix+30>  # if (i < rows) goto <sumMatrix+30>
0x400756 <+104>:   mov    -0x4(%rbp),%eax          # copy total to %eax
0x400759 <+107>:   pop    %rbp                     # restore %rbp

Once again, variables `i`, `j`, and `total` are at stack addresses `%rbp-0xc`, `%rbp-0x8` and and `%rbp-0x4` respectively. Input parameters `matrix`, `row`, and `cols` are located at stack addresses `%rbp-0x18`, `%rbp-0x1c`, and `%rbp-0x20` respectively. Let’s zoom in on the section that deals specifically with an access to element (i,j), or `matrix[i][j]`:

```0x400715 <+39>:    mov    -0xc(%rbp),%eax          # copy i to %eax
0x400718 <+42>:    cltq                            # convert i to 64-bit integer
0x40071a <+44>:    lea    0x0(,%rax,8),%rdx        # multiply i by 8, place in %rdx
0x400722 <+52>:    mov    -0x18(%rbp),%rax         # copy matrix to %rax
0x400726 <+56>:    add    %rdx,%rax                # add i*8 to matrix, place in %rax
0x400729 <+59>:    mov    (%rax),%rax              # copy matrix[i] to %rax (pointer)```

The five instructions above compute `matrix[i]`, or `*(matrix+i)`. Since `matrix[i]` contains a pointer, `i` is first converted to a 64-bit integer. Then, the compiler multiplies `i` by 8 prior to adding it to `matrix` to calculate the correct address offset (recall that pointers are 8 bytes in size). The instruction at `<sumMatrix+59>` then dereferences the calculated address to get the element `matrix[i]`.

Since `matrix` is an array of `int` pointers, the element located at `matrix[i]` is itself an `int` pointer. The jth element in `matrix[i]` is located at offset `j*4` in the `matrix[i]` array.

The next set of instructions extract the jth element in array `matrix[i]`:

```0x40072c <+62>:    mov    -0x8(%rbp),%edx          # copy j to %edx
0x40072f <+65>:    movslq %edx,%rdx                # convert j to a 64-bit integer
0x400732 <+68>:    shl    \$0x2,%rdx                # multiply j by 4, place in %rdx
0x400736 <+72>:    add    %rdx,%rax                # add j*4 to matrix[i], put in %rax
0x400739 <+75>:    mov    (%rax),%eax              # copy matrix[i][j] to %eax
The first instruction in this snippet loads variable `j` into register `%edx`. The `movslq` instruction at `<sumMatrix+65>` converts `%edx` into a 64-bit integer, storing the result in 64-bit register `%rdx`. The compiler then uses the left shift (`shl`) instruction to multiply `j` by 4 and stores the result in register `%rdx`. The compiler finally adds the resulting value to the address located in `matrix[i]` to get the address of element `matrix[i][j]`. The instructions at `<sumMatrix+75>` and `<sumMatrix+77>` obtain the value at `matrix[i][j]` and add the value to `total`. Note that `M2` starts at memory location x0. The compiler first computes the address of `M2` by multiplying 1 by 8 (`sizeof(int *)`) and adding it to the address of `M2` (x0), yielding the new address x8. A dereference on this address yields the address associated with `M2`, or x36. The compiler then multiplies index 2 by 4 (`sizeof(int)`), and adds the result (8) to x36, yielding a final address of x44. The address x44 is dereferenced, yielding the value 5. Sure enough, the element in Figure 4 that corresponds to `M2` has the value 5.