12.3. Memory Considerations
Programmers should pay special attention to memory use, especially when employing memory-intensive data structures such as matrices and arrays. While compilers offer powerful optimization features, they cannot always make the optimizations that improve a program’s memory use. In this section, we use an implementation of a matrix-vector program (matrixVector.c) to guide discussion of techniques and tools for improving memory use.
The main() function of the program performs two steps. First, it allocates and initializes the input matrix, the input vector, and the output matrix. Next, it performs matrix-vector multiplication. Running the code on matrix-vector dimensions of 10,000 x 10,000 reveals that the matrixVectorMultiply() function takes up the majority of the run time:
$ gcc -o matrixVector matrixVector.c
$ ./matrixVector 10000 10000
Time to allocate and fill matrices: 1.2827
Time to allocate vector: 9.6e-05
Time to matrix-vector multiply: 1.98402
Our discussion will thus focus on the matrixVectorMultiply() function.
12.3.1. Loop interchange
Loop interchange optimizations switch the order of the inner and outer loops of a nested loop in order to maximize cache locality. Performing this transformation automatically is difficult for compilers. In gcc, the -floop-interchange compiler flag exists but is not enabled by default. For this reason, it is a good idea for programmers to pay attention to how their code accesses composite data structures like arrays and matrices. As an example, let’s take a closer look at the matrixVectorMultiply() function in matrixVector.c.
The original version (matrixVector.c) and the loop interchange version (matrixVector2.c) are compared below.
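The following is a sketch of the two versions. The discussion below implies that the function multiplies each matrix element by the corresponding vector element (both res and m are matrices), so these bodies are a reconstruction; the originals in matrixVector.c and matrixVector2.c may differ in detail.

Original version (matrixVector.c):

void matrixVectorMultiply(int **m, int *v, int **res, int rows, int cols) {
    int i, j;
    //inner loop steps down a column, touching a
    //different (non-adjacent) row on every access
    for (j = 0; j < cols; j++) {
        for (i = 0; i < rows; i++) {
            res[i][j] = m[i][j] * v[j];
        }
    }
}

Loop interchange version (matrixVector2.c):

void matrixVectorMultiply(int **m, int *v, int **res, int rows, int cols) {
    int i, j;
    //inner loop walks along a contiguous row
    for (i = 0; i < rows; i++) {
        for (j = 0; j < cols; j++) {
            res[i][j] = m[i][j] * v[j];
        }
    }
}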
The input and output matrices are dynamically allocated using the second method discussed in the C chapter. As a result, the rows of each matrix are not contiguous with one another, though the elements within each row are contiguous.
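A sketch of what this allocation scheme looks like; allocateArray() is the helper that appears in the massif output later in this section, while allocateMatrix() is a hypothetical wrapper name used here for illustration:

#include <stdlib.h>

//allocate one contiguous array of ints (sketch)
int *allocateArray(int len) {
    return malloc(len * sizeof(int));
}

//build a matrix as an array of row pointers: each row is
//contiguous internally, but the rows themselves may land
//anywhere on the heap
int **allocateMatrix(int rows, int cols) {
    int i;
    int **m = malloc(rows * sizeof(int *));
    for (i = 0; i < rows; i++) {
        m[i] = allocateArray(cols);
    }
    return m;
}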
The ordering of the loops causes the program to cycle down each column instead of across each row. Recall that data is loaded into the cache in blocks, not as individual elements. As a result, when an element x of an array in either res or m is accessed, the elements adjacent to x are loaded into the cache along with it. Cycling through every "column" of the matrix therefore causes more cache misses, as the cache is forced to load a new block on nearly every access. Table 2 shows that adding optimization flags alone does not decrease the run time of the function. However, simply switching the order of the loops (as shown above and in matrixVector2.c) makes the function nearly eight times faster and allows the compiler to perform additional optimizations.
Table 2. Run time (in seconds) of matrixVectorMultiply():

Version                                 | Unoptimized | -O1  | -O2  | -O3
----------------------------------------|-------------|------|------|------
Original (matrixVector.c)               | 2.01        | 2.05 | 2.07 | 2.08
With Loop Interchange (matrixVector2.c) | 0.27        | 0.08 | 0.06 | 0.06
The Valgrind tool cachegrind (discussed in Chapter 11) is a great way to identify data locality issues, and it reveals the difference in cache access behavior between the two versions of the matrixVectorMultiply() function shown above.
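For example, running both executables under cachegrind makes the difference in miss rates visible (output omitted here; the exact counts depend on your machine’s cache configuration):

$ valgrind --tool=cachegrind ./matrixVector 10000 10000
$ valgrind --tool=cachegrind ./matrixVector2 10000 10000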
12.3.2. Some other compiler optimizations for improving locality: fission and fusion
Re-running the improved program on 10,000 x 10,000 elements yields the following run-time numbers:
$ gcc -o matrixVector2 matrixVector2.c
$ ./matrixVector2 10000 10000
Time to allocate and fill matrices: 1.29203
Time to allocate vector: 0.000107
Time to matrix-vector multiply: 0.271369
Now, allocating and filling the matrices takes the most time. Additional timing reveals that it is the filling of the matrices that in fact takes most of that time. Let’s take a closer look at that code:
//fill matrices
for (i = 0; i < rows; i++){
    fillArrayRandom(matrix[i], cols);
    fillArrayZeros(result[i], cols);
}
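Only the names of the two helper functions appear in this section; their bodies might look roughly like the following sketch (the actual implementations may differ):

#include <stdlib.h>

//fill an array with pseudo-random values (sketch)
void fillArrayRandom(int *arr, int len) {
    int i;
    for (i = 0; i < len; i++) {
        arr[i] = rand();
    }
}

//fill an array with zeros (sketch)
void fillArrayZeros(int *arr, int len) {
    int i;
    for (i = 0; i < len; i++) {
        arr[i] = 0;
    }
}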
To fill the input and output matrices, a for loop cycles through all the rows and calls the fillArrayRandom() and fillArrayZeros() functions on the corresponding row of each matrix. In some scenarios, it may be advantageous for the compiler to split this single loop into two separate loops, a transformation known as loop fission, as shown in Table 3.
Table 3. The original fill loop and its loop fission rewrite:
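Both versions below are direct rewrites of the fill loop shown above:

Original version:

//one loop fills a row of each matrix per iteration
for (i = 0; i < rows; i++){
    fillArrayRandom(matrix[i], cols);
    fillArrayZeros(result[i], cols);
}

With loop fission:

//first loop fills the input matrix...
for (i = 0; i < rows; i++){
    fillArrayRandom(matrix[i], cols);
}
//...second loop zeroes the result matrix
for (i = 0; i < rows; i++){
    fillArrayZeros(result[i], cols);
}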
The process of taking two loops that operate over the same range and combining their contents into a single loop (i.e., the opposite of loop fission) is called loop fusion; in our example, fusing the two loops in Table 3 would recover the original single loop. Loop fission and fusion are examples of optimizations a compiler might perform to try to improve data locality. Compilers for multicore processors may also use loop fission or loop fusion to enable loops to execute efficiently on multiple cores. For example, a compiler may use loop fission to assign two loops to different cores. Likewise, a compiler may use loop fusion to combine dependent operations into the body of a single loop and distribute a subset of the loop iterations to each core (assuming the data in different iterations are independent).
In our case, applying loop fission manually does not directly improve program performance; there is virtually no change in the amount of time required to fill the arrays. However, it can make a more subtle optimization clear: the loop containing fillArrayZeros() is not necessary at all. The matrixVectorMultiply() function assigns a value to every element of the result array, so a prior initialization to all zeros is unnecessary.
The previous version (matrixVector2.c) and the updated version with calloc() (matrixVector3.c) are compared below.
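A sketch of the change (the allocation code here is an assumption; what matters is that calloc() returns memory that is already zeroed, so the explicit fillArrayZeros() loop can be dropped):

Previous version (matrixVector2.c):

//each row allocated with malloc, then explicitly zeroed
for (i = 0; i < rows; i++){
    result[i] = malloc(cols * sizeof(int));
}
for (i = 0; i < rows; i++){
    fillArrayRandom(matrix[i], cols);
    fillArrayZeros(result[i], cols);
}

Updated version with calloc() (matrixVector3.c):

//calloc returns zeroed memory, so only the input
//matrix still needs to be filled
for (i = 0; i < rows; i++){
    result[i] = calloc(cols, sizeof(int));
}
for (i = 0; i < rows; i++){
    fillArrayRandom(matrix[i], cols);
}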
12.3.3. Memory Profiling with Massif
Making the above change results in only a slight decrease in run time. While it eliminates the step of filling all the elements of the result matrix with zeros, a significant amount of time is still required to fill the input matrix with random numbers:
$ gcc -o matrixVector3 matrixVector3.c
$ ./matrixVector3 10000 10000
Time to allocate matrices: 0.049073
Time to fill matrices: 0.946801
Time to allocate vector: 9.3e-05
Time to matrix-vector multiply: 0.359525
Even though the rows of each matrix are not stored contiguously with one another in memory, each row array takes up 10,000 x sizeof(int) bytes, or 40,000 bytes (assuming 4-byte ints). Since a total of 20,000 such arrays are allocated (10,000 each for the input matrix and the result matrix), this corresponds to 800 million bytes, or roughly 762 MB of space. Filling 762 MB with random numbers understandably takes a lot of time. With matrices, memory use increases quadratically with the input size, and can play a large role in performance.
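As a quick sanity check of that arithmetic, a standalone sketch (the 4-byte int size is an assumption that holds on most modern platforms):

#include <stdio.h>

int main(void) {
    long n = 10000;
    long bytes_per_row = n * sizeof(int);  // 40,000 bytes per row
    long total = 2 * n * bytes_per_row;    // input + result matrices
    printf("%ld bytes = %.1f MB\n", total, total / (1024.0 * 1024.0));
    // prints: 800000000 bytes = 762.9 MB
    return 0;
}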
Valgrind’s massif tool can help you profile memory use. Like the other Valgrind tools covered in this book (memcheck, cachegrind, and callgrind), massif runs as a wrapper around a program’s executable. Specifically, massif takes snapshots of the program’s memory use throughout its execution and profiles how memory use fluctuates. Programmers may find massif useful for tracking how their programs use heap memory, and for identifying opportunities to improve memory use. Let’s run the massif tool on the matrixVector3 executable:
$ valgrind --tool=massif ./matrixVector3 10000 10000
==7030== Massif, a heap profiler
==7030== Copyright (C) 2003-2015, and GNU GPL'd, by Nicholas Nethercote
==7030== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==7030== Command: ./matrixVector3 10000 10000
==7030==
Time to allocate matrices: 0.049511
Time to fill matrices: 4.31627
Time to allocate vector: 0.001015
Time to matrix-vector multiply: 0.62672
==7030==
Running massif produces a massif.out.xxxx file, where xxxx is a unique id number. If you are typing along, type ls to reveal your corresponding massif file. In the example below, the corresponding file is massif.out.7030. Use the ms_print command to view the massif output:
$ ms_print massif.out.7030
--------------------------------------------------------------------------------
Command:            ./matrixVector3 10000 10000
Massif arguments:   (none)
ms_print arguments: massif.out.7030
--------------------------------------------------------------------------------

[memory use graph: the y-axis tops out at 763.3 MB, the x-axis runs from 0 to
9.778 Gi instructions, and heap use climbs quickly to its peak and stays there]

Number of snapshots: 80
 Detailed snapshots: [3, 12, 17, 22, 49, 59, 69, 79 (peak)]
At the top of the output is the memory use graph. The x-axis shows the number of instructions executed, and the y-axis shows memory use. The graph indicates that a total of 9.778 billion (Gi) instructions executed during our run of matrixVector3. During execution, massif took a total of 80 snapshots to measure use on the heap. Memory use peaked in the last snapshot (79); peak memory use for the program was 763.3 MB, and memory use stayed relatively constant throughout the program. Summaries of all the snapshots appear after the graph. For example, the table below corresponds to the snapshots around snapshot 79:
....
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 70      1,081,926      727,225,400      727,080,000       145,400            0
 71      1,095,494      737,467,448      737,320,000       147,448            0
 72      1,109,062      747,709,496      747,560,000       149,496            0
 73      1,122,630      757,951,544      757,800,000       151,544            0
 74      1,136,198      768,193,592      768,040,000       153,592            0
 75      1,149,766      778,435,640      778,280,000       155,640            0
 76      1,163,334      788,677,688      788,520,000       157,688            0
 77      1,176,902      798,919,736      798,760,000       159,736            0
 78  7,198,260,935      800,361,056      800,201,024       160,032            0
 79 10,499,078,349      800,361,056      800,201,024       160,032            0
99.98% (800,201,024B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.96% (800,040,000B) 0x40089D: allocateArray (in matrixVector3)
Each row corresponds to a particular snapshot: the time it was taken (measured in instructions), the total heap memory consumption in bytes at that point, the number of bytes the program requested ("useful-heap"), the number of bytes allocated in excess of what the program asked for ("extra-heap"), and the size of the stack. By default, stack profiling is off, as it slows massif down significantly. To enable stack profiling, use the --stacks=yes option when running massif.
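For example (expect the run to be noticeably slower with stack profiling on):

$ valgrind --tool=massif --stacks=yes ./matrixVector3 10000 10000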
The massif output reveals that 99.96% of the program’s heap memory use occurs in the allocateArray() function, and that a total of roughly 800 million bytes are allocated, consistent with the back-of-the-envelope calculation we performed above. Readers will likely find massif a useful tool for identifying areas of high heap memory use in their programs, which often slow a program down.
For example, memory leaks can occur when programmers repeatedly call malloc() without calling free() at the first opportunity where it is safe to do so. The massif tool is incredibly useful for detecting such leaks.
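For instance, here is a contrived pattern that massif’s graph would expose as steadily climbing heap use (a hypothetical program, not part of matrixVector3.c):

#include <stdlib.h>

int main(void) {
    int i;
    for (i = 0; i < 100; i++) {
        //each iteration allocates 40,000 bytes...
        int *buf = malloc(10000 * sizeof(int));
        buf[0] = i;
        //...but never frees them: the pointer is lost when
        //the iteration ends, leaking the allocation
    }
    return 0;
}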