11.5. Cache Analysis and Valgrind
Because caches significantly influence program performance, most systems
provide profiling tools to measure a program’s use of the cache. One such tool
is Valgrind’s cachegrind
mode, which this section uses to evaluate cache
performance.
Consider the following program that generates a random N×N matrix:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>
int **genRandomMatrix(int n, int max) {
int i, j;
int **mat = malloc(n * sizeof(int *));
for (i = 0; i < n; i++) {
mat[i] = malloc(n * sizeof(int));
for (j = 0; j < n; j++) {
mat[i][j] = 1 + rand() % max;
}
}
return mat;
}
void free_all(int **mat, int n) {
int i;
for (i = 0; i < n; i++) {
free(mat[i]);
}
free(mat);
}
int main(int argc, char **argv) {
int i, n;
int **matrix;
if (argc != 2) {
fprintf(stderr, "usage: %s <n>\n", argv[0]);
fprintf(stderr, "where <n> is the dimension of the matrix\n");
return 1;
}
n = strtol(argv[1], NULL, 10);
srand(time(NULL));
matrix = genRandomMatrix(n, 100);
free_all(matrix, n);
return 0;
}
Prior sections in this chapter introduced two functions for averaging every element of a matrix. They differ only in the way they index into the matrix:
|
|
This section uses cache profiling tools to quantify the differences between them.
11.5.1. A First Cut: Theoretical Analysis and Benchmarking
A theoretical analysis based on locality and the memory hierarchy suggests that
the first version exhibits better spatial locality (on matrix mat
) due to
the fact that mat
is stored in
row-major order in memory. The
second solution has poor spatial locality because each element in the matrix is
visited in column-major order. Recall that data is loaded into a cache in
blocks. Traversing the matrix in column-major order will likely lead to more
cache misses, resulting in poorer performance.
Let’s modify the main function to include calls to the gettimeofday
function to accurately measure the difference in performance between the two
versions:
int main(int argc, char** argv) {
/* Validate command line parameters. */
if (argc != 2) {
fprintf(stderr, "usage: %s <n>\n", argv[0]);
fprintf(stderr, "where <n> is the dimension of the matrix\n");
return 1;
}
/* Declare and initialize variables. */
int i;
float res;
double timer;
int n = strtol(argv[1], NULL, 10);
srand(time(NULL));
struct timeval tstart, tend;
int ** matrix = genRandomMatrix(n, 100);
/* Time version 1. */
gettimeofday(&tstart, NULL);
res = averageMat_v1(matrix, n);
gettimeofday(&tend, NULL);
timer = tend.tv_sec - tstart.tv_sec + (tend.tv_usec - tstart.tv_usec)/1.e6;
printf("v1 average is: %.2f; time is %g\n", res, timer);
/* Time version 2. */
gettimeofday(&tstart, NULL);
res = averageMat_v2(matrix, n);
gettimeofday(&tend, NULL);
timer = tend.tv_sec - tstart.tv_sec + (tend.tv_usec - tstart.tv_usec)/1.e6;
printf("v2 average is: %.2f; time is %g\n", res, timer);
/* Clean up. */
free_all(matrix, n);
return 0;
}
Compiling the code and running it yields the following result (note that times will vary based on the on which machine it’s run):
$ gcc -o cachex cachex.c $ ./cachex 5000 v1 average is: 50.49; time is 0.053641 v2 average is: 50.49; time is 0.247644
That’s a big difference! In essence, the solution using row-major order is 4.61 times faster than the second one!
11.5.2. Cache Analysis in the Real World: Cachegrind
Theoretically analyzing the two solutions and then running them verifies that the first version is faster than the second. However, it doesn’t confirm the details of the cache analysis. Fortunately, the Valgrind suite of tools can help. Earlier in the book, we discussed how Valgrind can help find memory leaks in a program. This section describes Cachegrind, Valgrind’s cache simulator. Cachegrind enables a programmer to study how a program or particular function affects the cache.
Cachegrind simulates how a program interacts with the computer’s cache hierarchy. In many cases, Cachegrind can autodetect the cache organization of a machine. In the cases that it cannot, Cachegrind still simulates the first level (L1) cache and the last level (LL) cache. It assumes the first level cache has two independent components: the instruction cache and the data cache. The reason for this is that the last level cache has the most important implications for runtime. L1 caches also have the lowest level of associativity, so it’s important to ensure that programs interact well with it. These assumptions match the structure of most modern machines.
Cachegrind collects and outputs the following information:
-
Instruction cache reads (
Ir
) -
L1 instruction cache read misses (
I1mr
) and LL cache instruction read misses (ILmr
) -
Data cache reads (
Dr
) -
D1 cache read misses (
D1mr
) and LL cache data misses (DLmr
) -
Data cache writes (
Dw
) -
D1 cache write misses (
D1mw
) and LL cache data write misses (DLmw
)
Note that D1 total access is computed by D1 = D1mr + D1mw
and LL total access
is given by ILmr + DLmr + DLmw
.
Let’s see how well version 1 of the code operates under Cachegrind. To run it, execute Valgrind on the compiled code with the following command:
$ valgrind --tool=cachegrind --cache-sim=yes ./cachex 1000
In this invocation, Valgrind’s cachegrind
tool acts as a wrapper around
the cachex
executable. Choosing a smaller matrix size for Cachegrind aids
in the speed of execution. Cachegrind outputs information about the number of
cache hits and misses in the overall program:
==28657== Cachegrind, a cache and branch-prediction profiler ==28657== Copyright (C) 2002-2017, and GNU GPL'd by Nicholas Nethercote et al. ==28657== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info ==28657== Command: ./cachex 1000 ==28657== --28657-- warning: L3 cache found, using its data for the LL simulation. average is: 50.49; time is 0.080304 average is: 50.49; time is 0.09733 ==28657== ==28657== I refs: 122,626,329 ==28657== I1 misses: 1,070 ==28657== LLi misses: 1,053 ==28657== I1 miss rate: 0.00% ==28657== LLi miss rate: 0.00% ==28657== ==28657== D refs: 75,292,076 (56,205,598 rd + 19,086,478 wr) ==28657== D1 misses: 1,192,118 ( 1,129,099 rd + 63,019 wr) ==28657== LLd misses: 64,399 ( 1,543 rd + 62,856 wr) ==28657== D1 miss rate: 1.6% ( 2.0% + 0.3% ) ==28657== LLd miss rate: 0.1% ( 0.0% + 0.3% ) ==28657== ==28657== LL refs: 1,193,188 ( 1,130,169 rd + 63,019 wr) ==28657== LL misses: 65,452 ( 2,596 rd + 62,856 wr) ==28657== LL miss rate: 0.0% ( 0.0% + 0.3% )
However, this analysis is interested specifically in the hits and misses for the two
versions of this averaging function. To view that information, use the
Cachegrind tool cg_annotate
. Running Cachegrind should have produced a file
in the current working directory that looks similar to cachegrind.out.n
,
where n
is some process ID number. To run cg_annotate
, type in the following
command (replacing cachegrind.out.28657
with the name of the output file):
$ cg_annotate cachegrind.out.28657 I1 cache: 32768 B, 64 B, 8-way associative D1 cache: 32768 B, 64 B, 8-way associative LL cache: 8388608 B, 64 B, 16-way associative Command: ./cachex 1000 Data file: cachegrind.out.28657 Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Thresholds: 0.1 100 100 100 100 100 100 100 100 Include dirs: User annotated: Auto-annotation: off ---------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw ---------------------------------------------------------------------------- 122,626,329 1,070 1,053 56,205,598 1,129,099 1,543 19,086,478 63,019 62,856 PROGRAM TOTALS ---------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function ---------------------------------------------------------------------------- 14,009,017 3 3 9,005,008 62,688 0 1,004 0 0 averageMat_v1 14,009,017 0 0 9,005,008 1,062,996 0 1,004 0 0 averageMat_v2
We’ve edited the output from this command slightly to focus on the two
versions of the average function. This output shows that
version 2 yields 1,062,996 data misses, compared to only 62,688
misses in
version 1. Cachegrind provides solid proof that our analysis is correct!