11.5. Cache Analysis and Valgrind

Because caches significantly influence program performance, most systems provide profiling tools to measure a program's use of the cache. One such tool is Cachegrind, part of the Valgrind suite, which this section uses to evaluate cache performance.

Consider the following program that generates a random N×N matrix:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

/* Allocate an n x n matrix and fill it with random values in [1, max]. */
int **genRandomMatrix(int n, int max) {
    int i, j;
    int **mat = malloc(n * sizeof(int *));

    for (i = 0; i < n; i++) {
        mat[i] = malloc(n * sizeof(int));

        for (j = 0; j < n; j++) {
            mat[i][j] = 1 + rand() % max;
        }
    }

    return mat;
}

/* Free each row of the matrix, then the array of row pointers. */
void free_all(int **mat, int n) {
    int i;

    for (i = 0; i < n; i++) {
        free(mat[i]);
    }

    free(mat);
}

int main(int argc, char **argv) {
    int n;
    int **matrix;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <n>\n", argv[0]);
        fprintf(stderr, "where <n> is the dimension of the matrix\n");
        return 1;
    }

    n = strtol(argv[1], NULL, 10);
    srand(time(NULL));

    matrix = genRandomMatrix(n, 100);

    free_all(matrix, n);
    return 0;
}

Prior sections in this chapter introduced two functions that compute the average of all the elements in a matrix. The two functions differ only in how they index into the matrix:

float averageMat_v1(int **mat, int n) {
    int i, j, total = 0;

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            // Note indexing: [i][j]
            total += mat[i][j];
        }
    }

    return (float) total / (n * n);
}

float averageMat_v2(int **mat, int n) {
    int i, j, total = 0;

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            // Note indexing: [j][i]
            total += mat[j][i];
        }
    }

    return (float) total / (n * n);
}

This section uses cache profiling tools to quantify the differences between them.

11.5.1. A First Cut: Theoretical Analysis and Benchmarking

A theoretical analysis based on locality and the memory hierarchy suggests that the first version exhibits better spatial locality (on matrix mat) because mat is stored in row-major order in memory. The second version has poor spatial locality because it visits the matrix elements in column-major order. Recall that data is loaded into a cache in blocks; traversing the matrix in column-major order touches a different row (and thus, for a large matrix, a different cache block) on every access, which should lead to more cache misses and poorer performance.
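To make this intuition concrete, here is a back-of-the-envelope estimate. It is a minimal sketch that assumes a 64-byte cache block holding 16 four-byte ints (these sizes happen to match the machine profiled later in this section, but they are assumptions, not universal constants) and a matrix too large to fit in the cache:

#include <stdio.h>

int main(void) {
    long n = 1000;                                  /* matrix dimension */
    long reads = n * n;                             /* one read per matrix element */
    long ints_per_block = 64 / (long) sizeof(int);  /* typically 16 */

    /* Row-major: consecutive reads fall in the same block, so only the
     * first access to each block misses. */
    printf("row-major misses:    ~%ld\n", reads / ints_per_block);

    /* Column-major: consecutive reads touch different rows (separately
     * allocated arrays), so nearly every read misses for large n. */
    printf("column-major misses: ~%ld\n", reads);

    return 0;
}

For n = 1000, this predicts roughly 62,500 misses for the row-major traversal and about 1,000,000 for the column-major traversal; keep these numbers in mind when Cachegrind reports actual miss counts later in this section.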

Let’s modify the main function to include calls to the gettimeofday function to accurately measure the difference in performance between the two versions:

int main(int argc, char** argv) {
   /* Validate command line parameters. */
   if (argc != 2) {
       fprintf(stderr, "usage: %s <n>\n", argv[0]);
       fprintf(stderr, "where <n> is the dimension of the matrix\n");
       return 1;
   }

   /* Declare and initialize variables. */
   float res;
   double timer;
   int n = strtol(argv[1], NULL, 10);
   srand(time(NULL));
   struct timeval tstart, tend;
   int **matrix = genRandomMatrix(n, 100);

   /* Time version 1. */
   gettimeofday(&tstart, NULL);
   res = averageMat_v1(matrix, n);
   gettimeofday(&tend, NULL);
   timer = tend.tv_sec - tstart.tv_sec + (tend.tv_usec - tstart.tv_usec)/1.e6;
   printf("v1 average is: %.2f; time is %g\n", res, timer);

   /* Time version 2. */
   gettimeofday(&tstart, NULL);
   res = averageMat_v2(matrix, n);
   gettimeofday(&tend, NULL);
   timer = tend.tv_sec - tstart.tv_sec + (tend.tv_usec - tstart.tv_usec)/1.e6;
   printf("v2 average is: %.2f; time is %g\n", res, timer);

   /* Clean up. */
   free_all(matrix, n);
   return 0;
}
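As an aside, the elapsed-time arithmetic appears twice in this version of main. One way to avoid repeating it is to factor it into a small helper like the following (a hypothetical convenience function, not part of the original program):

#include <sys/time.h>

/* Return the elapsed time, in seconds, between two struct timeval
 * values previously filled in by gettimeofday. */
double elapsedSeconds(struct timeval *start, struct timeval *end) {
    return (end->tv_sec - start->tv_sec)
            + (end->tv_usec - start->tv_usec) / 1.e6;
}

Each timing block then reduces to timer = elapsedSeconds(&tstart, &tend);.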

Compiling the code and running it yields the following result (note that times will vary based on the machine on which it's run):

$ gcc -o cachex cachex.c
$ ./cachex 5000
v1 average is: 50.49; time is 0.053641
v2 average is: 50.49; time is 0.247644

That's a big difference! In this run, the solution using row-major order is roughly 4.6 times faster than the one using column-major order (0.247644 s / 0.053641 s ≈ 4.6)!

11.5.2. Cache Analysis in the Real World: Cachegrind

Analyzing the two solutions theoretically and then timing them verifies that the first version is faster than the second, but it doesn't confirm the details of the cache analysis. Fortunately, the Valgrind suite of tools can help. Earlier in the book, we discussed how Valgrind can help find memory leaks in a program. This section describes Cachegrind, Valgrind's cache simulator, which enables a programmer to study how a program or a particular function affects the cache.

Cachegrind simulates how a program interacts with the computer's cache hierarchy. In many cases, Cachegrind can autodetect a machine's cache organization; when it cannot, the cache parameters can be supplied manually. In either case, Cachegrind simulates two levels: the first level (L1) cache, which it treats as two independent components (the instruction cache and the data cache), and the last level (LL) cache. Cachegrind simulates these two levels because the last level cache has the most important implications for runtime, while the L1 caches have the lowest associativity, making it important to verify that programs interact well with them. These assumptions match the structure of most modern machines.
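When autodetection fails, or when you want to model a machine other than the one you're running on, Cachegrind accepts each cache's total size, associativity, and line size on the command line. For example, the following invocation describes 32 KB, 8-way L1 caches and an 8 MB, 16-way LL cache, all with 64-byte lines (these values are illustrative; they match the caches reported by cg_annotate later in this section):

$ valgrind --tool=cachegrind --I1=32768,8,64 --D1=32768,8,64 \
           --LL=8388608,16,64 ./cachex 1000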

Cachegrind collects and outputs the following information:

  • Instruction cache reads (Ir)

  • L1 instruction cache read misses (I1mr) and LL cache instruction read misses (ILmr)

  • Data cache reads (Dr)

  • D1 cache read misses (D1mr) and LL cache data misses (DLmr)

  • Data cache writes (Dw)

  • D1 cache write misses (D1mw) and LL cache data write misses (DLmw)

Note that the total number of D1 misses is given by D1mr + D1mw, and the total number of LL misses is given by ILmr + DLmr + DLmw. (Likewise, the total number of LL accesses equals the total number of L1 misses, I1mr + D1mr + D1mw, because the LL cache is consulted only when an L1 cache misses.)

Let's see how the code operates under Cachegrind. To run it, execute Valgrind on the compiled code with the following command:

$ valgrind --tool=cachegrind ./cachex 1000

In this invocation, Valgrind's Cachegrind tool acts as a wrapper around the cachex executable. Because simulation slows execution considerably, we choose a smaller matrix size (1,000) to keep the running time manageable. Cachegrind outputs information about the number of cache hits and misses in the overall program:

==28657== Cachegrind, a cache and branch-prediction profiler
==28657== Copyright (C) 2002-2015, and GNU GPL'd by Nicholas Nethercote et al.
==28657== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==28657== Command: ./cachex 1000
==28657==
--28657-- warning: L3 cache found, using its data for the LL simulation.
v1 average is: 50.49; time is 0.080304
v2 average is: 50.49; time is 0.09733
==28657==
==28657== I   refs:      122,626,329
==28657== I1  misses:          1,070
==28657== LLi misses:          1,053
==28657== I1  miss rate:        0.00%
==28657== LLi miss rate:        0.00%
==28657==
==28657== D   refs:       75,292,076  (56,205,598 rd   + 19,086,478 wr)
==28657== D1  misses:      1,192,118  ( 1,129,099 rd   +     63,019 wr)
==28657== LLd misses:         64,399  (     1,543 rd   +     62,856 wr)
==28657== D1  miss rate:         1.6% (       2.0%     +        0.3%  )
==28657== LLd miss rate:         0.1% (       0.0%     +        0.3%  )
==28657==
==28657== LL refs:         1,193,188  ( 1,130,169 rd   +     63,019 wr)
==28657== LL misses:          65,452  (     2,596 rd   +     62,856 wr)
==28657== LL miss rate:          0.0% (       0.0%     +        0.3%  )
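As a sanity check on the formulas given earlier, the summary totals in this run can be recomputed from the individual counts. The following minimal sketch hard-codes the numbers from this particular run (they will differ from machine to machine):

#include <stdio.h>

int main(void) {
    /* Counts copied from the sample Cachegrind run above. */
    long I1mr = 1070,  D1mr = 1129099, D1mw = 63019;
    long ILmr = 1053,  DLmr = 1543,    DLmw = 62856;

    printf("D1 misses: %ld\n", D1mr + D1mw);         /* 1,192,118 */
    printf("LL refs:   %ld\n", I1mr + D1mr + D1mw);  /* 1,193,188 */
    printf("LL misses: %ld\n", ILmr + DLmr + DLmw);  /* 65,452 */

    return 0;
}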

However, this analysis is specifically interested in the hits and misses for the two versions of the averaging function. To view that information, use the Cachegrind tool cg_annotate. Running Cachegrind produces a file in the current working directory named cachegrind.out.n, where n is a process ID number. To run cg_annotate, type the following command (replacing cachegrind.out.28657 with the name of your output file):

$ cg_annotate cachegrind.out.28657

I1 cache:         32768 B, 64 B, 8-way associative
D1 cache:         32768 B, 64 B, 8-way associative
LL cache:         8388608 B, 64 B, 16-way associative
Command:          ./cachex 1000
Data file:        cachegrind.out.28657
Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds:       0.1 100 100 100 100 100 100 100 100
Include dirs:
User annotated:
Auto-annotation:  off

 ----------------------------------------------------------------------------
         Ir  I1mr  ILmr         Dr      D1mr  DLmr         Dw   D1mw   DLmw
 ----------------------------------------------------------------------------
122,626,329 1,070 1,053 56,205,598 1,129,099 1,543 19,086,478 63,019 62,856  PROGRAM TOTALS

 ----------------------------------------------------------------------------
        Ir I1mr ILmr         Dr      D1mr DLmr        Dw   D1mw   DLmw  file:function
 ----------------------------------------------------------------------------
14,009,017    3    3  9,005,008    62,688    0     1,004      0      0  averageMat_v1
14,009,017    0    0  9,005,008 1,062,996    0     1,004      0      0  averageMat_v2

We've edited the output from this command slightly to focus on the two versions of the averaging function. The output shows that version 1 incurs only 62,688 D1 read misses, whereas version 2 incurs 1,062,996, roughly 17 times as many. Notice also how closely these counts track the earlier back-of-the-envelope estimate: version 1's count is just above the predicted 62,500 (one miss per 16-int cache block), while version 2 misses on nearly every one of its one million matrix element reads. Cachegrind provides solid evidence that our analysis is correct!