Dynamic Programming & Spatial Locality

Aman Mulani
5 min read

Introduction

Computers use a hierarchy of memory systems to balance speed and capacity. The fastest memory (cache) sits closest to the CPU but is limited in size. When data isn't found in cache, it must be fetched from slower memory, creating a "cache miss" that significantly slows execution. In this article, we'll explore how spatial locality can have a significant impact on the performance of your algorithms, using two dynamic programming techniques as a case study.

Glossary

  • L1 Cache (D1): Smallest, fastest cache directly connected to the CPU

  • Last-Level Cache (LL): Larger but slower cache before accessing main memory

  • Cache Lines: Data is loaded in fixed-size blocks (typically 64 bytes), not individual values

  • Spatial Locality: After a program accesses one memory location, it is likely to access nearby locations soon, so keeping those locations in cache speeds up subsequent operations.
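
To make these definitions concrete, here is a minimal sketch (not taken from this article's benchmark code) that contrasts a sequential walk with a strided walk over the same array. With 8-byte long long values and 64-byte cache lines, the sequential version reuses each loaded line for eight consecutive elements, while the strided version gives that reuse away:

    #include <cstddef>
    #include <vector>

    // Sequential walk: one cache-line load serves eight consecutive
    // long long values, so most accesses hit in the D1 cache.
    long long sum_sequential(const std::vector<long long>& a) {
        long long sum = 0;
        for (std::size_t i = 0; i < a.size(); i++) {
            sum += a[i];
        }
        return sum;
    }

    // Strided walk over the same data: stepping a whole cache line
    // (8 elements) at a time touches a new line on almost every access,
    // discarding the spatial locality the prefetcher relies on.
    long long sum_strided(const std::vector<long long>& a) {
        long long sum = 0;
        for (std::size_t start = 0; start < 8; start++) {
            for (std::size_t i = start; i < a.size(); i += 8) {
                sum += a[i];
            }
        }
        return sum;
    }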

The Two DP Approaches

Our example calculates Fibonacci numbers using two approaches:

  1. Memoization: Stores previously calculated values in an array but accesses them in a somewhat random pattern through recursion.

     long long fib_memoization_array(int n, std::vector<long long>& memo) {
         if (n <= 1) return n;
    
         if (memo[n] != -1) {
             return memo[n];
         }
    
         memo[n] = fib_memoization_array(n-1, memo) + fib_memoization_array(n-2, memo);
         return memo[n];
     }
    
  2. Tabulation: Builds the solution incrementally, accessing memory in a highly sequential pattern.

     long long fib_tabulation(int n) {
         if (n <= 1) return n;
    
         std::vector<long long> dp(n+1, 0);
         dp[0] = 0;
         dp[1] = 1;
    
         // Sequential access pattern - excellent spatial locality
         for (int i = 2; i <= n; i++) {
             dp[i] = dp[i-1] + dp[i-2];
         }
    
         return dp[n];
     }
    

Both solve the same problem, but their memory access patterns differ dramatically.
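
The article's benchmark binary (access_test) is not shown, but a minimal driver for the two functions above might look like the sketch below; the value of n and the printing are illustrative assumptions, and the real benchmark presumably used a much larger input:

    #include <iostream>
    #include <vector>

    // Assumes fib_memoization_array and fib_tabulation from above are
    // defined in the same translation unit.
    int main() {
        const int n = 90;  // fib(90) still fits in a signed 64-bit long long

        // Memoization needs its cache pre-filled with a sentinel value (-1).
        std::vector<long long> memo(n + 1, -1);
        std::cout << "memoization: " << fib_memoization_array(n, memo) << "\n";
        std::cout << "tabulation:  " << fib_tabulation(n) << "\n";
        return 0;
    }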

Setup

For our cache performance experiments, we're using Docker to run Valgrind's cachegrind tool on macOS with an M3 chip. Since Valgrind does not support Apple Silicon, we used Docker to spin up a Linux container.
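
For reference, the setup could look roughly like the following; the base image, package commands, and the source file name access_test.cpp are assumptions rather than details from the article:

    # start a throwaway Linux container with the project directory mounted
    docker run --rm -it -v "$PWD":/workspace -w /workspace ubuntu:22.04 bash

    # inside the container: install a compiler and Valgrind, build, and profile
    apt-get update && apt-get install -y g++ valgrind
    g++ -O2 -o access_test access_test.cpp
    valgrind --tool=cachegrind ./access_test tabulation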

Analyzing Cache Performance

Memoization

root@62969aff3a5f:/workspace# valgrind --main-stacksize=10000000000 --tool=cachegrind \
                                ./access_test memoization

Time taken for 1 runs: 20404 ms
==224== 
==224== I   refs:      4,251,823,292
==224== I1  misses:            2,032
==224== LLi misses:            1,674
==224== I1  miss rate:          0.00%
==224== LLi miss rate:          0.00%
==224== 
==224== D   refs:      2,950,712,351  (1,600,511,544 rd   + 1,350,200,807 wr)
==224== D1  misses:      187,518,436  (   87,515,735 rd   +   100,002,701 wr)
==224== LLd misses:      187,505,766  (   87,504,607 rd   +   100,001,159 wr)
==224== D1  miss rate:           6.4% (          5.5%     +           7.4%  )
==224== LLd miss rate:           6.4% (          5.5%     +           7.4%  )
==224== 
==224== LL refs:         187,520,468  (   87,517,767 rd   +   100,002,701 wr)
==224== LL misses:       187,507,440  (   87,506,281 rd   +   100,001,159 wr)
==224== LL miss rate:            2.6% (          1.5%     +           7.4%  )

Tabulation

root@62969aff3a5f:/workspace# valgrind --tool=cachegrind ./access_test tabulation
Time taken for 1 runs: 1620 ms
==225== 
==225== I   refs:      751,823,239
==225== I1  misses:          2,031
==225== LLi misses:          1,673
==225== I1  miss rate:        0.00%
==225== LLi miss rate:        0.00%
==225== 
==225== D   refs:      250,712,360  (100,511,549 rd   + 150,200,811 wr)
==225== D1  misses:     25,018,742  (     15,990 rd   +  25,002,752 wr)
==225== LLd misses:     25,010,403  (      8,689 rd   +  25,001,714 wr)
==225== D1  miss rate:        10.0% (        0.0%     +        16.6%  )
==225== LLd miss rate:        10.0% (        0.0%     +        16.6%  )
==225== 
==225== LL refs:        25,020,773  (     18,021 rd   +  25,002,752 wr)
==225== LL misses:      25,012,076  (     10,362 rd   +  25,001,714 wr)
==225== LL miss rate:          2.5% (        0.0%     +        16.6%  )

Let's examine the cachegrind output:

Execution Time Comparison

  • Memoization: 20,404 ms

  • Tabulation: 1,620 ms (12.6× faster!)

Cache Statistics Breakdown

Metric              Memoization      Tabulation      Difference
Instructions        4.25 billion     751 million     5.7× fewer
Data references     2.95 billion     250 million     11.8× fewer
D1 cache misses     187.5 million    25 million      7.5× fewer
LL cache misses     187.5 million    25 million      7.5× fewer

While tabulation shows a higher D1 miss rate (10.0% vs 6.4%), its absolute number of cache misses is dramatically lower, because it performs far fewer memory operations overall. The memoization approach incurs roughly 7.5× more D1 cache misses, which can make a big difference in highly latency-sensitive applications.

The Spatial Locality Advantage

Tabulation outperforms memoization because:

  1. Sequential Access Pattern: Tabulation accesses array elements in order (dp[0], dp[1], dp[2]...), which aligns perfectly with how cache lines are loaded. The simple loop also avoids the function-call overhead of recursion, which is why it executes far fewer instructions and memory accesses.

  2. More Efficient Cache Usage: When a cache line is loaded for dp[i], it also contains dp[i+1], dp[i+2], etc., which are used immediately afterwards; with memoization's recursive call order, a loaded line is not reused as reliably (see the rough arithmetic below).
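
As rough arithmetic for the second point: a long long is 8 bytes and a cache line is 64 bytes, so each line loaded for the dp array covers eight consecutive entries. In a purely sequential sweep, at most about one in eight dp accesses can therefore be a cold miss; the rest are effectively free hits on data already sitting in the D1 cache.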

Cache Misses Explained

In the memoization approach, the recursive calls create a less predictable access pattern. When calculating fib(n), the recursion descends all the way to the base cases before intermediate results are read back from the memo array, and every level adds its own stack frame, so memory accesses are scattered across memo[] and the call stack rather than forming the simple forward sweep that the cache's prefetching mechanisms are built for.

Each cache miss can cost 100-300 CPU cycles compared to 1-3 cycles for a cache hit. With memoization generating 7.5× more cache misses, the performance impact is enormous.
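
As a back-of-envelope estimate (assuming roughly 200 cycles per miss): memoization's ~187.5 million data-cache misses correspond to on the order of 37 billion stall cycles, while tabulation's ~25 million misses correspond to roughly 5 billion, so the miss penalty alone explains a large share of the measured gap.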

Conclusion

This example demonstrates why understanding spatial locality matters for performance-critical code. Though both algorithms have the same mathematical complexity, the memoization approach's inferior cache utilisation and extensive call stack usage make it over 12 times slower.

When designing algorithms, especially for dynamic programming problems, consider not just the time complexity but also how your data access patterns interact with the cache. Sequential access patterns will almost always outperform scattered access, particularly for large datasets that exceed cache size.

P.S. For those wondering why the cover photo of the article features lilies, here’s an extra byte for you: Lilies have a Fibonacci number of petals!
