Part II: [[How expensive are cache misses? - II]]

"Cache miss" is a buzzword that I've heard tossed around in low-latency applications -- but I'm wondering how expensive cache misses actually are. I've seen numbers like these [online](https://gist.github.com/jboner/2841832):

```
L1 cache reference         0.5 ns
L2 cache reference           7 ns
Main memory reference      100 ns
```

But there's a special dopamine hit from seeing theoretical numbers become viscerally real by running experiments.

To give a brief overview: a cache hit is when you ask for something in memory and it is found in the cache. Conversely, a cache miss is when your CPU has to go looking in the next level of the memory hierarchy. For example, an L1 cache hit is when what you asked for is found in the L1 cache; an L1 cache miss is when it cannot be found in the L1 cache, and we have to look in the L2 cache.

## Experiment

I'm using a 2025 MacBook Air with an M4 chip. It has a 2-level cache hierarchy:

```
hw.perflevel0.l1icachesize: 196608 bytes    (192KB)
hw.perflevel0.l1dcachesize: 131072 bytes    (128KB)
hw.perflevel0.l2cachesize:  16777216 bytes  (16MB)
```

We'll design our experiment like this:

1. Make `uint8_t` arrays of sizes varying from 128 bytes to 512 mebibytes.
2. For each array, record the time it takes to randomly access entries. Run this $10^8$ times.
3. Tabulate the average access time.

```c
#include <mach/mach_time.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_SIZES 23          // 128 B up to 512 MiB
#define BASE_SIZE 128
#define ITERATIONS 100000000

int main() {
    size_t sizes[NUM_SIZES];
    for (int i = 0; i < NUM_SIZES; i++) {
        sizes[i] = (size_t)BASE_SIZE << i;
    }

    // Conversion factor from mach_absolute_time() ticks to nanoseconds.
    mach_timebase_info_data_t timebase_info;
    mach_timebase_info(&timebase_info);
    double ns_per_tick = (double)timebase_info.numer / (double)timebase_info.denom;

    for (int s = 0; s < NUM_SIZES; s++) {
        size_t current_size = sizes[s];
        uint8_t *array = (uint8_t *)malloc(current_size);
        /* ... */

        // Fill the array so every page is actually backed by memory.
        for (size_t i = 0; i < current_size; i++) {
            array[i] = rand() % 256;
        }

        // Warm up: touch every element once before timing.
        for (size_t i = 0; i < current_size; i++) {
            volatile uint8_t value = array[i];
        }

        uint64_t total_time = 0;
        for (int i = 0; i < ITERATIONS; i++) {
            // Time a single random access.
            size_t index = rand() % current_size;
            uint64_t start_time = mach_absolute_time();
            volatile uint8_t value = array[index];
            uint64_t end_time = mach_absolute_time();
            total_time += (end_time - start_time);
        }

        double avg_time = (total_time * ns_per_tick) / ITERATIONS;
        printf("size %zu bytes: avg access %.2f ns\n", current_size, avg_time);
        free(array);
    }
    return 0;
}
```

Based on our mental model of how caches work, we should expect to see two jumps in the average access time -- one at the L1d cache boundary and another at the L2 boundary. However, these are the results:

![[Pasted image 20250406124352.png]]

#### How come we don't see two jumps? Is our mental model wrong?

It turns out that `ns_per_tick` on my Mac is `41.67ns`. That's surely not granular enough to observe the effects of an L1 cache miss.

Also -- with `41.67ns` granularity -- how is it possible that we are getting sub-5ns average access times? Apparently, most of the measured time deltas are zero -- meaning it takes less than one tick to access the random entry. This lines up with the granularity hypothesis. Let's plot the percentage of "zero access times" against the total number of iterations:

![[Pasted image 20250406124416.png]]

This explains the steep jump in average access times -- only 75% of the access times are less than a tick.

From this, we can hypothesize that L1 and L2 access times are less than 42ns, and main memory access times are greater than 42ns. Is there a way we can get more granular timestamps than 42ns? Can we count the number of CPU cycles an L1/L2/main-memory access takes?
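One standard way to sidestep a coarse clock -- without finding a finer one -- is to amortize it: time a long chain of dependent accesses with a single pair of timestamps and divide by the chain length. Below is a minimal sketch of that idea; the 64 MiB array size, the chain length, and the single-cycle shuffle are illustrative choices of mine, not measurements or code from this post.

```c
#include <mach/mach_time.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CHAIN_LEN 10000000UL   // number of dependent accesses timed as one batch

int main(void) {
    // Illustrative size: 64 MiB of uint32_t slots, well past the 16MB L2.
    size_t num_slots = (64 * 1024 * 1024) / sizeof(uint32_t);
    uint32_t *next = malloc(num_slots * sizeof(uint32_t));
    if (!next) return 1;

    // Build a random single-cycle permutation (Sattolo's algorithm) so the
    // chain wanders over the whole array instead of settling into a tiny loop.
    for (size_t i = 0; i < num_slots; i++) next[i] = (uint32_t)i;
    for (size_t i = num_slots - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   // j in [0, i)
        uint32_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    mach_timebase_info_data_t tb;
    mach_timebase_info(&tb);
    double ns_per_tick = (double)tb.numer / (double)tb.denom;

    // One timestamp pair around CHAIN_LEN loads: the ~42ns tick is amortized
    // over the whole batch, and each load's address depends on the previous
    // load's value, so the CPU can't overlap the misses.
    uint32_t idx = 0;
    uint64_t start = mach_absolute_time();
    for (size_t i = 0; i < CHAIN_LEN; i++) {
        idx = next[idx];
    }
    uint64_t end = mach_absolute_time();

    double avg_ns = (double)(end - start) * ns_per_tick / CHAIN_LEN;
    // Printing idx keeps the compiler from deleting the loop as dead code.
    printf("avg dependent access: %.2f ns (final index %u)\n", avg_ns, idx);

    free(next);
    return 0;
}
```

This doesn't give finer timestamps -- it only measures an average over many accesses -- and because every access is made dependent (and, at this array size, likely a miss) by construction, it answers a slightly different question than timing individual random loads. But it's one way to resolve per-access numbers well below the 42ns tick.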
To be continued.

## References

- [What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf)
- [`mach_absolute_time`](https://developer.apple.com/documentation/kernel/1462446-mach_absolute_time)