Part I: [[How expensive are cache misses? - I]]
Last week, I asked [[How expensive are cache misses? - I]] - and ran into a small problem with the granularity of my measurement tool. Apparently, on my 2025 MacBook Air, I can only measure in `41.67ns` increments.
## Why is the time granularity `41.67ns`?
`41.67ns` is oddly specific and seems somewhat arbitrary. Where does the number come from?
`hw.tbfrequency` gives the timebase frequency of my processor. Running `sysctl hw.tbfrequency` spits out `24000000`, which means the timebase counter ticks 24 million times per second - which is another way of saying it ticks once every `1/24,000,000 s ≈ 41.67ns`.
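To double-check the arithmetic, here's a minimal sketch in C (assuming `sysctlbyname` hands the value back as a 64-bit integer) that reads the same sysctl and converts it into a tick period:
```c
#include <stdio.h>
#include <stdint.h>
#include <sys/sysctl.h>

int main(void) {
    uint64_t tbfreq = 0;
    size_t len = sizeof(tbfreq);
    // Read the same value that `sysctl hw.tbfrequency` prints.
    if (sysctlbyname("hw.tbfrequency", &tbfreq, &len, NULL, 0) != 0) {
        perror("sysctlbyname");
        return 1;
    }
    // 24,000,000 ticks per second -> 1e9 / 24e6 ≈ 41.67 ns per tick.
    printf("hw.tbfrequency = %llu Hz\n", (unsigned long long)tbfreq);
    printf("tick period    = %.2f ns\n", 1e9 / (double)tbfreq);
    return 0;
}
```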
That's one easy puzzle out of the way.
## Can we actually observe L1 cache misses somehow?
I started up Instruments on my Mac, and set about measuring the performance counters. If we were running Linux, then we could just fire up `perf` - but I'm too lazy to install Linux at the moment.
I selected three counters: `L1D_CACHE_MISS_LD`, `L1D_CACHE_MISS_LD_NONSPEC`, and `FIXED_CYCLES`.
- `L1D_CACHE_MISS_LD` = Loads that miss the L1 data cache.
- `L1D_CACHE_MISS_LD_NONSPEC` = Retired loads that miss the L1 data cache.
- `FIXED_CYCLES` = Number of CPU cycles.
### What is a "retired load"?
A retired load is a load instruction that has retired - that is, one the CPU finished executing and whose result it committed.
Apparently, not every instruction the CPU starts ends up retiring. The CPU performs "speculative execution", an optimization technique where it guesses what work it will need and executes it before knowing whether the work is actually needed. Instructions that were guessed correctly - the work really was needed - are the ones that retire; mispredicted work is thrown away.
In the context of L1 data cache misses, `L1D_CACHE_MISS_LD` counts every load that missed the L1 cache, including speculative ones, while `L1D_CACHE_MISS_LD_NONSPEC` counts only the subset of those loads that actually retired.
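To make the distinction concrete, here's a hypothetical snippet (not from my benchmark) where speculation can produce exactly that gap between the two counters:
```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical illustration: keep[] makes the branch hard to predict, so the
// CPU may issue the load of array[idx[i]] speculatively down the "taken"
// path. If the guess was wrong, that load still probed the cache (so it can
// count toward L1D_CACHE_MISS_LD) but never retires, so it is excluded from
// L1D_CACHE_MISS_LD_NONSPEC.
uint64_t sum_selected(const uint8_t *array, const size_t *idx,
                      const uint8_t *keep, size_t n) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        if (keep[i]) {            // unpredictable branch
            acc += array[idx[i]]; // load that may execute speculatively
        }
    }
    return acc;
}
```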
### Results from Instruments
Using the disassembler, I identified the instruction that corresponds to the load command in my code:
```c
for (size_t i = 0; i < 1000000; i++) {
    size_t index = rand() % current_size;
    volatile uint8_t value = array[index]; // here
}
```
```asm
+0xc4 ldr x8, [sp, #0x18] ; load loop counter i
+0xc8 mov x9, #0x4240
+0xcc movk x9, #0xf, lsl #16 ; x9 = 0xf4240 = 1,000,000 (loop bound)
+0xd0 subs x8, x8, x9
+0xd4 cset w8, hs ; w8 = 1 if i >= 1,000,000
+0xd8 tbnz w8, #0x0, "main+0x128" ; exit the loop once done
+0xdc b "main+0xe0"
+0xe0 bl "DYLD-STUB$rand" ; call rand()
+0xe4 mov x8, x0 ; rand()'s return value
+0xe8 sxtw x8, w8 ; sign-extend to 64 bits
+0xec ldur x10, [x29, #-0x10] ; load current_size
+0xf0 udiv x9, x8, x10 ; these three instructions
+0xf4 mul x9, x9, x10 ; compute
+0xf8 subs x8, x8, x9 ; rand() % current_size
+0xfc str x8, [sp, #0x10] ; store index
+0x100 ldur x8, [x29, #-0x18] ; load the array base pointer
+0x104 ldr x9, [sp, #0x10] ; reload index
+0x108 add x8, x8, x9 ; calculate memory address
+0x10c ldrb w8, [x8] ; load from memory
+0x110 strb w8, [sp, #0xf] ; store to the volatile `value`
+0x114 b "main+0x118"
+0x118 ldr x8, [sp, #0x18] ; i++
+0x11c add x8, x8, #0x1
+0x120 str x8, [sp, #0x18]
+0x124 b "main+0xc4" ; back to the loop condition
```
We should expect cache misses to increase at `+0x10c` as the size of the array exceeds the capacity of the L1d cache.
One other consideration - unfortunately, there isn't a deterministic way to pin a thread to a core on macOS. Based on the topology of my machine, measuring L1d cache misses only really makes sense when everything runs on a single core:
```
L2 L#0 (4096KB) --> efficiency cores
L1d L#0 (64KB) + L1i L#0 (128KB) + Core L#0 + PU L#0 (P#0)
L1d L#1 (64KB) + L1i L#1 (128KB) + Core L#1 + PU L#1 (P#1)
L1d L#2 (64KB) + L1i L#2 (128KB) + Core L#2 + PU L#2 (P#2)
L1d L#3 (64KB) + L1i L#3 (128KB) + Core L#3 + PU L#3 (P#3)
L1d L#4 (64KB) + L1i L#4 (128KB) + Core L#4 + PU L#4 (P#4)
L1d L#5 (64KB) + L1i L#5 (128KB) + Core L#5 + PU L#5 (P#5)
L2 L#1 (16MB) --> performance cores
L1d L#6 (128KB) + L1i L#6 (192KB) + Core L#6 + PU L#6 (P#6)
L1d L#7 (128KB) + L1i L#7 (192KB) + Core L#7 + PU L#7 (P#7)
L1d L#8 (128KB) + L1i L#8 (192KB) + Core L#8 + PU L#8 (P#8)
L1d L#9 (128KB) + L1i L#9 (192KB) + Core L#9 + PU L#9 (P#9)
```
I had to spam the recordings until I got a run that ran on only one core (this is something that I didn't care to do for [[How expensive are cache misses? - I]]).
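Since there's no affinity API, the closest workaround I know of is a QoS hint, which only biases the scheduler toward one cluster rather than pinning anything - a minimal sketch (I haven't verified that it changes what Instruments records):
```c
#include <pthread/qos.h>
#include <stdio.h>

int main(void) {
    // QOS_CLASS_USER_INTERACTIVE nudges the scheduler toward the performance
    // cluster; QOS_CLASS_BACKGROUND nudges it toward the efficiency cluster.
    // Neither pins the thread to a specific core.
    int err = pthread_set_qos_class_self_np(QOS_CLASS_USER_INTERACTIVE, 0);
    if (err != 0) {
        fprintf(stderr, "pthread_set_qos_class_self_np failed: %d\n", err);
        return 1;
    }
    // ... run the benchmark loop from above here ...
    return 0;
}
```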
Using array size 128: ![[Pasted image 20250413134616.png]]
Using array size 4096: ![[Pasted image 20250413134800.png]]
Using array size 65536: ![[Pasted image 20250413134856.png]]
Using array size 131072: ![[Pasted image 20250413135013.png]]
Using array size 262144: ![[Pasted image 20250413135118.png]]
We can see that as the array size increases, the percentage of L1d cache misses (both speculative and retired) increases as well. Pretty cool!
### Other observations
#### What's going on with `DYLD-STUB$rand`? How does it use L1d cache?
My hypothesis is that our array is competing with any internal state used by `rand()` for L1d cache space. We see cache misses for `rand()` blow up only after our array size surpasses 131,072 bytes.
If I remove the array accesses from the code, then we see zero cache misses coming from `rand()`: ![[Pasted image 20250413141026.png]]
What internal state does `rand()` actually use? It looks like `rand()` is backed by `arc4random()`, which keeps a 256-byte array internally. That array is *probably* competing for cache space with ours here. Another exploration for the backlog.
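One way to take that competition out of the picture (a sketch I haven't run against the counters yet): generate indices with a tiny PRNG like xorshift64, whose entire state fits in a single register, instead of calling `rand()`:
```c
#include <stddef.h>
#include <stdint.h>

// Marsaglia's xorshift64: the whole generator state is one 64-bit value, so
// index generation shouldn't compete with the array for L1d space the way
// rand()'s internal table might.
static inline uint64_t xorshift64(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

void touch_array(const uint8_t *array, size_t current_size) {
    uint64_t state = 0x9E3779B97F4A7C15ULL; // arbitrary non-zero seed
    for (size_t i = 0; i < 1000000; i++) {
        size_t index = (size_t)(xorshift64(&state) % current_size);
        volatile uint8_t value = array[index]; // same load as before
        (void)value;
    }
}
```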