This gets into the idea of artifactual communication that we talked about back in our message-passing days. Since caches operate at the granularity of CACHE_LINE_SIZE (which is generally much larger than a single int), the first program unintentionally places multiple counters in the same cache line. Thus, even though the threads do not share counters, they share cache lines! Every write to a counter then invalidates that line in the other cores' caches, so the cache coherence protocol keeps moving the line between cores and slows the program down. This effect is commonly called false sharing.
parallelpower
The key idea is that cache coherence works at the granularity of cache lines (typically 64 bytes), not at the granularity of ints (typically 4 bytes).