The left program performs worse because of false sharing: different elements of the array counter may reside in the same cache line, so threads that each update only their own element still force that line to bounce between cores.
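To make the two layouts concrete, here is a minimal sketch. It is not the code from the slide; the names `PackedCounter`, `PaddedCounter`, and `run_counters` are hypothetical, and the 64-byte line size is an assumption that holds on many x86 machines but not everywhere.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache-line size; 64 bytes is common but not universal.
constexpr std::size_t kCacheLine = 64;

// Packed layout: neighbouring counters sit a few bytes apart,
// so several of them can share one cache line.
struct PackedCounter {
    std::atomic<int> value{0};
};

// Padded layout: alignas rounds every element up to a full line,
// so no two counters share a line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<int> value{0};
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per line");

// Each thread increments only its own counter, so there is no data race.
// Any slowdown with PackedCounter comes from the shared line bouncing
// between cores. Needs C++17 for over-aligned elements in std::vector.
template <typename Counter>
long long run_counters(int num_threads, int iters) {
    assert(num_threads > 0 && iters >= 0);
    std::vector<Counter> counters(num_threads);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counters, t, iters] {
            for (int i = 0; i < iters; ++i) {
                // Relaxed atomic keeps the compiler from collapsing the loop.
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();

    long long total = 0;
    for (const auto& c : counters) total += c.value.load();
    return total;
}
```

Both `run_counters<PackedCounter>(n, iters)` and `run_counters<PaddedCounter>(n, iters)` return the same total; timing the two calls with a large `iters` is one way to see the effect, though how big the gap is depends on the machine.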
sagoyal
I was having a hard time understanding how this content overlaps with what Kunle was talking about with shared memory bank conflicts, and I realized that in CUDA these two issues can actually overlap. This video (at 11:47) provides a good explanation of how padding can reduce bank conflicts.
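For anyone who wants to check the padding arithmetic without a GPU, here is a small host-side sketch in plain C++ that just computes which bank each access lands in. It assumes 32 banks that are each one 4-byte word wide and a 32x32 row-major tile of floats; `bank_of` and `distinct_banks_for_column` are hypothetical helper names, not CUDA API calls.

```cpp
#include <cassert>
#include <set>

constexpr int kNumBanks = 32;  // assumption: 32 banks, each one 4-byte word wide
constexpr int kTileDim = 32;   // assumption: 32x32 tile, one thread per row

// Bank holding element (row, col) of a row-major tile whose rows are
// `row_stride` words long (row_stride > kTileDim means the rows are padded).
int bank_of(int row, int col, int row_stride) {
    return (row * row_stride + col) % kNumBanks;
}

// Number of distinct banks touched when thread i reads element (i, col),
// i.e. when 32 threads walk down one column of the tile.
int distinct_banks_for_column(int col, int row_stride) {
    assert(col >= 0 && col < kTileDim && row_stride >= kTileDim);
    std::set<int> banks;
    for (int row = 0; row < kTileDim; ++row) {
        banks.insert(bank_of(row, col, row_stride));
    }
    return static_cast<int>(banks.size());
}
```

With a stride of 32 every row of a column maps to the same bank, so `distinct_banks_for_column(0, 32)` is 1 and the accesses serialize. Padding each row by one word (stride 33) shifts every row over by one bank, so `distinct_banks_for_column(0, 33)` is 32 and the accesses no longer collide.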