How is atomic add implemented? How many clocks does an atomic add take?
nickbowman
This is not reasonable CUDA code to run on a single-core GPU that only has resources for one thread block per core. If thread block 0 runs first, there are no issues: the flag is incremented/set before thread block 1 runs, so thread block 1 finds its while loop condition already satisfied and exits right away. However, if thread block 1 runs first, we have a major problem: it enters its while loop and never exits. Because thread blocks in CUDA are not preempted once they are scheduled, thread block 1 keeps the core's only slot, so thread block 0 never gets a chance to run and set the flag, and thread block 1 is stuck in its while loop forever.
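For anyone who wants to see the pattern concretely, here is a minimal sketch of the kind of kernel I am describing. The code under discussion is not reproduced in this thread, so the names `spinKernel` and `flag` and the launch configuration are my own assumptions, not the original. On a GPU that can keep both blocks resident at once this should finish; on the hypothetical single-core, one-block-per-core GPU it can hang if block 1 is scheduled first.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical reconstruction of the flag pattern discussed above.
// `flag` and `spinKernel` are illustrative names, not from the original code.
__global__ void spinKernel(int *flag) {
    if (blockIdx.x == 0) {
        // Block 0 sets the flag with an atomic increment.
        if (threadIdx.x == 0) {
            atomicAdd(flag, 1);
        }
    } else {
        // Block 1 spins until it observes the flag.
        // atomicAdd(flag, 0) is used here as an atomic read of the flag.
        if (threadIdx.x == 0) {
            while (atomicAdd(flag, 0) == 0) {
                // Busy-wait. Nothing here yields the core, so if block 0
                // is not already resident it can never be scheduled.
            }
        }
    }
}

int main() {
    int *flag = nullptr;
    cudaMalloc(&flag, sizeof(int));
    cudaMemset(flag, 0, sizeof(int));

    // Two blocks: block 0 writes the flag, block 1 waits on it.
    spinKernel<<<2, 32>>>(flag);

    // Returns only if both blocks ran to completion.
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel finished: %s\n", cudaGetErrorString(err));

    cudaFree(flag);
    return 0;
}
```

The takeaway is that the code silently assumes both blocks are resident on the GPU at the same time, and an ordinary kernel launch does not promise that. If you really need blocks to wait on each other, a cooperative launch (`cudaLaunchCooperativeKernel`) is the mechanism that requires all blocks to be co-resident.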