jgrace

Do these operations work across cores and CUDA threads? For instance, if we wanted to do some series of computations and then perform an operation on all entries in a large vector being worked on by different CUDA threads, would we need to gather all the values into one main thread to perform the global operation? I believe this is what PyTorch's distributed data parallel library does.
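
For a single GPU, a minimal sketch (hypothetical names, and not necessarily what PyTorch does) suggests no "main" thread is needed: atomics on global memory are coherent across all blocks and SMs, so every thread can fold its element into one accumulator directly.

    // *result must be zeroed before launch.
    __global__ void global_sum(const float* in, float* result, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(result, in[i]);  // atomics work across blocks and SMs
    }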

bmo

@jgrace, hopefully I'm not missing your point. I don't see anything stopping us from doing these operations across cores/CUDA threads. I think it just depends on how your CUDA kernel is written. We should be able to, within kernel code (see the sketch after this list):

  1. do some computations
  2. __syncthreads()
  3. do some other operations
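
Something like this block-level sketch (hypothetical names; note that __syncthreads() only synchronizes threads within one block, not across the grid):

    __global__ void square_then_sum(const float* in, float* out, int n) {
        __shared__ float buf[256];                 // assumes blockDim.x <= 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // 1. do some computations
        buf[threadIdx.x] = (i < n) ? in[i] * in[i] : 0.0f;

        // 2. __syncthreads(): every thread in the block waits here
        //    until buf is fully written
        __syncthreads();

        // 3. do some other operations that read values produced by other threads
        if (threadIdx.x == 0) {
            float sum = 0.0f;
            for (int j = 0; j < blockDim.x; j++)
                sum += buf[j];
            out[blockIdx.x] = sum;                 // one partial result per block
        }
    }
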
msere

I also wonder how this would work between threads, specifically with scatter, unless it's done in hardware. If a kernel does a scatter and then uses output[i], that CUDA thread won't know whether the thread responsible for populating output[i] has done so yet, and if the responsible thread is on another SM, then __syncthreads() wouldn't work here either. I would think it's either handled by a single SM, as @jgrace brought up, or the GPU somehow blocks all threads performing the scatter across all SMs from continuing until they all complete their part of it.
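
One standard way around this (a sketch with hypothetical names) is to use the kernel launch boundary itself as the grid-wide barrier: end the kernel after the scatter and read output[i] in a second kernel, since a launch on the same stream cannot start until every thread of the previous launch has finished. CUDA's cooperative groups also provide a true grid-wide barrier (grid.sync()), at the cost of launching with cudaLaunchCooperativeKernel.

    __global__ void scatter_kernel(const float* in, const int* idx,
                                   float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[idx[i]] = in[i];        // scatter: write to a computed location
    }                                   // (assumes idx values are unique, in range)

    __global__ void consume_kernel(const float* out, float* result, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            result[i] = 2.0f * out[i];  // safe: the scatter finished before this launch
    }

    // Host side: back-to-back launches on the same stream act as a global barrier.
    // scatter_kernel<<<blocks, threads>>>(d_in, d_idx, d_out, n);
    // consume_kernel<<<blocks, threads>>>(d_out, d_result, n);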

jt

Interestingly, it was mentioned that SIMD instruction sets also support scatter and gather operations to deal with non-contiguous elements.
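
For example (a host-side C++ sketch with a hypothetical helper name), AVX2 provides gather intrinsics, and AVX-512 adds scatter:

    #include <immintrin.h>

    // Gather 8 floats from non-contiguous locations: lane j gets base[idx[j]].
    __m256 gather8(const float* base, const int* idx) {
        __m256i vidx = _mm256_loadu_si256((const __m256i*)idx);
        return _mm256_i32gather_ps(base, vidx, 4);   // scale = 4 bytes per float
    }

    // The inverse (scatter) requires AVX-512, e.g. _mm512_i32scatter_ps.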
