After pondering for quite a while, I came to this conclusion about the connection between CUDA threads, CUDA thread blocks, warps and execution contexts:
Warps live on each SM, and each warp holds a fixed number of execution contexts (32 in this case). Each execution context can have a thread mapped to it, but instead of mapping individual threads, we map threads in thread blocks, since that lets threads in the same block share a block-wide shared memory. Each thread block is then partitioned into warps of 32 consecutive threads.
A sanity check: can a single V100 SM support up to 64 warps × 32 threads/warp = 2048 CUDA threads? And in an extreme case, could you map kernel<<<64, 32>>> (64 blocks of 32 threads each) onto a single SM?
This slide shows shared memory and the L1 cache as a single storage unit. Does that mean a program that uses more shared memory will have a smaller L1 cache?
There are 64 KB of registers per sub-core, each sub-core holds 16 warps, and each warp has 32 threads. Does this mean each thread gets 64 KB / (16 × 32) = 128 bytes of private register storage?