tyler.johnson

I'm curious what the cost of these separate memory spaces is relative to what we're used to in more traditional CPU multithreading. If data has to be sent over to GPU-specific memory, then it seems like this transfer throughput could be a big potential bottleneck. Additionally, who controls the mapping logic? Is there a scheduling algorithm running on the CPU that handles the assignment of work to GPU cores and thus some of the memory mapping?
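
To make the copy cost concrete, here's a rough sketch of the explicit transfers I'm talking about (the saxpy kernel, sizes, and variable names are just placeholders I made up); every crossing between the CPU and GPU address spaces has to go through a call like cudaMemcpy:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);

    // Host (CPU) allocations live in one address space...
    float* h_x = (float*)malloc(bytes);
    float* h_y = (float*)malloc(bytes);
    for (int i = 0; i < N; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    // ...device (GPU) allocations live in another.
    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);

    // Each boundary crossing is an explicit copy over the PCIe bus,
    // which is exactly the throughput cost in question.
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    saxpy<<<(N + 255) / 256, 256>>>(N, 2.0f, d_x, d_y);

    // Results must be copied back before the CPU can see them.
    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", h_y[0]);

    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}
```

Even with a fast interconnect, that host-device bandwidth is roughly an order of magnitude lower than the GPU's own memory bandwidth, which is why minimizing these transfers matters so much.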

haofeng

For execution, CUDA employs the idea of breaking the problem into thread blocks for SPMD processing. The programmer defines the number of threads in a thread block, and the thread blocks handed to the scheduler can run in any order. Unlike ISPC, the CUDA programming model does not explicitly expose a "gang" (its hardware counterpart is the "warp") to the programmer. However, the programmer should still avoid execution divergence as much as possible to keep computation utilization high.
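
A minimal sketch of the launch side of this (kernel name and sizes are made up): the programmer only specifies the threads-per-block count and the number of blocks, and the hardware scheduler is free to run the blocks in any order.

```cuda
#include <cuda_runtime.h>

__global__ void scale(int n, float a, float* data) {
    // Each thread derives its own element index from its block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // A uniform bounds check like this costs little; a branch that splits
    // threads within the same warp is what hurts utilization.
    if (i < n) data[i] *= a;
}

int main() {
    const int n = 10000;
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // The programmer picks the threads-per-block count (128 here);
    // the blocks themselves may execute in any order.
    int threadsPerBlock = 128;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<numBlocks, threadsPerBlock>>>(n, 2.0f, d_data);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```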

There are memory units at different levels (per-thread, per-block, etc.) for faster memory access. Threads in a block can cooperate by reading and writing the block-level (shared) memory. However, when a thread block finishes, the contents of its block-level memory are released rather than handed to later blocks, so results that other blocks need must be written back to global memory. One thing I'm still wondering is how the CUDA scheduler organizes the communication between blocks.
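
Here's a rough sketch of what I mean by block-level cooperation, assuming a made-up blockSum kernel launched with 256 threads per block; note that the per-block result only becomes visible to other blocks (or a later kernel) by going through global memory.

```cuda
#include <cuda_runtime.h>

__global__ void blockSum(const float* in, float* blockResults, int n) {
    // Block-level memory: visible to all threads in this block only,
    // and released when the block finishes.
    __shared__ float partial[256];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // cooperating threads in the block must synchronize

    // Tree reduction within the block (assumes blockDim.x is a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // The per-block result reaches other blocks only through global memory.
    if (tid == 0) blockResults[blockIdx.x] = partial[0];
}

int main() {
    const int n = 1 << 16;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    blockSum<<<blocks, threads>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```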

ChrisGabor

I've noticed that deep learning frameworks like TensorFlow give me an error saying device memory has been exceeded when I try to compute gradients over hundreds of images. It appears that the gradients must all stay in device memory. I thought it would make more sense to transfer the gradients to host memory, but I imagine that would be a huge bottleneck for the GPU and would likely make training memory bound instead of compute bound.
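
To get a feel for why that transfer would hurt, here is a rough sketch (not TensorFlow's actual mechanism; the buffer size and names are made up) that checks free device memory and times a device-to-host copy of a gradient-sized buffer:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t freeBytes, totalBytes;
    cudaMemGetInfo(&freeBytes, &totalBytes);  // what the "memory exceeded" error is about
    printf("device memory: %zu MB free of %zu MB\n",
           freeBytes >> 20, totalBytes >> 20);

    const size_t bytes = 256u << 20;  // 256 MB stand-in for a batch of gradients
    float *d_grad, *h_grad;
    cudaMalloc(&d_grad, bytes);
    cudaMallocHost(&h_grad, bytes);   // pinned host memory, the fast case for PCIe

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(h_grad, d_grad, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // PCIe gives on the order of tens of GB/s, versus hundreds of GB/s for
    // on-device memory, which is why spilling every gradient to the host
    // would tend to make training memory bound.
    printf("copied %zu MB to host in %.2f ms (%.1f GB/s)\n",
           bytes >> 20, ms, (bytes / 1e9) / (ms / 1e3));

    cudaFree(d_grad);
    cudaFreeHost(h_grad);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```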
