tyler.johnson

This type of design stands out to me as something that could be a major sticking point in modern applications. Specifically, your last line ("poorly written code might execute at 1/32 the peak capability of the machine") makes me wonder what we can do to make this less problematic. If somebody has formal training or is working on a team where standards can be enforced, I'm sure this becomes less of an issue, but what about enthusiasts or small (perhaps startup) teams trying to innovate with less formal practice or money? Is there any chance of a language or framework written in a way that enforces best practices for extracting maximum utilization (perhaps something like a modern graphics engine or AI package already does this)?
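
(For concreteness, here is a contrived sketch of the worst case behind that "1/32" number, assuming 32-wide warps; the kernel name and the per-lane work are made up for illustration, not taken from the slides.)

```cuda
// Contrived worst case behind the "1/32 of peak" figure (assuming 32-wide
// warps): control flow is arranged so that at any instant only one of the
// 32 lanes in a warp is doing useful work.
__global__ void worst_case_divergence(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int lane = threadIdx.x % 32;          // this thread's lane within its warp
    float x = data[i];

    // In step k only lane k is active; the other 31 lanes sit masked off,
    // so the warp delivers roughly 1/32 of its peak arithmetic throughput.
    for (int k = 0; k < 32; ++k) {
        if (lane == k)
            x = x * 1.0001f + (float)k;   // stand-in for lane-specific work
    }
    data[i] = x;
}
```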

a7hu

I believe what the slide calls "implicit SIMD" is what GPUs call the "SIMT" execution model. SIMT stands for single instruction, multiple threads. In the example code, execute(my_function, N) means executing the instruction sequence of my_function on different parts of the output data across N threads.
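
(A minimal sketch of how that pseudocode might map onto CUDA's SIMT model; execute_kernel and the element-wise body of my_function are placeholders, not an API from the slides.)

```cuda
#include <cuda_runtime.h>

// Per-element work; stands in for the body of my_function in the example.
__device__ float my_function(float x) {
    return 2.0f * x + 1.0f;
}

// Each CUDA thread applies my_function to one element of the output.
__global__ void execute_kernel(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        out[i] = my_function(in[i]);
}

// Host-side launch, roughly the "execute(my_function, N)" of the example:
// N logical threads, which the hardware groups into 32-wide warps that
// share one instruction stream.
void execute(float *d_out, const float *d_in, int n) {
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    execute_kernel<<<blocks, threadsPerBlock>>>(d_out, d_in, n);
}
```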

jessiexu

@tyler.johnson there are many existing libraries that are optimized and verified for GPUs. For example, TensorFlow uses the cuDNN library. If a library is not an option, a profiler is necessary to find a program's bottleneck; the NVIDIA Visual Profiler can detect branch divergence.
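
(As an illustration of the kind of branch such a profiler flags, here is a textbook-style pair of kernels; the kernel names and the trivial work inside them are made up. The two kernels assign work to elements differently; the point is only the branch pattern.)

```cuda
// In bad_kernel adjacent lanes of a warp disagree on the condition, so
// every warp executes both sides of the branch (divergence). In
// better_kernel the condition is uniform across each 32-thread warp,
// so no warp diverges.
__global__ void bad_kernel(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0)             // lanes within a warp disagree
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 2.0f;
}

__global__ void better_kernel(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / 32) % 2 == 0)      // all lanes in a warp agree
        out[i] = in[i] * 2.0f;
    else
        out[i] = in[i] + 2.0f;
}
```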
