Slide 51 of 73
chii

Does automatic optimization / schedule generation like this exist for the parallel computing platforms we have covered, like CUDA or OpenMP?

pmp

I have the same question as chii.

Also, it seems amazing that the Halide compiler can search over this space of schedules to try to predict the best schedule. How does it do that? Is it possible because there are so few scheduling directives in Halide that there are not that many permutations (other than determining a good number of threads, chunk size, etc.)?

tp

Would it be possible to design a different kind of scheduling syntax that's easier for more programmers to use? It seems like if only a handful of people in the world can write good schedules, then maybe Halide's scheduling interface isn't an optimal design. Or is writing schedules just a really hard problem?

a7hu

@chii There is no auto-generation for CUDA AFAIK. Scheduling of CUDA kernel launches and memcpys from the host is done via CUDA streams. By using streams properly, a user can enable kernels to run concurrently or hide the latency of memcpys and kernel launches. Efficient use of streams depends heavily on the user knowing the behavior of the program. For instance, to allow two kernels to run in parallel on the SMs, the two kernels must not have dependencies on each other. This knowledge would be extremely difficult for a compiler to figure out on its own.

The closest thing to auto-scheduling CUDA kernels, memsets, and memcpys is CUDA graphs. It lets the user specify dependencies between kernels/memsets/memcpys via a dependency graph, like what we did in Programming Assignment 2. Once the user presents the dependency graph to CUDA, CUDA is responsible for resolving the dependencies and efficiently scheduling the nodes in the graph.
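To make the stream idea concrete, here is a minimal sketch. The kernel names (kernelA, kernelB) and sizes are made up for illustration; the point is just that launches issued on different streams have no implied ordering, so the hardware is free to overlap them if resources allow:

```cuda
#include <cuda_runtime.h>

// Two hypothetical, independent kernels: neither reads or writes the
// other's data, so there is no dependency between them.
__global__ void kernelA(float* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

__global__ void kernelB(float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] += 1.0f;
}

int main() {
    const int N = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, N * sizeof(float));
    cudaMalloc(&b, N * sizeof(float));

    // One stream per independent chain of work. Work within a stream
    // runs in issue order; work in different streams may overlap.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    kernelA<<<(N + 255) / 256, 256, 0, s1>>>(a, N);
    kernelB<<<(N + 255) / 256, 256, 0, s2>>>(b, N);

    // Wait for both streams before using the results.
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

This is the kind of structure CUDA graphs can automate: you can build a graph by recording a sequence of stream operations (cudaStreamBeginCapture / cudaStreamEndCapture) and then let the runtime replay it, resolving the dependencies for you.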
