kayvonf

Hopefully the following is a helpful summary:

Simultaneous multi-threading involves executing instructions from two different threads in parallel on a core. In class, I mainly described interleaved multi-threading, where each clock the core chooses one runnable thread and executes the next instruction in that thread's instruction stream using the core's execution resources. However, I do provide an illustration of simultaneous multi-threading on slide 84.
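
To make the interleaving concrete, here is a minimal C sketch of that per-clock selection. The structure, names, and round-robin policy are my own assumptions; real hardware does this in dedicated selection logic, not software.

```c
#include <stdbool.h>

#define NUM_CONTEXTS 4   /* hardware execution contexts resident on the core */

typedef struct {
    bool runnable;   /* false while stalled, e.g. waiting on memory */
    int  pc;         /* program counter for this hardware thread    */
} Context;

/* Each simulated clock: scan round-robin from the last issuer and pick
 * the first runnable context; that context issues its next instruction. */
int select_context(const Context ctx[], int last) {
    for (int i = 1; i <= NUM_CONTEXTS; i++) {
        int c = (last + i) % NUM_CONTEXTS;
        if (ctx[c].runnable)
            return c;
    }
    return -1;   /* every context is stalled: the core idles this clock */
}
```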

To fast-forward a bit in the lecture, a more modern NVIDIA GPU, the GTX 1080 (see slide 61), is able to maintain state for up to 64 execution contexts (called "warps" in NVIDIA-speak) on each of its cores, and each clock it chooses up to four of those 64 warps to execute instructions from. Those four warps execute simultaneously on the core using four different sets of execution resources. So there is interleaved multi-threading in that the core interleaves up to 64 execution contexts, and simultaneous multi-threading in that it chooses up to four of those contexts to run each clock. (And if you look carefully at the slide I linked to, there is also superscalar execution in that the core will try to run up to two independent instructions from each of those four warps -- up to a total of eight overall -- each clock. For those interested in more detail: the second instruction needs to be a non-arithmetic instruction; I didn't show load/store units in the figure.)
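
In the same spirit as the sketch above, here is the per-clock warp selection as code. The counts (64 resident warps, up to four issuing per clock) come from the slide; the code structure and names are hypothetical.

```c
#include <stdbool.h>

#define RESIDENT_WARPS  64   /* execution contexts resident on the core */
#define WARP_SCHEDULERS  4   /* warps that may issue in the same clock  */

/* Returns how many warps issue this clock (0 to 4). Each selected warp
 * may additionally dual-issue a second, independent instruction.       */
int select_warps(const bool runnable[RESIDENT_WARPS],
                 int chosen[WARP_SCHEDULERS]) {
    int n = 0;
    for (int w = 0; w < RESIDENT_WARPS && n < WARP_SCHEDULERS; w++) {
        if (runnable[w])
            chosen[n++] = w;   /* these warps execute simultaneously */
    }
    return n;
}
```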

Intel's Hyper-Threading implementation makes sense if you consider the context: Intel had spent years building superscalar processors that could execute several different instructions per clock (from within a single instruction stream). But as we discussed, it's not always possible for one instruction stream to have the right mixture of independent instructions to utilize all the available units in the core (this is the case of insufficient ILP). Therefore, it's a logical step to say: hey, to increase the CPU's chance of finding the right mix, let's modify our processor to have two threads available to choose instructions from instead of one!
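
A toy example of what insufficient ILP looks like (hypothetical code, not from the lecture): in the first loop every add depends on the previous one, so a superscalar core cannot keep multiple ALUs busy from that one stream; the second loop exposes four independent dependence chains.

```c
/* Serial dependence chain: each add must wait for the previous value
 * of s, so at most one add from this stream is in flight at a time. */
float sum_chain(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators: the core's scheduler can issue
 * several of these adds in the same clock. (Assumes n is a multiple
 * of 4 to keep the sketch short.)                                   */
float sum_unrolled(const float *a, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```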

Of course, running two threads is not always better than one, since the threads might thrash each other's data in the cache, resulting in more cache misses that ultimately cause far more stalls than Hyper-Threading could ever hope to fill. On the other hand, running two threads at once can also be beneficial in terms of cache behavior if the threads access similar data. One thread might access address X, bringing it into cache. Then, when the other thread accesses X for the first time, what normally would have been a cold miss in a single-threaded system turns out to be a cache hit!
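
A toy illustration of both effects (the names and sizes are hypothetical): if both hyper-threads walk the same small table, whichever thread touches a line first pays the cold miss and the other thread hits; if each thread streams through its own large array, the two working sets evict each other from the shared cache instead.

```c
#include <stddef.h>

#define TABLE_ELEMS (4 * 1024)         /* small: fits in cache, sharing helps  */
#define ARRAY_ELEMS (8 * 1024 * 1024)  /* large: two of these thrash the cache */

float shared_table[TABLE_ELEMS];
float private_a[ARRAY_ELEMS], private_b[ARRAY_ELEMS];

/* Run by both hyper-threads: the first touch of shared_table[i] is a
 * cold miss; the other thread's touch of the same line is then a hit. */
float sum_shared(void) {
    float s = 0.0f;
    for (size_t i = 0; i < TABLE_ELEMS; i++)
        s += shared_table[i];
    return s;
}

/* Thread 0 sums private_a while thread 1 sums private_b: the two
 * working sets compete for the same cache and evict each other.   */
float sum_private(const float *p) {
    float s = 0.0f;
    for (size_t i = 0; i < ARRAY_ELEMS; i++)
        s += p[i];
    return s;
}
```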

So to summarize: Intel processors that support Hyper-Threading maintain two execution contexts (hardware threads) on chip at once. Each clock, the core looks at the two available contexts and tries to find a mixture of runnable instructions that best utilizes all the execution units the core has available. In rare cases, one thread might have sufficient ILP to consume the full issue capability of the core, and if so, the chip may just run instructions from that one thread. In this case, the chip is essentially behaving like a processor performing interleaved multi-threading.
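
As a sketch (the structure is hypothetical; the real selection happens in the core's issue logic), each clock the core fills up to its issue width with ready instructions drawn from both contexts, and one context takes every slot if it alone has enough independent work:

```c
#define ISSUE_WIDTH 4   /* assumed issue width, for illustration */
#define HW_CONTEXTS 2   /* the two Hyper-Threading contexts      */

/* ready[c]: independent, ready instructions context c offers this
 * clock. issued[c]: how many of them actually issue. A real core
 * would alternate which context gets first pick; this greedy fill
 * just shows the policy's shape. If ready[0] >= ISSUE_WIDTH,
 * context 0 takes every slot, as described above.                 */
void fill_issue_slots(const int ready[HW_CONTEXTS],
                      int issued[HW_CONTEXTS]) {
    int slots = ISSUE_WIDTH;
    for (int c = 0; c < HW_CONTEXTS; c++) {
        issued[c] = ready[c] < slots ? ready[c] : slots;
        slots -= issued[c];
    }
}
```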

Finally, it might also be instructive for students to note that the motivation for adding multi-threading to an Intel CPU (called Hyper-Threading) was different from the motivation for large-scale multi-threading in a GPU. GPUs feature many execution contexts for the primary purpose of hiding memory latency. Intel Hyper-Threading isn't really intended to hide all memory latency (it only has two threads, and that's not enough to hide the long latencies of accesses out to memory). Instead, Intel Hyper-Threading exists to make it easier for the core's scheduler to find enough independent instructions to fill the multiple ALUs in a modern superscalar Intel CPU. In other words, the thinking was: if there is insufficient ILP in one thread to occupy all the ALUs, why not keep two threads around to draw instructions from? Of course, multi-threading does hide some latency when one thread stalls on a memory access, but unlike on GPUs, CPU multi-threading is not intended to hide a significant fraction of memory latency.

rrastogi

This slide claims that the processor, and not the operating system, decides which thread to run in a hardware-supported multi-threading scheme. Why would this be preferable? In this scheme, it would be impossible (or at least harder) to support thread priorities. In addition, thread-scheduling logic seems complicated enough that it seems better for the OS to do it, as opposed to dedicating more hardware resources to it on the processor.

kayvonf

@rrastogi: Because this switching between hardware threads occurs at the granularity of a few cycles. Note that in this lecture, multi-threading refers to choosing which of the available hardware-resident threads should have its next instruction run by the processor core.

Involving the OS would mean performing a full thread context switch, which involves replacing the OS thread that is assigned to a specific hardware execution context. That requires copying the register state of the outgoing thread out to memory and swapping the register state of the new OS thread in, an operation that may take thousands of cycles. The cost of the OS swapping threads is far greater than the latency of the memory stalls that hardware multi-threading is designed to hide.
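
Some rough arithmetic makes the point (the cycle counts below are my own order-of-magnitude assumptions, not measured figures):

```c
#include <stdio.h>

int main(void) {
    int memory_stall = 300;   /* assumed: cycles a DRAM load might stall   */
    int hw_switch    = 0;     /* extra cycles to switch hardware contexts:
                                 the register state is already on chip     */
    int os_switch    = 5000;  /* assumed: save/restore registers, kernel
                                 entry/exit, cache and TLB disturbance     */
    printf("an OS switch costs ~%dx the stall it would hide; "
           "a hardware switch costs ~%d extra cycles\n",
           os_switch / memory_stall, hw_switch);
    return 0;
}
```

Spending thousands of cycles to dodge a few-hundred-cycle stall is a net loss, which is why the selection has to live in hardware.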