Slide 59 of 88
tspint

For simultaneous multi-threading / hyper-threading (are those the same thing?), does that mean that each core must have at least two execution units? How is this an extension of superscalar rather than just simply superscalar? Why does Intel Hyper-threading specifically refer to 2 threads per core and not more?

l-henken

@tspint Multi-threading and hyper-threading are not strictly the same thing. You can have multi-threading without hyper-threaded hardware. Multi-threading is the idea of interleaving threads based on the state of the threads (i.e., stalls and dependencies). We say that hyper-threading is an extension of superscalar because the hardware uses similar logic to find two instructions that can be run simultaneously. The difference between the two is that superscalar execution picks the two instructions from one instruction stream, whereas hyper-threading picks the two instructions from two instruction streams.

ufxela

@l-henken thanks for the explanation. what's the difference between hyperthreading and having multiple cores then? Could it be that hyperthreading automatically detects multiple instr streams (how does it do this?) whereas multiple cores does not?

l-henken

Multi-threading as a term can mean either providing the abstraction that multiple streams are executing at the same time, or that they actually are executing at the same time. It is, for sure, confusing diction.

When you have multiple cores, you are provided "multi-threading" in the sense that multiple streams can actually run at the same time, because each core has its own context, its own fetch/decode units, its own ALUs, etc. In one time slice, N threads can run on N cores.

When you have hyperthreading on a single core, you are providing "multi-threading" in the sense that the OS thinks it is running multiple streams at the same time. But a 2-way hyperthreaded core (in this simplified picture) has one execution unit (ALU) shared by 2 contexts, so only one context can run at a time. With N such 2-way hyperthreaded cores, in one time slice, N threads can run (one per core) just like above, but 2N threads can be resident on those N cores. This is possible because of the two execution contexts per core.

Hyperthreading doesn't detect multiple streams; it chooses between the two possible streams that reside in those two contexts. So the OS can schedule 2N software threads and amortize the cost of context switching. It also allows the hardware to hide latency, as we talk about later: if one of the contexts is waiting on memory, the core can run the other context.
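A minimal sketch of that interleaving (a toy Python model, not real hardware; the MEM_LATENCY value and the pick-the-first-runnable-context policy are made-up assumptions):

```python
# Toy model (not real hardware) of interleaved multi-threading: each
# clock the core issues one instruction from a runnable context; a
# context that issued a memory instruction stalls for MEM_LATENCY clocks.

MEM_LATENCY = 3  # made-up stall length, in clocks

def run(streams):
    """streams: one instruction list per context, entries 'alu' or 'mem'."""
    pc = [0] * len(streams)        # per-context program counter
    ready_at = [0] * len(streams)  # clock at which each context is runnable
    clock = 0
    trace = []                     # (clock, context) for each issued instruction
    while any(pc[i] < len(s) for i, s in enumerate(streams)):
        runnable = [i for i, s in enumerate(streams)
                    if pc[i] < len(s) and ready_at[i] <= clock]
        if runnable:
            i = runnable[0]        # pick the first runnable context
            if streams[i][pc[i]] == 'mem':
                ready_at[i] = clock + 1 + MEM_LATENCY
            pc[i] += 1
            trace.append((clock, i))
        clock += 1                 # if nothing is runnable, the core stalls
    return trace

# Thread 0 stalls on memory after its first instruction; the core fills
# the stall with thread 1's work instead of sitting idle:
trace = run([['mem', 'alu'], ['alu', 'alu', 'alu']])
# -> [(0, 0), (1, 1), (2, 1), (3, 1), (4, 0)]
```

With a single context the core would sit idle for the three stall clocks; with two contexts every clock does useful work.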

tspint

I was referring to simultaneous multi-threading vs. hyper-threading, which I believe is different from regular multi-threading because the processor does not just interleave threads when they stall.

Nian

@l-henken, it seems that a 2-way hyperthreaded core has 2 ALUs. Please see slide 85 in these slides.

cmchiang

In class I asked what the difference between hardware threads and software threads is. The context of each hardware thread is maintained by the CPU cores, and the operating system sees (#cores × #threads per core) CPUs. For example, on a myth machine, lscpu gives the following output:

Architecture:        x86_64
CPU(s):              8
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1

And if you run htop, it will list 8 CPUs.

Software threads, on the other hand, are maintained by the OS.
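One way to see the distinction: the OS exposes one "CPU" per hardware thread, but software threads are an OS abstraction, so you can create far more of them than the hardware can run at once. A small Python sketch (the count of 32 threads is an arbitrary choice for illustration):

```python
import os
import threading

# Hardware threads: what lscpu/htop count -- the cores x threads-per-core
# contexts maintained by the CPU. Software threads: OS-managed, so we can
# create many more of them than the hardware can run at once; the OS
# simply time-slices them onto the hardware threads.

hw_threads = os.cpu_count()   # e.g. 8 on a myth machine (4 cores x 2-way HT)

results = []
lock = threading.Lock()

def work(i):
    with lock:
        results.append(i)     # stand-in for real per-thread work

# 32 software threads (an arbitrary number, likely > hw_threads):
threads = [threading.Thread(target=work, args=(i,)) for i in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All 32 run to completion even on a machine with only 8 hardware threads.
```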

kayvonf

Simultaneous multi-threading involves executing instructions from two different threads in parallel on a core. In class, I mainly described and illustrated interleaved multi-threading, where each clock the core chooses one runnable thread and executes the next instruction in that thread's instruction stream using the core's execution resources. However I do provide an illustration of simultaneous multi-threading on slide 86. See the bonus slides sequence starting on slide 82 for more detail.

To fast-forward a bit in the lecture, a more modern NVIDIA GPU, the GTX 1080 (see slide 62), is able to maintain state for up to 64 execution contexts (called "warps" in NVIDIA-speak) on its "SM" cores, and each clock it chooses up to four of those 64 threads to execute instructions from. Those four threads execute simultaneously on the core using four different sets of execution resources. So there is interleaved multi-threading in that the chip interleaves up to 64 execution contexts, and simultaneous multi-threading in that it chooses up to four of those contexts to run each clock. (And if you look carefully at the slide I linked to, there is also superscalar execution in that the core will try to run up to two independent instructions from each of those four warps -- up to a total of eight overall -- each clock.) (For those interested in the nitty-gritty: in the GTX 1080 example, the second instruction that can be run simultaneously from the same thread needs to be a non-arithmetic instruction... I didn't show load/store units in the figure.)

Intel's Hyper-threading implementation makes sense if you consider the context: Intel had spent years building superscalar processors that could perform a number of different instructions per clock (within a single instruction stream). But as we discussed, it's not always possible for one instruction stream to have the right mixture of independent instructions to utilize all the available units in the core (this is the case of insufficient ILP). Therefore, it's a logical step to say, hey, to increase the CPU's chance of finding the right mix, let's modify our processor to have two threads available to choose instructions from instead of one!
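That reasoning can be sketched with a toy model (purely illustrative: real dependence checking operates on registers, and a real SMT core would also dual-issue within each stream when possible):

```python
# Toy model of 2-wide issue. An instruction stream is a list of flags:
# True means "depends on the previous instruction". This ignores real
# register renaming and dependence analysis -- it only illustrates why
# a second stream helps fill issue slots.

def clocks_superscalar(stream):
    """Dual-issue from ONE stream: pair instruction i with i+1
    only when i+1 does not depend on i."""
    clocks = i = 0
    while i < len(stream):
        if i + 1 < len(stream) and not stream[i + 1]:
            i += 2              # two independent instructions this clock
        else:
            i += 1              # dependency (or end of stream): issue one
        clocks += 1
    return clocks

def clocks_hyperthreaded(s0, s1):
    """Dual-issue from TWO streams: instructions from different threads
    are independent by construction, so pair one from each while both
    have work left. (A real SMT core would also pair within a stream.)"""
    clocks = i = j = 0
    while i < len(s0) or j < len(s1):
        if i < len(s0):
            i += 1
        if j < len(s1):
            j += 1
        clocks += 1
    return clocks

# A fully dependent chain defeats dual-issue from one stream (no ILP),
# but two such chains keep both issue slots busy every clock:
chain = [False] + [True] * 7    # 8 instructions, each dependent on the last
one_stream = clocks_superscalar(chain)            # 8 clocks for 8 instructions
two_streams = clocks_hyperthreaded(chain, chain)  # 8 clocks for 16 instructions
```

With insufficient ILP in one stream the second issue slot goes unused; drawing from two streams doubles throughput in this contrived case.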

kayvonf

Of course, running two threads is not always better than one, since these threads might thrash each other's data in the cache, resulting in more cache misses that ultimately cause far more stalls than Hyper-threading could ever hope to fill. On the other hand, running two threads at once can also be beneficial in terms of cache behavior if the threads access similar data. One thread might access address X, bringing it into the cache. Then, if X is accessed by the other thread for the first time, what normally would have been a cold miss in a single-threaded system turns out to be a cache hit!
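A toy shared-cache model makes both effects concrete (the LRU policy, the capacity, and the address names are all made up for illustration):

```python
from collections import OrderedDict

# Toy LRU cache shared by the two hardware threads of one core. The
# point: co-resident threads share capacity, so they can either
# prefetch for each other or thrash each other.

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)    # hit: refresh LRU position
            return True
        self.misses += 1                    # miss: fetch the line
        self.lines[addr] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least-recently-used line
        return False

# Sharing: thread 1's first touch of X is a hit, not a cold miss,
# because thread 0 already brought X into the shared cache.
shared = LRUCache(capacity=4)
shared.access('X')                 # thread 0: cold miss
hit = shared.access('X')           # thread 1: hit (True)

# Thrashing: two disjoint working sets that each fit alone but not
# together -- with the accesses interleaved, every access misses.
thrash = LRUCache(capacity=4)
for _ in range(2):
    for addr in ['A0', 'A1', 'A2', 'A3', 'B0', 'B1', 'B2', 'B3']:
        thrash.access(addr)        # 16 accesses, 16 misses
```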

Finally, it might also be instructive for students to note that the motivation for adding multi-threading to an Intel CPU (called Hyper-threading) was different from the motivation for large-scale multi-threading in a GPU. GPUs feature many execution contexts for the primary purpose of hiding memory latency. Intel Hyper-threading isn't really intended to hide all memory latency (it only has two threads, and that's not enough to hide the long latency of going out to memory). Instead, Intel Hyper-threading exists to make it easier for the core's scheduler to find enough independent instructions to fill the multiple ALUs in a modern superscalar Intel CPU. In other words, the thinking was: if there is insufficient ILP in one thread to occupy all the ALUs, why not keep two threads around to draw instructions from? Of course, multi-threading does hide some latency when one thread stalls on a memory access, but unlike on GPUs, CPU multi-threading is not intended to hide a significant fraction of memory latency.
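Some back-of-the-envelope arithmetic for this point (the latency and work numbers below are assumed for illustration, not taken from the lecture):

```python
import math

# Assume each thread executes W cycles of arithmetic, then stalls for
# L cycles on a memory access. To keep the ALUs busy through a stall,
# the other contexts must supply L cycles of work at W cycles each,
# so fully hiding the latency takes 1 + ceil(L / W) contexts.

def contexts_to_hide(L, W):
    return 1 + math.ceil(L / W)

mem_latency = 300   # assumed DRAM latency, in cycles
work = 10           # assumed arithmetic cycles between memory accesses

gpu_style = contexts_to_hide(mem_latency, work)   # 31 contexts needed
# A 2-way hyper-threaded core has only one extra context, which
# covers just W cycles of the L-cycle stall:
hidden_by_ht = (2 - 1) * work                     # 10 of 300 cycles
```

Under these assumed numbers a GPU-style design wants tens of contexts per core, while two contexts cover only a small slice of the stall, which is consistent with Hyper-threading being aimed at filling ALUs rather than hiding memory latency.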

I refer you to the slides at the end of this lecture for more illustrations of these concepts.

Another good example of a multi-threaded processor, where support of multiple threads was intended to hide memory access latency, is the UltraSPARC T2 chip, which features eight threads per core. An academic paper about T2 is here.

A historical note: the UltraSPARC T1 and T2 chips were also referred to as "Niagara", and you may recognize the name of one of the architects of that original chip.
