Slide 54 of 88
potato

How does the processor maintain 100% utilization in this example when switching threads? Doesn't it take time to switch out the TLB, etc.?

kostun

I think each "execution context" would keep track of its own TLB state. When the core switches between contexts 1, 2, 3, or 4, it only has to change which context it is currently "pointing" to.
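A minimal sketch of this idea (my own toy model, not the actual hardware design): each context is just a private set of registers, and "switching" is an O(1) change of an index, with no state copied to or from memory.

```python
# Toy model: a core holding several hardware execution contexts.
# Switching contexts changes only which register set the core points to.

class ExecutionContext:
    def __init__(self, tid):
        self.tid = tid
        self.registers = [0] * 32   # private architectural registers
        self.pc = 0                 # private program counter

class Core:
    def __init__(self, num_contexts=4):
        self.contexts = [ExecutionContext(t) for t in range(num_contexts)]
        self.current = 0            # index of the active context

    def switch_to(self, tid):
        # No save/restore traffic: only the pointer changes.
        self.current = tid

    def active(self):
        return self.contexts[self.current]

core = Core()
core.switch_to(2)
print(core.active().tid)  # → 2
```

This is why a hardware thread switch can happen every cycle, whereas an OS-level context switch (which does spill and reload state through memory) cannot.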

nickbowman

One question that came to mind when we went over this slide in class: how might a chip designer make the optimal trade-off between the number of hardware threads in a core and the memory latency of accessing cache/DDR/etc.? In this diagram, the compute time and the memory latency (stall time) conveniently line up so that cycling between the 4 hardware threads hides the stall completely, giving 100% utilization. If the memory latency were longer, we would ideally want more hardware threads; but if we provisioned more threads than the latency requires, we might actually substantially delay the time for any one thread to complete. How are these trade-offs taken into consideration when designing a chip?
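The arithmetic behind this can be sketched in a few lines (my own illustrative numbers, not the slide's): a thread computes for `run` cycles and then stalls for `latency` cycles, and the other threads' combined run time must cover the stall for the core to stay busy.

```python
import math

# If one thread stalls for `latency` cycles, the remaining (T - 1)
# threads each run for `run` cycles; the core is fully utilized when
# (T - 1) * run >= latency.

def utilization(run, latency, threads):
    busy = threads * run
    period = run + latency          # one thread's compute + stall cycle
    return min(1.0, busy / period)

def threads_for_full_utilization(run, latency):
    # (T - 1) * run >= latency  =>  T >= latency / run + 1
    return math.ceil(latency / run) + 1

print(utilization(run=10, latency=30, threads=4))        # 1.0
print(threads_for_full_utilization(run=10, latency=30))  # 4
print(utilization(run=10, latency=50, threads=4))        # ~0.67: 4 threads no longer hide the stall
```

With longer latency you need proportionally more thread contexts (and thus more die area spent on register state) to keep the ALUs busy, which is exactly the design trade-off being asked about.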

ishangaur

Upvote @nickbowman

tp

@nickbowman I think the trade-off is between longer memory latency plus more threads to hide that latency, versus more expensive but much shorter memory latency. Ideally we would want short memory latency, but because of the cost, cheaper processors are going to need to go the long-latency + many-threads route.

arkhan

I asked this in OH, posting the resolution here. Q: Since ALUs are cheap, why not just add ALUs and fetch/decode blocks to each thread to make each hardware thread a full-fledged core? A: ALUs are cheap, but fetch/decode is very expensive; a hardware thread gives a significant improvement for only the cost of duplicating registers/context. The space that would have gone to extra fetch/decode can instead be used for cache or other more impactful things, depending on the design objectives.

jessiexu

@potato Switching threads is done by storing and fetching state in per-thread hardware registers, not in memory. The TLB is just another cache, so I don't see why it would need to be switched out.

potato

@jessiexu The TLB is often flushed when switching between processes so that malicious programs can't access the memory pages of another process. If the threads belong to the same process, they share the same virtual address space, in which case the TLB won't need to be flushed. (Many TLBs also tag entries with an address-space ID precisely to avoid full flushes on a switch.)
