Slide 79 of 116
leave

How do hardware companies decide how many cores, SIMD lanes, threads, etc. they want on their chips? It sounds like a lot of parameters to tune.

lindenli

I quite like this slide as a summary of all the different types of parallelism in play. There are 16 cores, meaning 16 different instruction streams, with 8 SIMD ALUs per core, meaning we can run 16 * 8 = 128 operations per clock. Each core also has 4 execution contexts, so it can hold four threads. This is a powerful multi-core chip, but an important takeaway is that there's a lot of room for low utilization, which undermines the benefit of all that hardware.

matteosantamaria

It's really mind-blowing to consider how many sophisticated optimizations are going on, all invisible to the programmer, when you run a program. I am also curious about how one actually uses SIMD technology. Do you have to write your program specifically for SIMD-enabled hardware?

To make sure I'm understanding this slide, we have 16 cores which is how we arrive at 16 simultaneous instruction streams (1-to-1 mapping between cores and simultaneous instruction streams). And then each core can support 4 threads, which is how we get 64 concurrent instruction streams?

kayvonf

@matteosantamaria. I think the answer to your question about SIMD will become very evident after you take on assignment 1.

You are correct in your comments about the number of simultaneously executing and concurrently "live" hardware threads.

apappu

Here, how do we get to the number of 512 independent pieces of work for max utilization?

My understanding was that we have 16 independent instruction streams that can execute in a given clock, each with 8 ALUs. So if each clock cycle executes a SIMD instruction on 8 pieces of data, we have full utilization, but that's only 16 * 8 = 128.

I assume what I am missing is that the above calculation assumes no latency issues. If we assume the hardware must be able to switch to a thread that has useful work to do, because the current thread will eventually hit some latency, then the calculation holds for each hardware thread, and we go from 128 to 128 * 4 = 512. Is that the correct logic?

juliewang

@apappu, your second calculation looks right. Since most threads will at some point need to access memory, here we are assuming that four threads per core are needed to achieve 100% utilization on that core. So, 16 cores * 4 threads per core * 8 SIMD ALUs per core = 512 independent pieces of work.

What's key is the final statement, "512 independent pieces of work are needed to run chip with maximal latency hiding ability"

finn

Are "cores" and "processors" used interchangeably?

martigp

Yes, I believe "cores" and "processors" are used interchangeably; this was addressed in Week 2's Tuesday lecture. In Professor Kayvon's slides, everything contained within a grey box is a core/processor.
