Slide 65 of 88
ruilin

I came across the concept of arithmetic intensity, which quantifies the number of arithmetic operations performed per piece of data (say, one float) fetched from memory. Programs with higher arithmetic intensity can potentially achieve higher GPU utilization, because in that case compute, rather than memory bandwidth, is the bottleneck. For example, when multiplying two n-by-n matrices, each number in the input matrices is used about n times, whereas in element-wise operations each input is used only once (low arithmetic intensity). I think this is why GPUs are particularly well suited for deep learning (each fully connected layer is a matrix-matrix multiplication).
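
A quick back-of-the-envelope sketch in C of that comparison (assuming 4-byte floats, a naive matrix multiply, and no caching effects; the counts are illustrative, not the slide's exact accounting):

```c
#include <stdio.h>

int main(void) {
    long n = 1024;  /* example matrix dimension */

    /* Element-wise multiply C[i] = A[i] * B[i]:
       1 multiply per element against 12 bytes of traffic
       (read A[i], read B[i], write C[i]). */
    double elementwise_ai = 1.0 / 12.0;

    /* Naive n-by-n matrix multiply: about 2*n^3 flops (a multiply
       and an add per inner-loop iteration) over roughly 3*n^2
       floats = 12*n^2 bytes of unique data, so each input value
       gets reused on the order of n times. */
    double matmul_ai = (2.0 * n * n * n) / (12.0 * n * n);

    printf("element-wise multiply: %.3f flops/byte\n", elementwise_ai);
    printf("%ldx%ld matmul:        %.1f flops/byte\n", n, n, matmul_ai);
    return 0;
}
```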

suninhouse

How exactly is 45 TB/sec computed?

nickbowman

The GPU can do 2560 multiplication operations every clock cycle, and there are 1.6 billion clock cycles per second (1.6 GHz clock), which means that the GPU can do 2560 x 1.6B = 4.096 trillion multiplications/second. In order to do each multiplication, you need to move 12 bytes to/from memory (read 4 bytes for A[i], read 4 bytes for B[i], and write 4 bytes for C[i]), so to keep the processor busy you would need to be moving 12 bytes x 4.096 trillion multiplications/second = ~49 trillion bytes every second. This is what dictates that you would need ~45 TB/sec of memory bandwidth in order to use the GPU at full efficiency.

mattlkf

The GTX 1080 has 2560 CUDA cores, hence 2560 MULs/clk if every "core" does one mul per clock.

The bandwidth needed is 12 bytes/mul * 2560 muls/clock * 1.6 x 10^9 clocks/sec, for a total of 4.9 x 10^13 bytes/sec. This works out to 44.7 TBytes/sec (don't forget, as I did initially, that the slide counts 1 TByte as 1024^4 bytes rather than 10^12 bytes)!

(Repost from Piazza)
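
The same arithmetic written out as a tiny C program, so the units are explicit (the numbers are the ones quoted above for the GTX 1080):

```c
#include <stdio.h>

int main(void) {
    double muls_per_clock = 2560.0;  /* one MUL per CUDA core per clock */
    double clock_hz       = 1.6e9;   /* 1.6 GHz */
    double bytes_per_mul  = 12.0;    /* read A[i], read B[i], write C[i] */

    double muls_per_sec  = muls_per_clock * clock_hz;     /* ~4.1e12 */
    double bytes_per_sec = muls_per_sec * bytes_per_mul;  /* ~4.9e13 */

    printf("needed: %.1f TB/sec if 1 TB = 10^12 bytes\n",
           bytes_per_sec / 1e12);
    printf("needed: %.1f TB/sec if 1 TB = 1024^4 bytes\n",
           bytes_per_sec / (1024.0 * 1024.0 * 1024.0 * 1024.0));
    return 0;
}
```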

Drew

I think Kayvon's explanation of the thought experiment that sets the latency to 0 really demonstrated the difference between bandwidth and latency well. It is very easy to confuse latency and bandwidth, and I have confused them before. However, when you think about what happens in a limiting case (latency set to 0, or bandwidth set to 0), it is much easier to see what each one means. In this thought experiment, if we set latency to 0 (from DRAM or cache or disk or whatnot), we are still at about 1% utilization of the GPU, because we can't shove data into the processor fast enough. The first bit gets to the GPU instantaneously, but each GB still takes some time to be fed to the processor. That time depends only on bandwidth, independent of latency, which is cool to see.
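
A tiny numeric version of that limiting case, using the figures quoted in the comments above (my own toy framing, not the slide's): with latency at zero, the achievable utilization is still capped by the ratio of available to required bandwidth.

```c
#include <stdio.h>

int main(void) {
    /* Latency does not appear anywhere below: even if each request is
       answered "instantly", data still arrives only at the bandwidth rate. */
    double bandwidth_available = 320e9;   /* bytes/sec the GPU memory delivers */
    double bandwidth_required  = 4.9e13;  /* bytes/sec needed to keep ALUs busy */

    double utilization = bandwidth_available / bandwidth_required;
    printf("max ALU utilization: %.2f%%\n", 100.0 * utilization);  /* ~0.65% */
    return 0;
}
```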

ysp

I guess this issue is almost inevitable in the example above, but we typically do many more operations after loading inputs onto the GPU, so in many cases it can be mitigated by paying more attention to the code. Unfortunately, many ML framework users who don't know much about GPU computing don't do this, which leads to a very low 'Volatile GPU-Util' reading when you run nvidia-smi.

xhe17

The example mentioned on this page, P65, relates to two concepts from P69: bandwidth-bounded applications and arithmetic intensity. This application is a bandwidth-bounded application, since the bandwidth needed to stream the two vectors A and B at the rate the processor could consume them is much larger than the bandwidth available. The arithmetic intensity, which is how many operations are performed relative to the amount of memory traffic needed to perform them, is very low in this application, which results in low GPU efficiency.

lonelymoon

Can we derive the number of CUDA cores, 2560, from the values in the lecture notes?

lonelymoon

Also, how is "4.2x faster" computed?

kayvonf

@lonelymoon -- since the computation is bandwidth bound, the rate at which each machine can perform instructions is determined by the memory bandwidth of that machine. (Recall that bandwidth bound means the processor is always "waiting" for the next piece of data it needs to arrive.) Here the GPU in question has 320 GB/sec of bandwidth, and the CPU in question is attached to a memory system with 76 GB/sec of bandwidth: 320/76 = 4.2.

ChrisGabor

Just wanted to add a question here on the topic of bandwidth. Let's say we have 4 DDR4 slots: could one CPU with 4 cores access 4x more bandwidth by accessing all 4 slots at the same time (one slot for each core), or would we still be limited to a bus bandwidth of 76 GB/s? If so, how can we, writing application-level programs, choose to allocate our data across 4 separate physical locations? I've seen mentions of using Spark to split data up among multiple processors to improve bandwidth, and I'm wondering whether the same principle could be applied to multi-core CPUs or GPUs.

felixw17

I'm a bit late in adding a comment here, but I wanted to point out the connection between this idea and Program 5 (saxpy) of the first assignment. When I was reading through the description of saxpy, I remember thinking that we had seen a really similar concept in lecture but I couldn't remember the details, so I searched through the lecture slides and found this! Then I realized that the issue with the saxpy program was precisely memory bandwidth. In particular, my partner and I found that the memory throughput of the saxpy program, both without tasks and with tasks, was ~26 GB/s, which seems like a very reasonable practical ceiling given the 76 GB/s peak number above.
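
For reference, a minimal C sketch of saxpy (not the assignment's actual code) with the per-element traffic counted the same way as the A*B example on this slide; the 26 GB/s figure is the measurement quoted in the comment above:

```c
#include <stdio.h>
#include <stdlib.h>

/* saxpy: result[i] = scale * x[i] + y[i].
   Per element: 2 flops against ~12 bytes of traffic visible in the code
   (read x[i], read y[i], write result[i]), so, like the A*B example on
   this slide, it is limited by memory bandwidth rather than compute. */
static void saxpy(long n, float scale, const float *x, const float *y,
                  float *result) {
    for (long i = 0; i < n; i++)
        result[i] = scale * x[i] + y[i];
}

int main(void) {
    long n = 1L << 24;  /* 16M elements; an arbitrary size for illustration */
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));
    float *r = malloc(n * sizeof(float));
    if (!x || !y || !r) return 1;

    for (long i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(n, 2.0f, x, y, r);

    /* At a measured ~26 GB/s, the traffic alone bounds how fast one call can run. */
    double bytes = 3.0 * (double)n * sizeof(float);
    printf("traffic: %.0f MB -> at 26 GB/s, at best %.1f ms per call\n",
           bytes / 1e6, 1e3 * bytes / 26e9);

    free(x); free(y); free(r);
    return 0;
}
```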
