Previous | Next --- Slide 19 of 62
Back to Lecture Thumbnails
haiyuem

Ideally BLOCKSIZE needs to be as big as possible, because it's bringing us higher arithmetic intensity. However, we want blocks to fit into the cache. So the best BLOCKSIZE should be the same as cache size - the biggest we can get without cache thrashing.

rmjones

Haiyue is spot-on, except I think technically you want BLOCKSIZE to be something like sqrt(cache size / 3) so that you can fit 3 blocks for A, B, and C at the same time.

rmjones

And that's the number of bytes you'd want, I think you'd need to divide by sizeof(int) as well to get the actual number of integers in each block side.

haiyuem

@rmjones Thanks for the correction! This is more precise.

gleb

If we're trying to maximize BLOCKSIZE, does it matter whether data is stored contiguously in memory?

wooloo

@gleb not totally sure, but my impression is it does, assuming the cache lines hold more than a single element. Even though the main issue here is temporal locality (where we're evicting and reloading elements of A and B to the cache because they're used too far apart in time) we need the highlighted adjacent elements on the slide to be loaded into the cache together, due to how the convolution works.

Drew

This approach can be described as a sort of nested matrix product, where we are reducing the input and output matrices by a factor of BLOCKSIZE and computing the matrix product of the resulting matrices. However, to do this matrix product, we do many smaller matrix products between single elements of the reduced matrices, which are 32x32 square matrix products. I may not have explained this idea in words well, but it's illustrated well on this slide.

dishpanda

One key takeaway from this class: a lot of the times making things faster is not about throwing more compute "workers" at it but rather dealing with memory/cache more efficiently.

msere

Specifically, we want a higher block size as if a block fits in one level one level of memory, then cache misses will occur occur for all n elements of each dimension D of an N-D hiearchy. The smaller the block size, the more this will be (e.g.) the divisions of a square add up to more when the tiles are smaller)

Please log in to leave a comment.