If BLOCKSIZE_I is small (much smaller than SIMD_WIDTH), vectorizing along I leaves most SIMD lanes idle, so we cannot fully utilize all the SIMD ALUs. With a small BLOCKSIZE_I it is therefore better to use another scheme, described in the next slide. By transposing the matrix B, we make the elements along K contiguous in memory for both A and B, so we can still fully utilize SIMD by vectorizing along BLOCKSIZE_K.
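The transposed-B scheme can be sketched in plain C as follows. This is a minimal illustration and not the slides' actual kernel: the function name `matmul_bt`, the row-major layout, and the value `SIMD_WIDTH = 8` are assumptions made for the example. The inner loop over `l` stands in for one SIMD multiply-add across all lanes, and it reads `A` and `Bt` with unit stride along K.

```c
#include <stddef.h>

/* Assumed lane count, for illustration only. */
enum { SIMD_WIDTH = 8 };

/* Computes C[i][j] += sum over k of A[i][k] * Bt[j][k],
 * where Bt is B transposed (Bt[j][k] == B[k][j]).
 * All matrices are row-major: A is n_i x n_k, Bt is n_j x n_k, C is n_i x n_j. */
void matmul_bt(size_t n_i, size_t n_j, size_t n_k,
               const float *A, const float *Bt, float *C)
{
    for (size_t i = 0; i < n_i; ++i) {
        for (size_t j = 0; j < n_j; ++j) {
            /* One partial sum per SIMD lane. */
            float lane[SIMD_WIDTH] = {0.0f};
            size_t k = 0;

            /* Vectorizable part: SIMD_WIDTH consecutive k values per step.
             * Both operands are contiguous along k, so every lane is busy
             * even when n_i is 1. */
            for (; k + SIMD_WIDTH <= n_k; k += SIMD_WIDTH) {
                for (size_t l = 0; l < SIMD_WIDTH; ++l) {
                    lane[l] += A[i * n_k + k + l] * Bt[j * n_k + k + l];
                }
            }

            /* Horizontal reduction of the lane sums. */
            float sum = 0.0f;
            for (size_t l = 0; l < SIMD_WIDTH; ++l) {
                sum += lane[l];
            }

            /* Scalar tail when n_k is not a multiple of SIMD_WIDTH. */
            for (; k < n_k; ++k) {
                sum += A[i * n_k + k] * Bt[j * n_k + k];
            }

            C[i * n_j + j] += sum;
        }
    }
}
```

The cost of this scheme is the horizontal reduction at the end of each (i, j) pair, which the I-vectorized scheme does not need. It pays off only when the block along K is long enough to amortize that reduction.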