jyeung27

From lecture, this diagram basically outlines why SIMD isn't always effective: with conditional code, only the lanes where the condition is true can execute at the same time, while the lanes where it is false sit idle until their branch gets executed. Ideally, all 8 lanes would be doing useful work at the same time.

suninhouse

How exactly did we "cross out" certain instructions, or the results of certain instructions, based on the conditional value? In other words, what exactly is the "crossing out" doing? There may be multiple ways to "cross out", but what are the typical practices?

kostun

@suninhouse good question. also curious about this.

could the answer depend on which SIMD instructions the code is compiled down to?

could something like this work for the above example?

execute all the (x > 0) tests in parallel and store the 8 results. then execute each instruction of the body on all 8 lanes in parallel (regardless of true/false). when the conditional body is done, mask each ALU's result with the saved value of (x > 0). so the saved (x > 0) result could be used directly as the mask for the true case, but would have to be flipped for the else case.
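A rough sketch of the scheme described above, emulating 8 lanes with plain Python lists. This is only an illustration, not what any particular ISA does: `WIDTH`, `masked_if_else`, and the two branch bodies (`2 * v` and `0`) are made-up placeholders, and real hardware would apply the mask per instruction rather than per branch.

```python
WIDTH = 8  # assumed number of SIMD lanes


def masked_if_else(x, then_op, else_op):
    """Emulate `if (x > 0) then_op else else_op` across WIDTH lanes."""
    assert len(x) == WIDTH

    # 1. Evaluate the condition on every lane and save it as the mask.
    mask = [xi > 0 for xi in x]
    result = [None] * WIDTH

    # 2. "True" pass: every lane computes, but only lanes whose mask bit
    #    is set write their result back.
    then_vals = [then_op(xi) for xi in x]
    result = [t if m else r for t, m, r in zip(then_vals, mask, result)]

    # 3. "False" pass: same thing with the mask flipped.
    else_vals = [else_op(xi) for xi in x]
    result = [e if not m else r for e, m, r in zip(else_vals, mask, result)]

    return result, mask


x = [1, -2, 3, -4, 5, -6, 7, -8]
out, mask = masked_if_else(x, lambda v: 2 * v, lambda v: 0)
print(out)  # [2, 0, 6, 0, 10, 0, 14, 0]
```

Note that both passes occupy all 8 lanes even though each lane keeps only one of the two results, which is where the lost throughput comes from.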

kayvonf

@kostun, @suninhouse -- you're on the right track, and this will be very clear after Assignment 1, program 2. For example, an ISA might support a "masked" operation, which essentially disregards an instruction's output for specific vector lanes. This can be thought of as not doing the operation for that lane, or as doing the operation and not writing the result to the target register.

For example: https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-vector-extensions-2/intrinsics-for-masked-load-store-operations.html

wzz

For an example that illustrates the worst-case 1/8 peak performance mentioned in lecture, consider an 8-wide SIMD unit running an if-else where the condition is true on only one of the eight lanes. If the true branch has 10000 instructions and the false branch has only 1 instruction, then (assuming one instruction per cycle) the SIMD unit still needs 10001 clock cycles, even though only 10000 + 7*1 = 10007 of the 8 * 10001 = 80008 arithmetic operations it executed were useful. In the limit this works out to about 1 useful arithmetic operation per cycle, compared to the peak of 8 per cycle.
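To double-check the arithmetic above, here is a small back-of-the-envelope calculation. It assumes one instruction issues per cycle and that a branch is skipped entirely when no lane takes it; `simd_utilization` and its parameter names are just made up for this sketch.

```python
def simd_utilization(width, lanes_true, true_len, false_len):
    """Return (cycles, useful_ops, utilization) for one if/else on a SIMD unit."""
    lanes_false = width - lanes_true

    # A branch only costs cycles if at least one lane takes it.
    cycles = (true_len if lanes_true > 0 else 0) + \
             (false_len if lanes_false > 0 else 0)

    # Useful work: each lane only needs the branch it actually takes.
    useful_ops = lanes_true * true_len + lanes_false * false_len

    # Peak work: every lane busy on every cycle.
    peak_ops = width * cycles
    return cycles, useful_ops, useful_ops / peak_ops


cycles, useful, util = simd_utilization(8, 1, 10000, 1)
print(cycles, useful)       # 10001 10007
print(round(util, 4))       # 0.1251, i.e. roughly 1/8 of peak
print(useful / cycles)      # roughly 1 useful op per cycle
```

With all 8 lanes taking the same branch (`lanes_true=8` or `lanes_true=0`), the same function reports a utilization of 1.0, which is the coherent case the hardware is designed for.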

nickbowman

@wzz Thanks for the summary of the worst case scenario from class, that's super helpful!