To expand on the locality issue here: looking at the previous slides, we see that many of the memory accesses this algorithm requires are non-contiguous and often span very large swaths of memory. Such an access pattern is not cache-friendly, so the algorithm suffers degraded performance from this lack of locality.
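To make that concrete, here is a tiny, hypothetical illustration (my own, not from the slides), assuming the up-sweep indexing pattern where the stride doubles at each level and 4-byte elements: by the last few levels, each update combines two elements that are a quarter or half of the array apart, which is far larger than any cache line.

```cpp
// Illustration only: print how far apart the two elements combined at each
// up-sweep level are. The gap doubles every level, so the later levels
// stride across most of the array.
#include <cstddef>
#include <cstdio>

int main() {
    const int n = 1 << 20;  // hypothetical array length
    for (int stride = 1; stride < n; stride *= 2) {
        // each update at this level reads a[i - stride] and writes a[i]
        printf("level stride = %7d elements (%9zu bytes apart for 4-byte ints)\n",
               stride, (size_t)stride * sizeof(int));
    }
    return 0;
}
```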
Can someone explain again why we do an up-sweep and a down-sweep?
The visualization on the next slide explains the two sweeps. I think this page also helps: https://en.wikipedia.org/wiki/Prefix_sum#Shared_memory:_Two-level_algorithm
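In case a code view helps, here is a minimal sequential sketch of the two phases (a Blelloch-style in-place exclusive scan, not necessarily the slides' exact code), assuming the array length is a power of two. In the parallel version, the iterations of each inner loop run concurrently.

```cpp
// Sketch of the work-efficient exclusive scan: up-sweep then down-sweep.
// Assumes a.size() is a power of two.
#include <cstddef>
#include <vector>

void exclusive_scan(std::vector<int>& a) {
    const size_t n = a.size();

    // Up-sweep (reduce): build partial sums in a tree; a[n-1] ends up with the total.
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 2 * stride - 1; i < n; i += 2 * stride)
            a[i] += a[i - stride];

    // Down-sweep: clear the root, then push prefix sums back down the tree.
    a[n - 1] = 0;
    for (size_t stride = n / 2; stride >= 1; stride /= 2)
        for (size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            int left = a[i - stride];   // left child's partial sum
            a[i - stride] = a[i];       // left child gets the parent's prefix
            a[i] += left;               // right child gets parent's prefix + left sum
        }
}
```

The up-sweep is just a parallel reduction that leaves partial sums in the tree; the down-sweep reuses those partial sums so that every prefix comes out in O(lg N) span with O(N) total work.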
@wzz could you expand on why the constant factors are large? In terms of work, if we compute the actual work as N + N/2 + ... = 2N, and we do that twice (up-sweep and down-sweep), isn't the constant factor just 4? Also, for the span, is the constant factor just 2?
I don't see the incentive to use this algorithm, since the span doesn't change at all. Is this algorithm better in the sense that we need fewer threads to exploit the parallelism?
Span gives the minimum execution time of an algorithm assuming a machine with infinite parallelism. But a real machine has finite parallel processing resources. Imagine the number of processors is 100 and N here is 1,000,000. It seems useful to use an algorithm that does O(N) work here instead of O(N lg N), right? The constant factor here on the total amount of work is about 3, but lg(1,000,000) ~ 20.
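A quick back-of-the-envelope check of those numbers (my own arithmetic, not from the slides), using the usual work-span estimate T_P ≈ work/P + span and the rough constant factors mentioned in this thread:

```cpp
// Back-of-the-envelope comparison using T_P ~ work/P + span.
// Constant factors are the rough ones discussed above: ~3N work and ~2 lg N
// span for the work-efficient scan, ~N lg N work and ~lg N span for the
// O(N lg N) version.
#include <cmath>
#include <cstdio>

int main() {
    const double N = 1e6;              // problem size from the example above
    const double P = 100;              // number of processors from the example above
    const double lgN = std::log2(N);   // ~20

    double efficient = 3 * N * 1.0 / P + 2 * lgN;  // ~30,000 time steps
    double nlogn     = N * lgN / P + lgN;          // ~200,000 time steps

    printf("work-efficient: ~%.0f time steps\n", efficient);
    printf("O(N lg N):      ~%.0f time steps\n", nlogn);
    return 0;
}
```

With P = 100 and N = 1,000,000 the span terms are negligible either way, so total work dominates and the O(N)-work algorithm comes out roughly 6-7x faster in this estimate.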
This algorithm is efficient in terms of raw # of computations, but it has large constant overhead and locality issues. Thus, the algorithm described in the next few slides often runs faster in practice.