To expand on the locality issue here: looking at the previous slides, we see that many of the memory accesses this algorithm requires are non-contiguous and often span very large swaths of memory. Such an access pattern is not cache-friendly, so the algorithm suffers degraded performance from this lack of locality.
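To make that concrete, here is a tiny, hypothetical illustration (my own, not from the slides), assuming the up-sweep indexing pattern where the stride doubles at each level and 4-byte elements: by the last few levels, each update combines two elements that are a quarter or half of the array apart, which is far larger than any cache line.

```cpp
// Illustration only: print how far apart the two elements combined at each
// up-sweep level are. The gap doubles every level, so the later levels
// stride across most of the array.
#include <cstddef>
#include <cstdio>

int main() {
    const int n = 1 << 20;  // hypothetical array length
    for (int stride = 1; stride < n; stride *= 2) {
        // each update at this level reads a[i - stride] and writes a[i]
        printf("level stride = %7d elements (%9zu bytes apart for 4-byte ints)\n",
               stride, (size_t)stride * sizeof(int));
    }
    return 0;
}
```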
Can someone explain again why we do an up-sweep and a down-sweep?
The visualization on the next slide explains the two sweeps. I think this page also helps: https://en.wikipedia.org/wiki/Prefix_sum#Shared_memory:_Two-level_algorithm
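In case a code view helps, here is a minimal sequential sketch of the two phases (a Blelloch-style in-place exclusive scan, not necessarily the slides' exact code), assuming the array length is a power of two. In the parallel version, the iterations of each inner loop run concurrently.

```cpp
// Sketch of the work-efficient exclusive scan: up-sweep then down-sweep.
// Assumes a.size() is a power of two.
#include <cstddef>
#include <vector>

void exclusive_scan(std::vector<int>& a) {
    const size_t n = a.size();

    // Up-sweep (reduce): build partial sums in a tree; a[n-1] ends up with the total.
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 2 * stride - 1; i < n; i += 2 * stride)
            a[i] += a[i - stride];

    // Down-sweep: clear the root, then push prefix sums back down the tree.
    a[n - 1] = 0;
    for (size_t stride = n / 2; stride >= 1; stride /= 2)
        for (size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            int left = a[i - stride];   // left child's partial sum
            a[i - stride] = a[i];       // left child gets the parent's prefix
            a[i] += left;               // right child gets parent's prefix + left sum
        }
}
```

The up-sweep is just a parallel reduction that leaves partial sums in the tree; the down-sweep reuses those partial sums so that every prefix comes out in O(lg N) span with O(N) total work.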
@wzz could you expand on why the constant factors are large? In terms of work, if we compute the actual work as N + N/2 + ... = 2N, and we do that twice (up-sweep and down-sweep), isn't the constant factor just 4? Also, for the span, is the constant factor just 2?
I don't see the incentive to use this algorithm, since the span doesn't change at all. Is this algorithm better in the sense that we need fewer threads to exploit the parallelism?
Span gives the minimum execution time of an algorithm assuming a machine with infinite parallelism. But a real machine has finite parallel processing resources. Imagine the number of processors is 100 and N here is 1,000,000. It seems useful to use an algorithm that does O(N) work here instead of O(N lg N), right? The constant factor here on the total amount of work is about 3, but lg(1,000,000) ~ 20.
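A quick back-of-the-envelope check of those numbers (my own arithmetic, not from the slides), using the usual work-span estimate T_P ≈ work/P + span and the rough constant factors mentioned in this thread:

```cpp
// Back-of-the-envelope comparison using T_P ~ work/P + span.
// Constant factors are the rough ones discussed above: ~3N work and ~2 lg N
// span for the work-efficient scan, ~N lg N work and ~lg N span for the
// O(N lg N) version.
#include <cmath>
#include <cstdio>

int main() {
    const double N = 1e6;              // problem size from the example above
    const double P = 100;              // number of processors from the example above
    const double lgN = std::log2(N);   // ~20

    double efficient = 3 * N * 1.0 / P + 2 * lgN;  // ~30,000 time steps
    double nlogn     = N * lgN / P + lgN;          // ~200,000 time steps

    printf("work-efficient: ~%.0f time steps\n", efficient);
    printf("O(N lg N):      ~%.0f time steps\n", nlogn);
    return 0;
}
```

With P = 100 and N = 1,000,000 the span terms are negligible either way, so total work dominates and the O(N)-work algorithm comes out roughly 6-7x faster in this estimate.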
This algorithm is efficient in terms of raw # of computations, but it has large constant overhead and locality issues. Thus, the algorithm described in the next few slides often runs faster in practice.