Increasing ParO reduces the number of iterations Pipe.Reduce needs. At ParO=2 there is still enough memory bandwidth, so the runtime drops by a factor of 2. At ParO=4, however, we run out of memory bandwidth, so the DRAM transfers take longer. As a result, performance does not improve further even though Pipe.Reduce runs even fewer iterations.
marwan
I don't get how doubling ParO didn't cause the bandwidth to limit our performance. Does this mean that our bandwidth was enough for two iterations in parallel? And if that is the case, I have a silly question: can we use a ParO value of 3 if the bandwidth is enough and the number of iterations is divisible by 3?
a7hu
When ParO goes from 2 to 4, the DRAM transfer time doubles because we run out of memory bandwidth. The system is memory bound, so increasing the outer parallelization gives no further performance gain.
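To make the memory-bound argument concrete, here is a rough toy model in Python. It is only a sketch under assumptions that are not from the lab itself: the function name `estimated_runtime`, the parameter `lanes_at_full_bw` (how many parallel lanes DRAM can feed at full speed, set to 2 here to match the observation above), and the made-up unit times are all hypothetical. It also assumes the load and compute stages overlap, so each outer step costs the slower of the two.

```python
import math


def estimated_runtime(n_iters, par_o, t_load, t_compute, lanes_at_full_bw):
    """Toy runtime estimate for an outer loop parallelized by par_o.

    Hypothetical model, not measured behavior:
    - the loop runs ceil(n_iters / par_o) outer steps;
    - each step loads data for par_o lanes at once, and the load slows
      down proportionally once par_o exceeds lanes_at_full_bw;
    - load and compute overlap, so a step costs the slower of the two.
    """
    outer_steps = math.ceil(n_iters / par_o)
    # Bandwidth is shared: beyond the saturation point, every load stretches.
    slowdown = max(1.0, par_o / lanes_at_full_bw)
    step_time = max(t_load * slowdown, t_compute)
    return outer_steps * step_time


if __name__ == "__main__":
    # 12 iterations, unit load and compute times, bandwidth saturates at 2 lanes.
    for par_o in (1, 2, 3, 4):
        print(par_o, estimated_runtime(12, par_o, 1.0, 1.0, 2))
```

With these made-up numbers the estimate is 12.0 for ParO=1 and 6.0 for ParO=2, 3, and 4. In this model, ParO=3 is a valid setting when the iteration count is divisible by 3, but it does not help once the bandwidth is already saturated at 2 lanes: the fewer steps are cancelled out by slower loads.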