Even though this algorithm is less parallelizable than the previous one, it performs better on a two-core CPU: it is work-efficient, maintains high cache locality, and still uses all the parallelism the system can actually exploit. The broader lesson is that big-O analysis alone is not sufficient to determine the best implementation of an algorithm for a particular piece of hardware.
Then, we will extend this idea to the CUDA implementation...