Even though this algorithm is less parallelizable than the previous one, it performs better on a two-core CPU: it is work-efficient, maintains high cache locality, and still uses all the parallelism the system can actually exploit. The broader lesson is that big-O analysis alone is not sufficient to determine the best implementation of an algorithm for a particular piece of hardware.
Then, we will extend this idea to the CUDA implementation...