These days, the main lever for performance is data locality: keeping data close to the processor that uses it. That can be worth on the order of a 1000x speedup. Better workload balance matters too, but it is typically worth only a 4-8x speedup.
Here is an example of an API for controlling data locality in multithreaded programs, the thread affinity interface in Intel's OpenMP runtime: https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming-guide/openmp-support/openmp-library-support/thread-affinity-interface-linux-and-windows.html
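
To give a rough feel for what pinning work to a core looks like, here is a minimal sketch in Python. It is not the Intel interface from the link; it uses the OS-level `os.sched_setaffinity` call, which is only available on some Unix platforms such as Linux. The helper name `pin_to_one_cpu` and its `cpu_index` parameter are made up for illustration.

```python
import os


def pin_to_one_cpu(cpu_index=0):
    """Restrict the current process to a single CPU it is already allowed to use.

    Hypothetical helper for illustration only. Returns the resulting
    affinity set, or None where os.sched_setaffinity is unavailable.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # platform does not expose this call
    allowed = sorted(os.sched_getaffinity(0))    # CPUs currently permitted
    target = allowed[cpu_index % len(allowed)]   # pick one of the permitted CPUs
    os.sched_setaffinity(0, {target})            # pid 0 means the current process
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    print("now restricted to CPUs:", pin_to_one_cpu(0))
```

Keeping a thread on one core means the data it touched recently is more likely to still be in that core's cache, which is the locality effect described above. How much it helps depends heavily on the workload and the machine.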