This is an interesting approach where each worker spends more time computing gradients on larger batches to increase the arithmetic intensity. SGD with smaller batches tends to require less computation per gradient step and less overall computation to reach convergence, which is good on a single CPU. It seems there would be an interesting tradeoff: a batch size large enough to hide latency while retaining the benefits of small batches.
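To make the tradeoff concrete, here is a minimal sketch (my own toy setup, not from the paper) of mini-batch SGD on least-squares regression, where `batch_size` trades per-step cost against gradient noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1024, 8
X = rng.normal(size=(n, d))            # synthetic design matrix (assumed setup)
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def sgd_error(batch_size, steps=500, lr=0.05):
    """Run mini-batch SGD and return the distance to the true weights."""
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch_size)
        # Larger batches cost more per step but yield a lower-variance gradient.
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        w -= lr * grad
    return np.linalg.norm(w - w_true)

for b in (1, 16, 256):
    print(b, sgd_error(b))
```

With a fixed step budget, the large-batch runs land closer to `w_true` but do far more arithmetic per step; the small-batch runs make cheap, noisy steps. Where the sweet spot sits depends on how much of that extra per-step arithmetic can be used to hide communication latency.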
jgrace
This is quite an interesting point. Large mini-batches can also provide more stable gradient updates than smaller ones, which can help the model learn faster.