In this chapter, we learned that distributing the training process across multiple computing cores can be more advantageous than increasing the number of threads used in traditional training. This happens because PyTorch can hit a limit on the level of parallelism it is able to exploit in the regular training process.
To distribute the training among multiple computing cores located in a single machine, we can use Gloo, a simple communication backend that ships with PyTorch by default. Our results showed that distributed training with Gloo improved performance by 25% while retaining the same model accuracy.
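As a reminder of what selecting Gloo looks like in code, here is a minimal sketch rather than the chapter's actual training script. The function name `train_one_step`, the toy linear model, the random batch, and the file-based rendezvous are all assumptions made for illustration; only `backend="gloo"` reflects the setup described above.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def train_one_step(rank: int, world_size: int, init_file: str) -> float:
    """Join a Gloo process group, run one DDP training step, return the loss.

    Hypothetical helper: the name, model, and data are illustrative only.
    """
    # Every participating process calls this with its own rank.
    dist.init_process_group(
        backend="gloo",                    # CPU backend bundled with PyTorch
        init_method=f"file://{init_file}", # assumed rendezvous via a shared file
        rank=rank,
        world_size=world_size,
    )
    try:
        torch.manual_seed(0)
        model = DDP(nn.Linear(4, 2))       # toy model standing in for a real one
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        inputs, targets = torch.randn(8, 4), torch.randn(8, 2)

        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(inputs), targets)
        loss.backward()                    # DDP all-reduces gradients over Gloo here
        optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    # Single-process smoke test; a real run starts one process per worker.
    store = os.path.join(tempfile.mkdtemp(), "rendezvous")
    print(train_one_step(rank=0, world_size=1, init_file=store))
```

In a real run, each worker process would call the function with its own rank and a shared `world_size`, typically started by a launcher such as `torchrun`, which normally supplies the rendezvous details through environment variables instead of a file.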
We also learned that oneCCL, Intel's collective communication library, can accelerate the training process even further when executed on Intel platforms. With Intel oneCCL as the communication backend, we reduced the training time by more than 40%. If we are willing to accept a slight reduction in model accuracy, it is possible to train the model twice as fast.
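Because these results mix two ways of reporting a gain (a percentage cut in training time and a "times faster" factor), it can help to convert between them. The short sketch below does only that arithmetic; the function names are hypothetical and the inputs are the figures quoted above.

```python
def speedup_from_time_reduction(reduction: float) -> float:
    """Convert a fractional cut in training time into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


def time_reduction_from_speedup(speedup: float) -> float:
    """Convert a speedup factor into the fractional cut in training time."""
    if speedup < 1.0:
        raise ValueError("speedup must be at least 1")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    # A 40% cut in training time corresponds to a speedup of about 1.67x.
    print(round(speedup_from_time_reduction(0.40), 2))
    # Training twice as fast corresponds to a 50% cut in training time.
    print(time_reduction_from_speedup(2.0))
```

Read this way, a training-time reduction of more than 40% means a speedup of more than roughly 1.67x, while training twice as fast means halving the training time.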
In the...