Summary
In this chapter, we learned how to distribute the training process across multiple GPUs located on multiple machines. We used Open MPI as the launch provider and NCCL as the communication backend.
We decided to use Open MPI as the launcher because it provides an easy and elegant way to create distributed processes on remote machines. Although Open MPI can also be employed as the communication backend, it is preferable to adopt NCCL, since it provides the most optimized implementation of collective operations for NVIDIA GPUs.
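As a reminder of how these two roles fit together, the sketch below shows one possible way to wire them up: Open MPI launches one process per GPU and exposes rank information through its OMPI_COMM_WORLD_* environment variables, while torch.distributed uses the NCCL backend for the GPU-to-GPU communication. This is a minimal illustration, not the chapter's exact script; the hostnames, port, and script name are placeholders.

```python
import os
import torch
import torch.distributed as dist

# Each process launched by mpirun reads its rank information from
# Open MPI's environment variables.
rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])

# MASTER_ADDR and MASTER_PORT are assumed to be exported to every process
# (e.g., via mpirun -x) so the default env:// rendezvous can form the group.
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

# Bind this process to its local GPU before building the model and wrapping
# it with DistributedDataParallel.
torch.cuda.set_device(local_rank)
```

A launch command along these lines (placeholder hostnames and port) would start 16 processes across the two machines: `mpirun -np 16 -H machine1:8,machine2:8 -x MASTER_ADDR=machine1 -x MASTER_PORT=29500 python train.py`.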
Results showed that distributed training with 16 GPUs on two machines was 70% faster than running with 8 GPUs on a single machine. The model accuracy decreased from 68.82% to 63.73%, which is expected, since we doubled the number of model replicas in the distributed training process.
This chapter ends our journey of learning how to accelerate the training process with PyTorch. More than knowing how to apply techniques and methods to speed...