Summary
In this chapter, we learned how to distribute the training process across multiple GPUs by using NCCL, the optimized NVIDIA library for collective communication.
We started this chapter by understanding how a multi-GPU environment employs distinct technologies to interconnect devices. Depending on the technology and interconnection topology, the communication between devices can slow down the entire distributed training process.
After being introduced to the multi-GPU environment, we learned how to code and launch distributed training on multiple GPUs by using NCCL as the communication backend and torchrun as the launch provider.
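As a reminder of how these pieces fit together, the following is a minimal sketch of a script meant to be launched by torchrun; the toy model, data, and the file name minimal_ddp.py are hypothetical placeholders, not the chapter's actual training code.

```python
# minimal_ddp.py -- minimal sketch of multi-GPU training with the NCCL backend,
# intended to be started by torchrun (hypothetical toy model and data).
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK, RANK, and WORLD_SIZE for each process it spawns.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # NCCL is the collective-communication backend optimized for NVIDIA GPUs.
    dist.init_process_group(backend="nccl")

    # Hypothetical toy model, wrapped in DistributedDataParallel.
    model = nn.Linear(16, 2).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(10):
        inputs = torch.randn(32, 16, device=local_rank)
        targets = torch.randint(0, 2, (32,), device=local_rank)

        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), targets)
        loss.backward()   # gradients are all-reduced across GPUs via NCCL
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

With 8 GPUs on a single machine, such a script would be launched with a command along the lines of torchrun --nproc_per_node=8 minimal_ddp.py, letting torchrun create one process per GPU.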
The experimental evaluation of our multi-GPU implementation showed that distributed training with 8 GPUs was 6.5 times faster than running with a single GPU, which is an impressive performance improvement. We also learned that model accuracy can be affected by performing distributed training on multiple GPUs, so we must take this impact into account when adopting the approach.