Training with Multiple Machines
We’ve finally arrived at the last mile of our performance improvement journey. In this final stage, we will broaden our horizons and learn how to distribute the training process across multiple machines or servers. Instead of using four or eight devices, we can then use dozens or hundreds of computing resources to train our models.
An environment composed of multiple connected servers is usually called a computing cluster or simply a cluster. Such environments are shared among multiple users and have technical particularities, such as a high-bandwidth, low-latency network.
In this chapter, we’ll describe the characteristics of computing clusters that are most relevant to the distributed training process. After that, we will learn how to distribute the training process across multiple machines using Open MPI as the launcher and NCCL as the communication backend.
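As a brief preview, the sketch below shows what this combination looks like in practice: Open MPI launches one process per GPU across the machines, and each process initializes PyTorch's distributed package with the NCCL backend. The script name, host names, and port are placeholders, and the full setup is explained step by step later in the chapter.

```python
# Minimal sketch: initializing NCCL-backed distributed training when the
# script is launched by Open MPI (names and hosts below are illustrative).
#
# Example launch across two machines with eight GPUs each:
#   mpirun -np 16 -H node1:8,node2:8 python train.py
import os

import torch
import torch.distributed as dist

# Open MPI exposes the process layout through these environment variables.
rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])

# PyTorch's default rendezvous (env://) needs a master address and port;
# Open MPI does not set these, so we provide placeholders here.
os.environ.setdefault("MASTER_ADDR", "node1")
os.environ.setdefault("MASTER_PORT", "29500")

# Each process drives one GPU; NCCL handles the inter-GPU communication.
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
```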
Here is what you will learn as part of this chapter:
- The most...