A first look at distributed training
We’ll start this chapter by discussing the reasons for distributing the training process among multiple resources. Then, we’ll look at the resources that are commonly used to execute this process.
When do we need to distribute the training process?
The most common reason to distribute the training process is to accelerate model building. If the training process is taking a long time to complete and we have multiple resources at hand, we should consider distributing the work among those resources to reduce the training time.
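As a back-of-the-envelope illustration of this speedup, the sketch below estimates training time under a simple linear-scaling model with a per-device efficiency discount for communication overhead. All numbers (the 24-hour baseline, the 0.8 efficiency factor) are hypothetical, not measurements:

```python
def estimated_training_time(single_device_hours, num_devices, efficiency=0.8):
    """Estimate distributed training time.

    Real systems scale sub-linearly, so each extra device contributes
    only a fraction (``efficiency``) of an ideal device's speedup.
    """
    speedup = 1 + (num_devices - 1) * efficiency
    return single_device_hours / speedup

baseline = 24.0  # hours on a single device (hypothetical)
for n in (1, 2, 4, 8):
    print(f"{n} device(s): ~{estimated_training_time(baseline, n):.1f} h")
```

With these assumptions, 8 devices cut a 24-hour run to roughly 3.6 hours rather than the ideal 3 hours, which is the kind of gap to expect from synchronization and communication costs.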
The second motivation for going distributed is the inability to fit a large model into the memory of a single resource. In this situation, we rely on distributed training to allocate different parts of the large model to distinct devices or resources so that the whole model can be loaded into the system.
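The idea of splitting a model across devices can be sketched with a simple greedy partitioner. This is an illustrative toy, not a real framework's placement algorithm; the layer sizes and the 16 GB device capacity below are made-up numbers:

```python
def partition_layers(layer_sizes, device_capacity):
    """Greedily assign consecutive layers to devices, starting a new
    device whenever the current one would run out of memory.

    Returns a list of devices, each a list of layer indices.
    """
    devices, current, used = [], [], 0.0
    for i, size in enumerate(layer_sizes):
        if size > device_capacity:
            raise ValueError(f"layer {i} alone exceeds device capacity")
        if used + size > device_capacity:
            devices.append(current)
            current, used = [], 0.0
        current.append(i)
        used += size
    devices.append(current)
    return devices

# A 38 GB model (hypothetical layer sizes in GB) that cannot fit
# on any single 16 GB device:
layers = [6.0, 6.0, 6.0, 6.0, 6.0, 4.0, 4.0]
print(partition_layers(layers, device_capacity=16.0))
# → [[0, 1], [2, 3], [4, 5, 6]]
```

Each group of layers would then be loaded onto its own device, with activations passed between devices at the group boundaries, which is the essence of model parallelism.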
However, distributed training is not a silver bullet that solves...