Summary
In this chapter, you learned that distributed training is used to accelerate the training process and to train models that do not fit into a single device’s memory. Although going distributed can address both cases, we should consider applying performance improvement techniques before turning to distribution.
We can perform distributed training by adopting either the model parallelism or the data parallelism strategy. The former employs different paradigms to divide the model computation among multiple computing resources, while the latter creates model replicas that are trained on chunks of the training dataset, as illustrated in the sketch below.
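As a minimal illustration of the data parallelism strategy, the sketch below wraps a toy model with PyTorch's DistributedDataParallel so that each process holds a replica and gradients are synchronized during the backward pass. The linear model, tensor shapes, and the gloo backend are assumptions chosen for a simple CPU-only example; they are not taken from this chapter.

    import torch
    import torch.nn as nn
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # Initialize the default process group; the communication backend
        # ("gloo" here, an assumption suitable for CPU training) is one of
        # the third-party components PyTorch relies on.
        dist.init_process_group(backend="gloo")
        print(f"Running replica on rank {dist.get_rank()}")

        # Hypothetical toy model used only for illustration.
        model = nn.Linear(10, 1)

        # DDP creates one model replica per process and averages gradients
        # across replicas, which is the data parallelism strategy described above.
        ddp_model = DDP(model)

        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
        loss_fn = nn.MSELoss()

        # Each process would train on its own chunk of the dataset;
        # random tensors stand in for that chunk here.
        inputs = torch.randn(32, 10)
        targets = torch.randn(32, 1)

        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), targets)
        loss.backward()   # gradients are synchronized across replicas here
        optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()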
We also learned that PyTorch relies on third-party components such as communication backends and program launchers to execute the distributed training process.
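For instance, the torchrun program launcher can start one process per replica and set the environment variables that the process group initialization in the previous sketch reads; the script name and the process count below are hypothetical.

    torchrun --nproc_per_node=2 train_script.py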
In the next chapter, we will learn how to run the distributed training process on multiple CPUs located in a single machine.