Distributed training on PyTorch
This section introduces the basic workflow for implementing distributed training on PyTorch, as well as the components used in this process.
Basic workflow
Generally speaking, the basic workflow for implementing distributed training on PyTorch comprises the steps illustrated in Figure 8.14:
Figure 8.14 – Basic workflow to implement distributed training in PyTorch
Let’s look at each step in more detail.
Note
The complete code shown in this section is available at https://github.com/PacktPublishing/Accelerate-Model-Training-with-PyTorch-2.X/blob/main/code/chapter08/pytorch_ddp.py.
Initialize and destroy the communication group
The communication group is the logical entity that PyTorch uses to define and control the distributed environment. Therefore, the first step in coding distributed training is to initialize a communication group. This step is performed by instantiating an object...
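For reference, the following is a minimal sketch of this step, assuming a single-node setup that uses environment-variable rendezvous; the setup and cleanup function names and the MASTER_ADDR/MASTER_PORT values are illustrative, and the complete script linked in the preceding note shows the actual implementation used in this chapter:

import os
import torch.distributed as dist

def setup(rank, world_size):
    # Rendezvous information read by init_process_group when the default
    # "env://" initialization method is used (illustrative values).
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")
    # Initialize the communication group; "gloo" works on CPU,
    # while "nccl" is the usual choice for multi-GPU training.
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

def cleanup():
    # Destroy the communication group once training has finished.
    dist.destroy_process_group()

Each process in the distributed environment calls setup with its own rank and the total number of processes (world_size), and calls cleanup after training completes.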