Implementing distributed training on multiple machines
This section shows how to implement and run distributed training on multiple machines, using Open MPI as the launch provider and NCCL as the communication backend. Let’s start by introducing Open MPI.
Getting introduced to Open MPI
MPI stands for Message Passing Interface, a standard that specifies a set of communication routines, data types, events, and operations used to implement distributed-memory applications. MPI is so relevant to the HPC industry that it is governed and maintained by a forum composed of distinguished scientists, researchers, and professionals from around the globe.
Note
You can find more information about MPI at this link: https://www.mpi-forum.org/
Therefore, MPI, strictly speaking, is not software; it is a standard specification that can be used to implement a piece of software, a tool, or a library. Like non-proprietary programming languages such as C and Python, MPI also has many implementations, Open MPI being one of the best known.
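To make this concrete, the following is a minimal sketch of the kind of communication routine the MPI standard specifies. It uses mpi4py, a Python binding that sits on top of an underlying MPI implementation such as Open MPI; the script name and the values exchanged are illustrative only:

# Minimal sketch using mpi4py on top of an MPI implementation such as
# Open MPI (the values exchanged here are illustrative only).
from mpi4py import MPI

comm = MPI.COMM_WORLD      # communicator spanning every launched process
rank = comm.Get_rank()     # unique ID of this process, from 0 to size - 1
size = comm.Get_size()     # total number of processes in the communicator

# allreduce: each process contributes a value, and every process receives
# the combined result (here, the sum) -- the same collective pattern used
# to average gradients in distributed training.
local_value = rank + 1
total = comm.allreduce(local_value, op=MPI.SUM)
print(f"rank {rank} of {size}: global sum = {total}")

Launched with Open MPI as the provider, for example with mpirun -np 4 python allreduce_demo.py (a hypothetical script name), the same code runs unchanged under any other MPI implementation, which is precisely the point of having a standard.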