Implementing distributed training on multiple CPUs
This section shows how to implement and run distributed training on multiple CPUs using Gloo, a simple yet powerful communication backend.
The Gloo communication backend
In Chapter 8, Distributed Training at a Glance, we learned that PyTorch relies on backends to control the communication among the devices and machines involved in distributed training.
The most basic communication backend supported by PyTorch is called Gloo. This backend ships with PyTorch by default and does not require any particular configuration. Gloo is a collective communication library created by Facebook, and it is now an open-source project released under the BSD license.
Note
You can find the source code of Gloo at http://github.com/facebookincubator/gloo.
Because Gloo is very simple to use and is available in PyTorch by default, it is the natural first option for running distributed training in an environment comprising only CPUs and machines...
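To make this concrete, the following is a minimal sketch of initializing a CPU-only process group with the Gloo backend. The script name, the default address and port values, and the two-process launch command are illustrative assumptions, not a prescribed setup.

```python
import os
import torch.distributed as dist

def init_gloo_process_group():
    # These variables are normally set by the launcher (e.g., torchrun);
    # the defaults below are illustrative fallbacks for local testing.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    # Select Gloo as the communication backend; it needs no extra
    # configuration and works on CPU-only machines.
    dist.init_process_group(backend="gloo")

    rank = dist.get_rank()               # this process's index in the group
    world_size = dist.get_world_size()   # total number of processes
    print(f"Process {rank} of {world_size} initialized with the Gloo backend")

if __name__ == "__main__":
    init_gloo_process_group()
    dist.destroy_process_group()
```

Launched with, for example, `torchrun --nproc_per_node=2 gloo_init.py` (a hypothetical script name), each process would report its rank and the world size, confirming that the Gloo process group was created without any additional configuration.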