Distributed Training at a Glance
When we face a complex problem in real life, we usually try to solve it by dividing it into smaller parts that are easier to handle. By combining the partial solutions to these smaller parts, we reach the final solution. This strategy, called divide and conquer, is frequently used to solve computational tasks, and it can be seen as the basis of parallel and distributed computing.
It turns out that this idea of dividing a big problem into small pieces also comes in handy for accelerating the training of complex models. When a single resource is not enough to train the model in a reasonable time, the only way out is to break down the training process and spread it across multiple resources. In other words, we need to distribute the training process.
Here is what you will learn as part of this chapter:
- The basic concepts of distributed training ...