Summary
In this chapter, we learned about two classes of parallelism: data parallelism and task parallelism. Data parallelism suits computations that can be performed independently on partitions of a dataset. The dataset is split into partitions, and each partition is processed by a different worker process. Task parallelism, on the other hand, divides a set of similar or different tasks among the worker processes. In either case, Amdahl's law states that the maximum improvement in speed that can be achieved by parallelizing code is limited by the proportion of that code that can be parallelized.
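As a reminder of what that limit looks like, Amdahl's law is commonly written as follows. The symbols are introduced here for illustration only: p is the fraction of the running time that can be parallelized and N is the number of worker processes.

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

For example, if 90% of the running time can be parallelized (p = 0.9), the speedup can never exceed 1 / (1 - 0.9) = 10, no matter how many workers are added.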
R supports both types of parallelism through the parallel package. We learned how to implement both data parallel and task parallel algorithms using socket-based clusters and forked clusters. We also learned how to run tasks in parallel on a cluster of computers using socket-based clusters.
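The following minimal sketch recaps these ideas with the parallel package. It is not taken from the chapter's examples: the workload (square), the input vector, and the choice of two workers are hypothetical, picked only to keep the code short.

```r
library(parallel)

# Hypothetical workload and data, for illustration only.
square <- function(x) x^2
inputs <- 1:8

# Data parallelism on a socket-based cluster: the inputs are split
# among two worker processes, and each worker applies the same function.
cl <- makeCluster(2)
socket_result <- parLapply(cl, inputs, square)

# Task parallelism on the same cluster: each list element is a
# different task (function), and all tasks receive the same data.
tasks <- list(mean, sum)
task_result <- parLapply(cl, tasks, function(f, x) f(x), x = inputs)
stopCluster(cl)

# Data parallelism with forked processes. Forking is not available
# on Windows, so this sketch falls back to a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
fork_result <- mclapply(inputs, square, mc.cores = cores)
```

Note that the data is passed to the workers as an explicit argument (x = inputs). Socket-based workers start as fresh R sessions and do not see variables from the master session unless they are passed in or exported, whereas forked workers inherit a copy of the parent's workspace.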
The examples in this chapter demonstrated that the improvement in performance by parallelizing...