You're reading from Essential PySpark for Scalable Data Analytics A beginner's guide to harnessing the power and ease of PySpark 3

Product type Paperback

Published in Oct 2021

Publisher Packt

ISBN-13 9781800568877

Length 322 pages

Edition 1st Edition

Languages

Python

Tools

PySpark

Concepts

Big Data

Author (1):

Sreeram Nudurupati

View More author details

Table of Contents (19) Chapters

Preface

1. Section 1: Data Engineering

2. Chapter 1: Distributed Computing Primer FREE CHAPTER

3. Chapter 2: Data Ingestion

4. Chapter 3: Data Cleansing and Integration

5. Chapter 4: Real-Time Data Analytics

6. Section 2: Data Science

7. Chapter 5: Scalable Machine Learning with PySpark

8. Chapter 6: Feature Engineering – Extraction, Transformation, and Selection

9. Chapter 7: Supervised Machine Learning

10. Chapter 8: Unsupervised Machine Learning

11. Chapter 9: Machine Learning Life Cycle Management

12. Chapter 10: Scaling Out Single-Node Machine Learning Using PySpark

13. Section 3: Data Analysis

14. Chapter 11: Data Visualization with PySpark

15. Chapter 12: Spark SQL Primer

16. Chapter 13: Integrating External Tools with Spark SQL

17. Chapter 14: The Data Lakehouse

18. Other Books You May Enjoy

Distributed Computing

In this section, you will learn about Distributed Computing, the need for it, and how you can use it to process very large amounts of data in a quick and efficient manner.

Introduction to Distributed Computing

Distributed Computing is a class of computing techniques where we use a group of computers as a single unit to solve a computational problem instead of just using a single machine.

In data analytics, when the amount of data becomes too large to fit in a single machine, we can either split the data into smaller chunks and process it on a single machine iteratively, or we can process the chunks of data on several machines in parallel. While the former gets the job done, it might take longer to process the entire dataset iteratively; the latter technique gets the job completed in a shorter period of time by using multiple machines at once.

There are different kinds of Distributed Computing techniques; however, for data analytics, one popular technique is Data Parallel Processing.

Data Parallel Processing

Data Parallel Processing involves two main parts:

The actual data that needs to be processed
The piece of code or business logic that needs to be applied to the data in order to process it

We can process large amounts of data by splitting it into smaller chunks and processing them in parallel on several machines. This can be done in two ways:

First, bring the data to the machine where our code is running.
Second, take our code to where our data is actually stored.

One drawback of the first technique is that as our data sizes become larger, the amount of time it takes to move data also increases proportionally. Therefore, we end up spending more time moving data from one system to another and, in turn, negating any efficiency gained by our parallel processing system. We also find ourselves creating multiple copies of data during data replication.

The second technique is far more efficient because instead of moving large amounts of data, we can easily move a few lines of code to where our data actually resides. This technique of moving code to where the data resides is referred to as Data Parallel Processing. This Data Parallel Processing technique is very fast and efficient, as we save the amount of time that was needed earlier to move and copy data across different systems. One such Data Parallel Processing technique is called the MapReduce paradigm.

Data Parallel Processing using the MapReduce paradigm

The MapReduce paradigm breaks down a Data Parallel Processing problem into three main stages:

The Map stage
The Shuffle stage
The Reduce stage

The Map stage takes the input dataset, splits it into (key, value) pairs, applies some processing on the pairs, and transforms them into another set of (key, value) pairs.

The Shuffle stage takes the (key, value) pairs from the Map stage and shuffles/sorts them so that pairs with the same key end up together.

The Reduce stage takes the resultant (key, value) pairs from the Shuffle stage and reduces or aggregates the pairs to produce the final result.

There can be multiple Map stages followed by multiple Reduce stages. However, a Reduce stage only starts after all of the Map stages have been completed.

Let's take a look at an example where we want to calculate the counts of all the different words in a text document and apply the MapReduce paradigm to it.

The following diagram shows how the MapReduce paradigm works in general:

Figure 1.1 – Calculating the word count using MapReduce

The previous example works in the following manner:

In Figure 1.1, we have a cluster of three nodes, labeled M1, M2, and M3. Each machine includes a few text files containing several sentences in plain text. Here, our goal is to use MapReduce to count all of the words in the text files.
We load all the text documents onto the cluster; each machine loads the documents that are local to it.
The Map Stage splits the text files into individual lines and further splits each line into individual words. Then, it assigns each word a count of 1 to create a (word, count) pair.
The Shuffle Stage takes the (word, count) pairs from the Map stage and shuffles/sorts them so that word pairs with the same keyword end up together.
The Reduce Stage groups all keywords together and sums their counts to produce the final count of each individual word.

The MapReduce paradigm was popularized by the Hadoop framework and was pretty popular for processing big data workloads. However, the MapReduce paradigm offers a very low-level API for transforming data and requires users to have proficient knowledge of programming languages such as Java. Expressing a data analytics problem using Map and Reduce is not very intuitive or flexible.

MapReduce was designed to run on commodity hardware, and since commodity hardware was prone to failures, resiliency to hardware failures was a necessity. MapReduce achieves resiliency to hardware failures by saving the results of every stage to disk. This round-trip to disk after every stage makes MapReduce relatively slow at processing data because of the slow I/O performance of physical disks in general. To overcome this limitation, the next generation of the MapReduce paradigm was created, which made use of much faster system memory, as opposed to disks, to process data and offered much more flexible APIs to express data transformations. This new framework is called Apache Spark, and you will learn about it in the next section and throughout the remainder of this book.

Important note

In Distributed Computing, you will often encounter the term cluster. A cluster is a group of computers all working together as a single unit to solve a computing problem. The primary machine of a cluster is, typically, termed the Master Node, which takes care of the orchestration and management of the cluster, and secondary machines that actually carry out task execution are called Worker Nodes. A cluster is a key component of any Distributed Computing system, and you will encounter these terms throughout this book.