Packt+ | Advance your knowledge in tech

You're reading from Hands-On Big Data Analytics with PySpark Analyze large datasets and discover techniques for testing, immunizing, and parallelizing Spark jobs

Product type Paperback

Published in Mar 2019

Publisher Packt

ISBN-13 9781838644130

Length 182 pages

Edition 1st Edition

Languages

Python

Tools

PySpark

Concepts

Big Data

Authors (3):

James Cross

Bartłomiej Potaczek

Rudy Lai

View More author details

Before we start with installing PySpark, which is the Python interface for Spark, let's go through some core concepts in Spark and PySpark. Spark is the latest big data tool from Apache, which can be found by simply going to http://spark.apache.org/. It's a unified analytics engine for large-scale data processing. This means that, if you have a lot of data, you can feed that data into Spark to create some analytics at a good speed. If we look at the running times between Hadoop and Spark, Spark is more than a hundred times faster than Hadoop. It is very easy to use because there are very good APIs for use with Spark.

The four major components of the Spark platform are as follows:

Spark SQL: A clearing language for Spark
Spark Streaming: Allows you to feed in real-time streaming data
MLlib (machine learning): The machine learning library for Spark
GraphX (graph): The graphing library for Spark

The core concept in Spark is an RDD, which is similar to the pandas DataFrame, or a Python dictionary or list. It is a way for Spark to store large amounts of data on the infrastructure for us. The key difference of an RDD versus something that is in your local memory, such as a pandas DataFrame, is that an RDD is distributed across many machines, but it appears like one unified dataset. What this means is, if you have large amounts of data that you want to operate on in parallel, you can put it in an RDD and Spark will handle parallelization and the clustering of the data for you.

Spark has three different interfaces, as follows:

Scala
Java
Python

Python is similar to PySpark integration, which we will cover soon. For now, we will import some libraries from the PySpark package to help us work with Spark. The best way for us to understand Spark is to look at an example, as shown in the following screenshot:

lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)

In the preceding code, we have created a new variable called lines by calling SC.textFile ("data.txt"). sc is our Python objects that represent our Spark cluster. A Spark cluster is a series of instances or cloud computers that store our Spark processes. By calling a textFile constructor and feeding in data.text, we have potentially fed in a large text file and created an RDD just using this one line. In other words, what we are trying to do here is to feed a large text file into a distributed cluster and Spark, and Spark handles this clustering for us.

In line two and line three, we have a MapReduce function. In line two, we have mapped the length function using a lambda function to each line of data.text. In line three, we have called a reduction function to add all lineLengths together to produce the total length of the documents. While Python's lines is a variable that contains all the lines in data.text, under the hood, Spark is actually handling the distribution of fragments of data.text in two different instances on the Spark cluster, and is handling the MapReduce computation over all of these instances.

#Register the DataFrame as a SQL temporary view
df.CreateOrReplaceTempView("people")

sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()

#+----+-------+
#| age| name|
#+----+-------+
#+null|Jackson|
#| 30| Martin|
#| 19| Melvin|
#+----|-------|

# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()

#+----+-------+
#| age| name|
#+----+-------+
#+null|Jackson|
#| 30| Martin|
#| 19| Melvin|
#+----|-------|