What do you get with a Packt Subscription?

Free for first 7 days. ₹800 p/m after that. Cancel any time!

Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!

50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.

Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.

Thousands of reference materials covering every tech concept you need to stay up to date.

Abstracting Data with RDDs

In this chapter, we will cover how to work with Apache Spark Resilient Distributed Datasets. You will learn the following recipes:

Creating RDDs
Reading data from files
Overview of RDD transformations
Overview of RDD actions
Pitfalls of using RDDs

Reading data from files

For this recipe, we will create an RDD by reading a local file in PySpark. To create RDDs in Apache Spark, you first need to install Spark, as noted in the previous chapter. You can run these code samples in the PySpark shell or in a Jupyter notebook. Note that while this recipe is specific to reading local files, a similar syntax applies to Hadoop, AWS S3, Azure WASBs, and Google Cloud Storage:

Storage type	Example
Local files	`sc.textFile('/local folder/filename.csv')`
Hadoop HDFS	`sc.textFile('hdfs://folder/filename.csv')`
AWS S3 (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html)	`sc.textFile('s3://bucket/folder/filename.csv')`
Azure WASBs (https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage)	`sc.textFile('wasb://bucket/folder/filename...`
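
The local-file row of the table can be sketched as follows. This is a minimal illustration rather than the book's own listing: the helper names (`make_local_context`, `read_lines`, `preview`) are hypothetical, the `local[*]` master is an assumption for a single-machine setup, and the file path is the placeholder from the table. In the PySpark shell a context named `sc` already exists, so only `read_lines` and `preview` would be needed there.

```python
def make_local_context(app_name="read-file-sketch"):
    """Create (or reuse) a SparkContext on the local machine.

    Assumption: pyspark is installed. It is imported lazily so the
    helpers below can be defined before Spark is available.
    """
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[*]").setAppName(app_name)
    return SparkContext.getOrCreate(conf)


def read_lines(sc, path, min_partitions=None):
    """Return an RDD of strings, one element per line of the file at `path`."""
    return sc.textFile(path, minPartitions=min_partitions)


def preview(sc, path, n=5):
    """Return the first `n` lines of the file as a plain Python list."""
    return read_lines(sc, path).take(n)


# Example usage in a standalone script (placeholder path from the table):
# sc = make_local_context()
# print(preview(sc, "/local folder/filename.csv"))
```

`textFile` is lazy: the file is not read until an action such as `take` runs. For the other storage types, only the path string passed to `read_lines` would change, provided the cluster is configured with the matching connector.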

Key benefits

Perform effective data processing, machine learning, and analytics using PySpark

Overcome challenges in developing and deploying Spark solutions using Python

Explore recipes for efficiently combining Python and Apache Spark to process data

Description

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. The PySpark Cookbook presents effective and time-saving recipes for leveraging the power of Python and putting it to use in the Spark ecosystem. You’ll start by learning the Apache Spark architecture and how to set up a Python environment for Spark. You’ll then get familiar with the modules available in PySpark and start using them effortlessly. In addition to this, you’ll discover how to abstract data with RDDs and DataFrames, and understand the streaming capabilities of PySpark. You’ll then move on to using ML and MLlib in order to solve any problems related to the machine learning capabilities of PySpark and use GraphFrames to solve graph-processing problems. Finally, you will explore how to deploy your applications to the cloud using the spark-submit command. By the end of this book, you will be able to use the Python API for Apache Spark to solve any problems associated with building data-intensive applications.

What you will learn

Configure a local instance of PySpark in a virtual environment

Install and configure Jupyter in local and multi-node environments

Create DataFrames from JSON and a dictionary using pyspark.sql

Explore regression and clustering models available in the ML module

Use DataFrames to transform data used for modeling

Connect to PubNub and perform aggregations on streams

Dimitri Shvorob Oct 02, 2020

Wishing to learn Spark, I signed up for the Databricks Associate Spark Developer certification exam - Python flavor - and ordered a number of Spark books off Amazon, avoiding Scala-based titles and older titles pre-dating the DataFrame API. I ended up with the following list:

"Learning PySpark" by Drabas and Lee, published by Packt in 2017
"Frank Kane's Taming Big Data with Apache Spark and Python" by (no surprise) Kane, Packt, 2017
"Data Analytics with Spark Using Python" by Aven, Addison Wesley, 2018
"PySpark Cookbook" by (once again) Drabas and Lee, Packt, 2018
"Developing Spark Applications with Python" by Morera and Campos, self-published in 2019
"PySpark Recipes" by Mishra, Apress, 2017
"Learning Spark" by Damji et al., O'Reilly, 2020
"Beginning Apache Spark Using Azure Databricks" by Ilijason, Apress, 2020
"Spark: The Definitive Guide" by Chambers and Zaharia, O'Reilly, 2018

Databricks themselves point to "Learning Spark" and "Spark: The Definitive Guide" as preparation aids, so I started with these, skimming both books - and strongly preferring "The Definitive Guide" - and then took a look at the others.

"PySpark Cookbook" is an easy "pass". It is not as low-quality as the books by Mishra or by Morera and Campos, but it is still a low-quality, low-value-added affair of the type routinely churned out by Packt. Much of the page count is spent on setup matters, where directions may be out of date - then when we get to Spark, a lot of space is taken up by the old RDD interface. Strikingly, Spark SQL gets all of 3 pages (pp. 117-119). Chapter 4 has some more interesting content - several non-trivial data-manipulation tasks that actually merit the "recipe" label - but with that, "core" Spark content ends, and the authors get into streaming, ML and graphs.

It's important to remember that Packt pages have less text than pages of books from other publishers: here, 300 "Packt pages" translate to maybe 150 "normal" pages, and that is not a lot.

Skip this book, and consider the Databricks-based introduction by Ilijason and the comprehensive but very accessible reference by Chambers and Zaharia.

Amazon Verified review

mmays Apr 17, 2022

Pretty good text, and I like the approach the author takes, but the Kindle version is really awful for the illegible graphics. I've tried them on a Kindle reader, Kindle cloud in a browser, copy and paste, no joy, they are just too small and illegible if magnified.

Victor Tkachenko Jul 06, 2018

This is plagiarism. The guys simply copied all the info from the Wiki and are trying to make money on it. Shame. No explanation of the code as far as I am concerned. Don't buy it, you can get more info from Googling...

PySpark Cookbook: Over 60 recipes for implementing big data processing and analytics using Apache Spark and Python
