Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
PySpark Cookbook
PySpark Cookbook

PySpark Cookbook: Over 60 recipes for implementing big data processing and analytics using Apache Spark and Python

eBook
R$80 R$196.99
Paperback
R$245.99
Subscription
Free Trial
Renews at R$50p/m

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Table of content icon View table of contents Preview book icon Preview Book

PySpark Cookbook

Abstracting Data with RDDs

In this chapter, we will cover how to work with Apache Spark Resilient Distributed Datasets. You will learn the following recipes:

  • Creating RDDs
  • Reading data from files
  • Overview of RDD transformations
  • Overview of RDD actions
  • Pitfalls of using RDDs

Introduction

Resilient Distributed Datasets (RDDs) are collections of immutable JVM objects that are distributed across an Apache Spark cluster. Please note that if you are new to Apache Spark, you may want to initially skip this chapter as Spark DataFrames/Datasets are both significantly easier to develop and typically have faster performance. More information on Spark DataFrames can be found in the next chapter.

An RDD is the most fundamental dataset type of Apache Spark; any action on a Spark DataFrame eventually gets translated into a highly optimized execution of transformations and actions on RDDs (see the paragraph on catalyst optimizer in Chapter 3, Abstracting Data with DataFrames, in the Introduction section). 

Data in an RDD is split into chunks based on a key and then dispersed across all the executor nodes. RDDs are highly resilient, that is, there are able...

Creating RDDs

For this recipe, we will start creating an RDD by generating the data within the PySpark. To create RDDs in Apache Spark, you will need to first install Spark as shown in the previous chapter. You can use the PySpark shell and/or Jupyter notebook to run these code samples.

Getting ready 

We require a working installation of Spark. This means that you would have followed the steps outlined in the previous chapter. As a reminder, to start PySpark shell for your local Spark cluster, you can run this command:

./bin/pyspark --master local[n]

Where n is the number of cores. 

How to do it...

...

Reading data from files

For this recipe, we will create an RDD by reading a local file in PySpark. To create RDDs in Apache Spark, you will need to first install Spark as noted in the previous chapter. You can use the PySpark shell and/or Jupyter notebook to run these code samples. Note that while this recipe is specific to reading local files, a similar syntax can be applied for Hadoop, AWS S3, Azure WASBs, and/or Google Cloud Storage:

Storage type Example
Local files sc.textFile('/local folder/filename.csv')
Hadoop HDFS sc.textFile('hdfs://folder/filename.csv')
AWS S3 (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html) sc.textFile('s3://bucket/folder/filename.csv')
Azure WASBs (https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage) sc.textFile('wasb://bucket/folder/filename...

Overview of RDD transformations

As noted in preceding sections, there are two types of operation that can be used to shape data in an RDD: transformations and actions. A transformation, as the name suggests, transforms one RDD into another. In other words, it takes an existing RDD and transforms it into one or more output RDDs. In the preceding recipes, we had used a map() function, which is an example of a transformation to split the data by its tab-delimiter.

Transformations are lazy (unlike actions). They only get executed when an action is called on an RDD. For example, calling the count() function is an action; more information is available in the following section on actions.

Getting ready

This recipe...

Overview of RDD actions

As noted in preceding sections, there are two types of Apache Spark RDD operations: transformations and actions. An action returns a value to the driver after running a computation on the dataset, typically on the workers. In the preceding recipes, the take() and count() RDD operations are examples of actions.

Getting ready

This recipe will be reading a tab-delimited (or comma-delimited) file, so please ensure that you have a text (or CSV) file available. For your convenience, you can download the airport-codes-na.txt and departuredelays.csv files from learning http://bit.ly/2nroHbh. Ensure your local Spark cluster can access this file (~/data/flights/airport...

Pitfalls of using RDDs

The key concern associated with using RDDs is that they can take a lot of time to master. The flexibility of running functional operators such as map, reduce, and shuffle allows you to perform a wide variety of transformations against your data. But with this power comes great responsibility, and it is potentially possible to write code that is inefficient, such as the use of GroupByKey; more information can be found in Avoid GroupByKey at https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html.

Generally, you will typically have slower performance when using RDDs compared to Spark DataFrames, as noted in the following diagram:

Source: Introducing DataFrames in Apache Spark for Large Scale Data Science at https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark...
Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Perform effective data processing, machine learning, and analytics using PySpark
  • Overcome challenges in developing and deploying Spark solutions using Python
  • Explore recipes for efficiently combining Python and Apache Spark to process data

Description

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. The PySpark Cookbook presents effective and time-saving recipes for leveraging the power of Python and putting it to use in the Spark ecosystem. You’ll start by learning the Apache Spark architecture and how to set up a Python environment for Spark. You’ll then get familiar with the modules available in PySpark and start using them effortlessly. In addition to this, you’ll discover how to abstract data with RDDs and DataFrames, and understand the streaming capabilities of PySpark. You’ll then move on to using ML and MLlib in order to solve any problems related to the machine learning capabilities of PySpark and use GraphFrames to solve graph-processing problems. Finally, you will explore how to deploy your applications to the cloud using the spark-submit command. By the end of this book, you will be able to use the Python API for Apache Spark to solve any problems associated with building data-intensive applications.

Who is this book for?

The PySpark Cookbook is for you if you are a Python developer looking for hands-on recipes for using the Apache Spark 2.x ecosystem in the best possible way. A thorough understanding of Python (and some familiarity with Spark) will help you get the best out of the book.

What you will learn

  • Configure a local instance of PySpark in a virtual environment
  • Install and configure Jupyter in local and multi-node environments
  • Create DataFrames from JSON and a dictionary using pyspark.sql
  • Explore regression and clustering models available in the ML module
  • Use DataFrames to transform data used for modeling
  • Connect to PubNub and perform aggregations on streams
Estimated delivery fee Deliver to Brazil

Standard delivery 10 - 13 business days

R$63.95

Premium delivery 3 - 6 business days

R$203.95
(Includes tracking information)

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Jun 29, 2018
Length: 330 pages
Edition : 1st
Language : English
ISBN-13 : 9781788835367
Category :
Languages :
Concepts :

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Estimated delivery fee Deliver to Brazil

Standard delivery 10 - 13 business days

R$63.95

Premium delivery 3 - 6 business days

R$203.95
(Includes tracking information)

Product Details

Publication date : Jun 29, 2018
Length: 330 pages
Edition : 1st
Language : English
ISBN-13 : 9781788835367
Category :
Languages :
Concepts :

Packt Subscriptions

See our plans and pricing
Modal Close icon
R$50 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
R$500 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just R$25 each
Feature tick icon Exclusive print discounts
R$800 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just R$25 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total R$ 663.97
Learning PySpark
R$272.99
PySpark Cookbook
R$245.99
Hands-On Big Data Analytics with PySpark
R$144.99
Total R$ 663.97 Stars icon
Banner background image

Table of Contents

8 Chapters
Installing and Configuring Spark Chevron down icon Chevron up icon
Abstracting Data with RDDs Chevron down icon Chevron up icon
Abstracting Data with DataFrames Chevron down icon Chevron up icon
Preparing Data for Modeling Chevron down icon Chevron up icon
Machine Learning with MLlib Chevron down icon Chevron up icon
Machine Learning with the ML Module Chevron down icon Chevron up icon
Structured Streaming with PySpark Chevron down icon Chevron up icon
GraphFrames – Graph Theory with PySpark Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Half star icon Empty star icon Empty star icon Empty star icon 1.7
(3 Ratings)
5 star 0%
4 star 0%
3 star 0%
2 star 66.7%
1 star 33.3%
Dimitri Shvorob Oct 02, 2020
Full star icon Full star icon Empty star icon Empty star icon Empty star icon 2
Wishing to learn Spark, I signed up for Databricks Associate Spark Developer certification exam - Python flavor - and ordered off Amazon a number of Spark books, avoiding Scala-based titles, and older titles pre-dating the DataFrame API. I ended up with the following list:"Learning PySpark" by Drabas and Lee, published by Packt in 2017"Frank Kane's Taming Big Data with Apache Spark and Python" by (no surprise) Kane, Packt, 2017"Data Analytics with Spark Using Python" by Aven, Addison Wesley, 2018"PySpark Cookbook" by (once again) Drabas and Lee, Packt, 2018"Developing Spark Applications with Python" by Morera and Campos, self-published in 2019"PySpark Recipes" by Mishra, Apress, 2017"Learning Spark" by Damjil et al., O'Reilly, 2020"Beginning Apache Spark Using Azure Databricks" by Ilijason, Apress, 2020"Spark: The Definitive Guide" by Chambers and Zaharia, O'Reilly, 2018Databricks themselves point to "Learning Spark" and "Spark: The Definitive Guide" as preparation aids, so I started with these, skimming both books - and strongly preferring "The Definitive Guide" - and then took a look at the others."PySpark Cookbook" is an easy "pass". It is not as low-quality as the books by Mishra or by Morera and Campo, but it is still a low-quality, low-value-added affair of the type routinely churned out by Packt. Much of the page count is spent on setup matters, where directions may be out of date - then when we get to Spark, a lot of space is taken up by the old RDD interface. Strikingly, Spark SQL gets all of 3 pages (pp. 117-119). Chapter 4 has some more interesting content - several non-trivial data-manipulation tasks that actually merit the "recipe" label - but with that, "core" Spark content ends, and the authors get into streaming, ML and graphs. It's important to remember that Packt pages have less text than pages of books from other publishers: here, 300 "Packt pages" translate to maybe 150 "normal" pages, and that is not a lot.Skip this book, and consider the Databricks-based introduction by Ilijason and the comprehensive but very accessible reference by Chambers and Zaharia.
Amazon Verified review Amazon
mmays Apr 17, 2022
Full star icon Full star icon Empty star icon Empty star icon Empty star icon 2
Pretty good text, and I like the approach the author takes, but the Kindle version is really awful for the illegible graphics. I've tried them on a Kindle reader, Kindle cloud in a browser, copy and paste, no joy, they are just too small and illegible if magnified.
Amazon Verified review Amazon
Victor Tkachenko Jul 06, 2018
Full star icon Empty star icon Empty star icon Empty star icon Empty star icon 1
This is a plagiary. Guys simply copied all info from the Wiki and trying to make money on it.Shame. No explanation of the code as far as I concern. Don't buy it, You can get more info from Googling...
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order? Chevron down icon Chevron up icon

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact [email protected] with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at [email protected] using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or material defect in book), Packt Publishing will not accept returns.

What is your returns and refunds policy? Chevron down icon Chevron up icon

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on [email protected] with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on [email protected] within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on [email protected] who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items.(damaged, defective or incorrect)
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on [email protected] within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged? Chevron down icon Chevron up icon

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use? Chevron down icon Chevron up icon

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela