Learning PySpark: Build data-intensive applications locally and deploy at scale using the combined powers of Python and Spark 2.0

Product type Paperback

Published in Feb 2017

Publisher Packt

ISBN-13 9781786463708

Length 274 pages

Edition 1st Edition

Languages

Python

Tools

Apache Spark

Concepts

Data Processing

Authors (2):

Denny Lee

Tomasz Drabas

Table of Contents (13 chapters)

Preface

1. Understanding Spark

2. Resilient Distributed Datasets

3. DataFrames

4. Prepare Data for Modeling

5. Introducing MLlib

6. Introducing the ML Package

7. GraphFrames

8. TensorFrames

9. Polyglot Persistence with Blaze

10. Structured Streaming

11. Packaging Spark Applications

Index

Preface

It is estimated that in 2013 the whole world produced around 4.4 zettabytes of data; that is, 4.4 billion terabytes! By 2020, we (as the human race) are expected to produce ten times that. Data grows literally by the second, and so does the appetite for making sense of it. In 2004, Google employees Jeffrey Dean and Sanjay Ghemawat published the seminal paper MapReduce: Simplified Data Processing on Large Clusters. Since then, technologies leveraging the concept have grown very quickly, with Apache Hadoop initially being the most popular. This growth ultimately produced a Hadoop ecosystem that included abstraction layers such as Pig, Hive, and Mahout, all leveraging this simple concept of map and reduce.
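To make the idea of map and reduce concrete, here is a minimal sketch in plain Python, with no cluster or Spark involved. The sample `lines` and the helper names `map_line` and `merge_counts` are invented for illustration; a real MapReduce job would run the same two steps in parallel across many machines.

```python
from collections import Counter
from functools import reduce

# Hypothetical input: each string stands in for one record of a large dataset.
lines = ["spark is fast", "hadoop is popular", "spark is popular"]


def map_line(line):
    """Map step: turn a single line into its own word counts."""
    return Counter(line.split())


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


# Map every record independently, then fold the partial results together.
partial_counts = map(map_line, lines)
word_counts = reduce(merge_counts, partial_counts, Counter())

print(word_counts["spark"])  # 2
print(word_counts["is"])     # 3
```

Because each line is mapped on its own and the merge does not depend on the order of the partial results, the work can in principle be split across machines, which is what lets the approach scale.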

However, even though it is capable of chewing through petabytes of data daily, MapReduce is a fairly restricted programming framework. Also, most of its tasks require reading from and writing to disk. Seeing these drawbacks, in 2009 Matei Zaharia started working on Spark as part of his PhD. Spark was first released in 2012. Even though Spark is based on the same MapReduce concept, its advanced ways of dealing with data and organizing tasks make it up to 100x faster than Hadoop (for in-memory computations).

In this book, we will guide you through the latest incarnation of Apache Spark using Python. We will show you how to read structured and unstructured data, use some of the fundamental data types available in PySpark, build machine learning models, operate on graphs, read streaming data, and deploy your models in the cloud. Each chapter will tackle a different problem, and by the end of the book we hope you will be knowledgeable enough to solve other problems we did not have space to cover here.