Mastering Spark for Data Science

Lightning fast and scalable data science solutions

Product type: Paperback
Published: March 2017
Publisher: Packt
ISBN-13: 9781785882142
Length: 560 pages
Edition: 1st
Authors (5): David George, Matthew Hallett, Antoine Amend, Andrew Morgan, Albert Bifet
Table of Contents (15)

Preface
1. The Big Data Science Ecosystem
2. Data Acquisition
3. Input Formats and Schema
4. Exploratory Data Analysis
5. Spark for Geographic Analysis
6. Scraping Link-Based External Data
7. Building Communities
8. Building a Recommendation System
9. News Dictionary and Real-Time Tagging System
10. Story De-duplication and Mutation
11. Anomaly Detection on Sentiment Analysis
12. TrendCalculus
13. Secure Data
14. Scalable Algorithms

What this book covers

Chapter 1, The Big Data Science Ecosystem, introduces an approach, and an accompanying ecosystem, for achieving success with data at scale. It focuses on the data science tools and technologies used in later chapters, and introduces the environment and how to configure it appropriately. Additionally, it explains some of the non-functional considerations relevant to the overall data architecture and its long-term success.
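
To give a flavor of the environment setup, here is a minimal sketch of bootstrapping a Spark session; the application name and settings are illustrative assumptions, not the book's exact configuration:

import org.apache.spark.sql.SparkSession

// Entry point for the sketches that follow; all settings here are illustrative.
val spark = SparkSession.builder()
  .appName("mastering-spark-data-science")
  .master("local[*]")   // a cluster URL (for example, YARN) in production
  .config("spark.sql.shuffle.partitions", "200")
  .getOrCreate()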

Chapter 2, Data Acquisition, explains how to load data accurately into a data science platform, one of a data scientist's most important tasks. Rather than relying on uncontrolled, ad hoc processes, this chapter explains how to construct a general data ingestion pipeline in Spark that serves as a reusable component across many feeds of input data.
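
As a minimal sketch of that reusable-component idea (the path, format, and options are hypothetical):

import org.apache.spark.sql.{DataFrame, SparkSession}

// One ingest code path for every feed, parameterized by location and format.
def ingest(spark: SparkSession, path: String, format: String = "csv"): DataFrame =
  spark.read
    .format(format)
    .option("header", "true")   // feed-specific options could be passed in as well
    .load(path)

val feed = ingest(spark, "/data/in/gdelt/")   // hypothetical landing directory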

Chapter 3, Input Formats and Schema, demonstrates how to load data from its raw format into different schemas, enabling a variety of downstream analytics to be run over the same data. With this in mind, we look at the traditionally well-understood area of data schemas. We cover key areas of traditional database modeling and explain how some of these cornerstone principles still apply to Spark today. In addition, while honing our Spark skills, we analyze the GDELT data model and show how to store this large dataset in an efficient and scalable manner.
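
A sketch of the explicit-schema approach; the columns shown are an illustrative subset, not GDELT's full model:

import org.apache.spark.sql.types._

// An explicit schema keeps ingestion strict and documents the feed's contract.
val eventSchema = StructType(Seq(
  StructField("eventId",   LongType,   nullable = false),
  StructField("eventDate", StringType, nullable = true),
  StructField("actor",     StringType, nullable = true),
  StructField("goldstein", DoubleType, nullable = true)
))

val events = spark.read
  .schema(eventSchema)
  .option("sep", "\t")            // GDELT event files are tab-separated
  .csv("/data/in/gdelt/events/")  // hypothetical path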

Chapter 4, Exploratory Data Analysis, dispels a common misconception: that an EDA is only for discovering the statistical properties of a dataset and providing insights about how it can be exploited. In practice, this isn't the full story. A full EDA extends that idea to include a detailed assessment of the feasibility of using the data feed in production. It also requires us to understand how we would specify a production-grade data loading routine for the dataset, one that might run in a "lights-out" mode for many years. This chapter offers a rapid method for data quality assessment, using a "data profiling" technique to accelerate the process.
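
A minimal flavor of data profiling in Spark, assuming the events DataFrame sketched above:

import org.apache.spark.sql.functions._

// Quick profile: row count, per-column null counts, and numeric summaries.
val rowCount = events.count()
val nullCounts = events.select(
  events.columns.map(c => sum(when(col(c).isNull, 1).otherwise(0)).alias(c)): _*
)
nullCounts.show()
events.describe().show()   // count/mean/stddev/min/max per numeric column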

Chapter 5, Spark for Geographic Analysis, demonstrates how to get started with geographic processing, a powerful new use case for Spark. The aim of this chapter is to explain how data scientists can process geographic data, using Spark, to produce powerful map-based views of very large datasets. We demonstrate how to process spatio-temporal datasets easily via Spark's integration with GeoMesa, which helps turn Spark into a sophisticated geographic processing engine. The chapter then leverages this spatio-temporal data to apply machine learning with a view to predicting oil prices.
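
GeoMesa does the heavy lifting in the chapter itself; as a deliberately simplified stand-in, a spatio-temporal query can be approximated in plain Spark with a bounding box over a hypothetical geo DataFrame carrying lat, lon, and eventDate columns:

import org.apache.spark.sql.functions._

// Crude spatio-temporal filter: a box around London within a date range.
// Real geometry operations are what GeoMesa adds on top of this.
val londonRecent = geo
  .where(col("lat").between(51.3, 51.7) && col("lon").between(-0.5, 0.3))
  .where(col("eventDate") >= "2016-01-01")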

Chapter 6, Scraping Link-Based External Data, explains a common pattern for enhancing local data with external content found at URLs or over APIs, such as GDELT and Twitter. We offer a tutorial using the GDELT news index service as a source of news URLs, demonstrating how to build a web-scale news scanner that scrapes global breaking news of interest from the internet. We further explain how to use a specialist web-scraping component in a way that overcomes the challenges of scale.
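
The chapter relies on a specialist scraping component; as an illustrative stand-in, the widely used jsoup library can fetch and clean a single article (the URL is a placeholder):

import org.jsoup.Jsoup
import scala.util.Try

// Fetch a page and keep only its visible text; Try absorbs network failures.
def scrape(url: String): Option[String] =
  Try(
    Jsoup.connect(url)
      .userAgent("news-scanner/0.1")   // a polite, identifiable client string
      .timeout(5000)
      .get()
      .body()
      .text()
  ).toOption

val article = scrape("https://example.com/some-news-story")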

Chapter 7, Building Communities, addresses a common use case in data science and big data. With more and more people interacting, communicating, exchanging information, or simply sharing a common interest in different topics, the entire world can be represented as a graph. A data scientist must be able to detect communities, find influencers and top contributors, and detect possible anomalies.
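
As a minimal sketch of that graph view, using Spark's GraphX (the edge list is made up, and connected components stand in for the more refined community detection the chapter develops):

import org.apache.spark.graphx.{Edge, Graph}

// People are vertices, interactions are edges; connected components give a
// first, coarse notion of community membership.
val edges = spark.sparkContext.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(4L, 5L, 1)
))
val graph = Graph.fromEdges(edges, defaultValue = 0)
val communities = graph.connectedComponents().vertices   // (vertexId, componentId)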

Chapter 8, Building a Recommendation System, notes that if one were to choose an algorithm to showcase data science to the public, a recommendation system would certainly be in the frame. Today, recommendation systems are everywhere; their popularity is down to their versatility, usefulness, and broad applicability. In this chapter, we demonstrate how to recommend music content using raw audio signals.
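
The chapter works from raw audio; for contrast, here is a minimal sketch of the classic collaborative-filtering route using Spark ML's ALS (the ratings source and column names are hypothetical):

import org.apache.spark.ml.recommendation.ALS

// Matrix-factorization recommender over (user, track, rating) triples.
val ratings = spark.read.parquet("/data/ratings/")   // hypothetical path
val als = new ALS()
  .setUserCol("userId")
  .setItemCol("trackId")
  .setRatingCol("rating")
  .setRank(10)
val model = als.fit(ratings)
val top5  = model.recommendForAllUsers(5)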

Chapter 9, News Dictionary and Real-Time Tagging System, observes that while a hierarchical data warehouse stores data in folders of files, a typical Hadoop-based system relies on a flat architecture to store your data. Without proper data governance, or a clear understanding of what your data is all about, there is an undeniable risk of turning a data lake into a swamp, where an interesting dataset such as GDELT would be nothing more than a folder containing a vast number of unstructured text files. In this chapter, we describe an innovative way of labeling incoming GDELT data in an unsupervised way and in near real time.
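
The chapter's labeling is unsupervised; as a toy illustration of near-real-time tagging only, a Structured Streaming job can stamp arriving text with matches from a fixed dictionary (the dictionary, paths, and matching logic are all made up for this sketch):

import org.apache.spark.sql.functions._

// Tag incoming text files as they land, with the last matching dictionary term.
val dictionary = Seq("election", "oil", "protest")   // toy tag dictionary
val tagExpr = dictionary.foldLeft(lit(null)) { (acc, term) =>
  when(lower(col("value")).contains(term), term).otherwise(acc)
}
val tagged = spark.readStream
  .text("/data/in/news/")                            // hypothetical landing directory
  .withColumn("tag", tagExpr)
val query = tagged.writeStream
  .format("parquet")
  .option("path", "/data/out/tagged/")
  .option("checkpointLocation", "/data/chk/tagged/")
  .start()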

Chapter 10, Story De-duplication and Mutation, de-duplicates and indexes the GDELT database into stories, then tracks stories over time to understand the links between them, how they may mutate, and whether they could lead to any subsequent events in the near future. Core to this chapter are the Simhash technique for detecting near-duplicates and Random Indexing for building vectors of reduced dimensionality.
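
A minimal, self-contained sketch of the Simhash idea: each feature's hash votes on every bit, and the signs of the vote totals form the signature, so near-identical stories end up a small Hamming distance apart (the bit-mixing constant is our own simplification):

// Minimal Simhash: a 64-bit signature of majority votes across feature hashes.
def simhash(features: Seq[String]): Long = {
  val votes = new Array[Int](64)
  for (f <- features) {
    val mixed = f.hashCode.toLong * 0x9E3779B97F4A7C15L   // spread 32 bits over 64
    for (bit <- 0 until 64)
      if (((mixed >>> bit) & 1L) == 1L) votes(bit) += 1 else votes(bit) -= 1
  }
  votes.zipWithIndex.foldLeft(0L) { case (sig, (v, i)) =>
    if (v > 0) sig | (1L << i) else sig
  }
}

// Near-duplicates show a small Hamming distance between signatures.
def hamming(a: Long, b: Long): Int = java.lang.Long.bitCount(a ^ b)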

Chapter 11, Anomaly Detection on Sentiment Analysis, covers perhaps the most notable occurrence of 2016: the tense US presidential election and its eventual outcome, the election of President Donald Trump. The campaign will long be remembered, not least for its unprecedented use of social media and the passion it stirred up among users, most of whom made their feelings known through hashtags. In this chapter, instead of trying to predict the outcome itself, we aim to detect abnormal tweets during the US election using a real-time Twitter feed.
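
As a toy illustration only (not the chapter's model), abnormal tweets could be flagged by how far a precomputed sentiment score sits from the mean, over a hypothetical tweets DataFrame with a score column:

import org.apache.spark.sql.functions._

// Flag tweets whose sentiment score is more than three standard deviations out.
val stats = tweets.agg(avg("score").as("mu"), stddev("score").as("sigma")).first()
val (mu, sigma) = (stats.getDouble(0), stats.getDouble(1))
val anomalies = tweets.where(abs(col("score") - mu) > 3 * sigma)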

Chapter 12, TrendCalculus, takes on trends. Long before the concept of "what's trending" became a popular topic of study for data scientists, there was an older one that is still not well served by data science: that of trends. Presently, the analysis of trends, if it can be called that, is primarily carried out by people "eyeballing" time series charts and offering interpretations. But what is it that people's eyes are doing? This chapter describes an implementation, in Apache Spark, of a new algorithm for studying trends numerically: TrendCalculus.
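
One way to make that numeric, sketched in Spark over a hypothetical bars DataFrame of (window, high, low) rows: summarize each time window by its high and low, then take the trend as sgn(sgn(dH) + sgn(dL)), which is +1 for higher highs and higher lows, -1 for lower, and 0 when mixed. This is our reading of the core idea, not the book's full implementation:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Compare each window's high/low against the previous window's to sign the trend.
val w = Window.orderBy("window")
val trends = bars
  .withColumn("dH", col("high") - lag("high", 1).over(w))
  .withColumn("dL", col("low")  - lag("low", 1).over(w))
  .withColumn("trend", signum(signum(col("dH")) + signum(col("dL"))))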

Chapter 13, Secure Data, visits another of those often-overlooked fields: secure data; more specifically, how to protect your data and analytic results at all stages of the data life cycle. Throughout this book we visit many areas of data science, often straying into those that are not traditionally associated with a data scientist's core working knowledge, and securing data is one of them. Core to this chapter is the construction of a commercial-grade encryption codec for Spark.
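
A minimal sketch of the kind of symmetric encryption such a codec builds on, using the JDK's javax.crypto with AES-GCM; key management, which the chapter treats seriously, is deliberately omitted here:

import java.security.SecureRandom
import javax.crypto.Cipher
import javax.crypto.spec.{GCMParameterSpec, SecretKeySpec}

// AES-GCM encrypt; output is IV ++ ciphertext so decryption can recover the IV.
// The key must be 16, 24, or 32 bytes long.
def encrypt(key: Array[Byte], plain: Array[Byte]): Array[Byte] = {
  val iv = new Array[Byte](12)
  new SecureRandom().nextBytes(iv)
  val cipher = Cipher.getInstance("AES/GCM/NoPadding")
  cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
              new GCMParameterSpec(128, iv))
  iv ++ cipher.doFinal(plain)
}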

Chapter 14, Scalable Algorithms, explains why even basic algorithms, despite working at small scale, will often fail on big data. We'll see how to avoid issues when writing Spark jobs that run over massive datasets, and we'll learn about the structure of algorithms and how to write custom data science analytics that scale over petabytes of data. The chapter covers areas such as parallelization strategies, caching, shuffle strategies, garbage collection optimization, and probabilistic models, explaining how these can help you get the most out of the Spark paradigm.
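
Two of those habits in miniature, over a hypothetical records RDD of (key, count) pairs: aggregate before shuffling, and persist only what is reused:

import org.apache.spark.storage.StorageLevel

// reduceByKey combines map-side, shrinking the shuffle that groupByKey would inflate;
// MEMORY_AND_DISK spills instead of recomputing when the cache is under pressure.
val counts = records
  .reduceByKey(_ + _)
  .persist(StorageLevel.MEMORY_AND_DISK)

counts.count()   // materialize once, reuse thereafter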
