Mastering Spark for Data Science: Lightning fast and scalable data science solutions

Authors: Bifet, Morgan, Amend, Hallett, George, +1 more
$19.99 per month
Rating: 4 out of 5 (2 ratings)
Paperback Mar 2017 560 pages 1st Edition
eBook
$32.99 $47.99
Paperback
$60.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!
  • Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
  • 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
  • Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
  • Thousands of reference materials covering every tech concept you need to stay up to date.

Chapter 2. Data Acquisition

One of a data scientist's most important tasks is loading data into the data science platform. Rather than relying on uncontrolled, ad hoc processes, this chapter explains how to construct a general data ingestion pipeline in Spark that serves as a reusable component across many feeds of input data. We walk through a configuration and demonstrate how it delivers vital feed management information under a variety of running conditions.

Readers will learn how to construct a content register and use it to track all input loaded to the system and to deliver metrics on ingestion pipelines, so that these flows can be reliably run as an automated, lights-out process.

In this chapter, we will cover the following topics:

  • An introduction to the Global Database of Events, Language, and Tone (GDELT) dataset
  • Data pipelines
  • Universal ingestion framework
  • Real-time monitoring for new data
  • Receiving streaming data via Kafka
  • Registering new content and vaulting for tracking purposes...

Data pipelines

Even with the most basic of analytics, we always require some data. In fact, finding the right data is probably among the hardest problems to solve in data science (but that's a whole topic for another book!). We have already seen in the last chapter that the way in which we obtain our data can be as simple or complicated as is needed. In practice, we can break this decision down into two distinct areas: ad hoc and scheduled.

  • Ad hoc data acquisition is the most common method during prototyping and small-scale analytics, as it usually doesn't require any additional software to implement. The user simply downloads the data from its source as and when required. This is often a matter of clicking on a web link and storing the data somewhere convenient, although the data may still need to be versioned and secured.
  • Scheduled data acquisition is used in more controlled environments for large-scale and production analytics; there is also an excellent...

Content registry

We have seen in this chapter that data ingestion is an area that is often overlooked, and that its importance should not be underestimated. At this point, we have a pipeline that enables us to ingest data from a source, schedule that ingest, and direct the data to our repository of choice. But the story does not end there. Now that we have the data, we need to fulfil our data management responsibilities. Enter the content registry.

We're going to build an index of metadata related to the data we have ingested. The data itself will still be directed to storage (HDFS, in our example) but, in addition, we will store metadata about it, so that we can track what we've received and understand basic information such as when we received it, where it came from, how big it is, what type it is, and so on.
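As a rough illustration of the kind of record such a registry might hold, the sketch below builds a metadata entry for one ingested file. The `build_registry_entry` helper, the field names, the example URL, and the use of a SHA-256 digest to identify the payload are all assumptions made for this example; they are not taken from the book, whose pipeline is built with Apache NiFi.

```python
import hashlib
import mimetypes
from datetime import datetime, timezone


def build_registry_entry(filename, source, payload, received_at=None):
    """Build a metadata record for one ingested file (hypothetical schema)."""
    received_at = received_at or datetime.now(timezone.utc)
    # Guess the content type from the file name; fall back to a generic type.
    content_type, _ = mimetypes.guess_type(filename)
    return {
        "filename": filename,                      # what we received
        "source": source,                          # where it came from
        "received_at": received_at.isoformat(),    # when we received it
        "size_bytes": len(payload),                # how big it is
        "content_type": content_type or "application/octet-stream",
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


# Example: register a small, made-up CSV payload.
entry = build_registry_entry(
    "events.csv",
    "http://example.org/feed",
    b"id,date\n1,20170329\n",
)
```

The payload itself would still go to storage; only a small record like `entry` would be written to whichever metadata index is chosen.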

Choices and more choices

The choice of which technology we use to store this metadata is, as we have seen, one based upon knowledge and experience. For metadata indexing...

Quality assurance

With an initial data ingestion capability implemented, and data streaming onto your platform, you will need to decide how much quality assurance is required at the "front door". It's perfectly viable to start with no initial quality controls and build them up over time (retrospectively scanning historical data as time and resources allow). However, it may be prudent to install a basic level of verification to begin with. Examples of such basic checks include file integrity, parity checking, completeness, checksums, type checking, field counting, overdue files, security field pre-population, denormalization, and so on.
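Two of the checks listed above, checksums and field counting, can be sketched in a few lines. This is a minimal example under stated assumptions: the function names, the tab delimiter, the choice of SHA-256, and the sample data are all hypothetical and are not drawn from the book.

```python
import csv
import hashlib
import io


def verify_checksum(payload, expected_sha256):
    """Return True if the payload's SHA-256 digest matches the expected value."""
    return hashlib.sha256(payload).hexdigest() == expected_sha256.lower()


def rows_with_wrong_field_count(payload, expected_fields, delimiter="\t"):
    """Return the 1-based numbers of rows whose field count is unexpected."""
    text = io.StringIO(payload.decode("utf-8"))
    reader = csv.reader(text, delimiter=delimiter)
    return [n for n, row in enumerate(reader, start=1)
            if len(row) != expected_fields]


# Made-up tab-separated sample: the second row is missing a field.
sample = b"1\t20170329\tUS\n2\t20170329\n3\t20170330\tFR\n"
bad_rows = rows_with_wrong_field_count(sample, expected_fields=3)
checksum_ok = verify_checksum(sample, hashlib.sha256(sample).hexdigest())
```

Checks of this kind are cheap relative to full parsing, which matters when the time available before the next dataset arrives is limited.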

You should take care that your up-front checks do not take too long. Depending on the intensity of your examinations and the size of your data, it's not uncommon to encounter a situation where there is not enough time to perform all processing before the next dataset arrives. You will always need to monitor your cluster resources and calculate...

Summary

In this chapter, we walked through the full setup of an Apache NiFi GDELT ingest pipeline, complete with metadata forks and a brief introduction to visualizing the resulting data. This section is particularly important as GDELT is used extensively throughout the book and the NiFi method is a highly effective way to source data in a scalable and modular way.

In the next chapter, we will get to grips with what to do with the data once it's landed, by looking at schemas and formats.

Key benefits

  • Develop and apply advanced analytical techniques with Spark
  • Learn how to tell a compelling story with data science using Spark’s ecosystem
  • Explore data at scale and work with cutting edge data science methods

Description

Data science seeks to transform the world using data, and this is typically achieved through disrupting and changing real processes in real industries. In order to operate at this level you need to build data science solutions of substance: solutions that solve real problems. Spark has emerged as the big data platform of choice for data scientists due to its speed, scalability, and easy-to-use APIs. This book deep dives into using Spark to deliver production-grade data science solutions. This process is demonstrated by exploring the construction of a sophisticated global news analysis service that uses Spark to generate continuous geopolitical and current affairs insights.

You will learn all about the core Spark APIs and take a comprehensive tour of advanced libraries, including Spark SQL, Spark Streaming, MLlib, and more. You will be introduced to advanced techniques and methods that will help you to construct commercial-grade data products. Focusing on a sequence of tutorials that deliver a working news intelligence service, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms so they scale linearly.

Who is this book for?

This book is for those who have beginner-level familiarity with the Spark architecture and data science applications, especially those who are looking for a challenge and want to learn cutting edge techniques. This book assumes working knowledge of data science, common machine learning methods, and popular data science tools, and assumes you have previously run proof of concept studies and built prototypes.

What you will learn

  • Learn the design patterns that integrate Spark into industrialized data science pipelines
  • See how commercial data scientists design scalable and reusable code for data science services
  • Explore cutting edge data science methods so that you can study trends and causality
  • Discover advanced programming techniques using RDD and the DataFrame and Dataset APIs
  • Find out how Spark can be used as a universal ingestion engine tool and as a web scraper
  • Practice the implementation of advanced topics in graph processing, such as community detection and contact chaining
  • Get to know the best practices when performing Extended Exploratory Data Analysis, commonly used in commercial data science teams
  • Study advanced Spark concepts, solution design patterns, and integration architectures
  • Demonstrate powerful data science pipelines

Product Details

Publication date: Mar 29, 2017
Length: 560 pages
Edition: 1st
Language: English
ISBN-13: 9781785882142
Vendor: Apache

Packt Subscriptions

$19.99 billed monthly
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Simple pricing, no contract

$199.99 billed annually
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Choose a DRM-free eBook or Video every month to keep
  • PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
  • Exclusive print discounts

$279.99 billed in 18 months
  • Unlimited access to Packt's library of 7,000+ practical books and videos
  • Constantly refreshed with 50+ new titles a month
  • Exclusive early access to books as they're written
  • Solve problems while you work with advanced search and reference features
  • Offline reading on the mobile app
  • Choose a DRM-free eBook or Video every month to keep
  • PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
  • Exclusive print discounts

Frequently bought together


  • Mastering Spark for Data Science: $60.99
  • Apache Spark 2.x Machine Learning Cookbook: $54.99
  • Learning Apache Spark 2: $48.99
Total: $164.97

Table of Contents

14 Chapters
1. The Big Data Science Ecosystem
2. Data Acquisition
3. Input Formats and Schema
4. Exploratory Data Analysis
5. Spark for Geographic Analysis
6. Scraping Link-Based External Data
7. Building Communities
8. Building a Recommendation System
9. News Dictionary and Real-Time Tagging System
10. Story De-duplication and Mutation
11. Anomaly Detection on Sentiment Analysis
12. TrendCalculus
13. Secure Data
14. Scalable Algorithms

Customer reviews

Rating distribution
4 out of 5 (2 ratings)
5 star 50%
4 star 0%
3 star 50%
2 star 0%
1 star 0%
Sumit Pal, May 25, 2017
Rating: 5 out of 5
This book is for an intermediate to an expert level knowledge on Spark, Algorithms and Data Science in general. Each of the authors of the book are experts and highly accomplished craftsmen in their respective fields. The in-depth coverage in the book in terms of coverage, depth, variety of algorithms and the pure fun, elegance of working with Spark and Scala code - leaves nothing more to be desired from a book of this calibre. The code is well written, and tested and explanations of the reasoning behind the code - why it is used and appropriate usage as per the algorithm makes the book highly readable. I have read numerous books on Spark for Data Processing, Streaming and Machine Learning - and this one stands out in terms of its organization, approach to solving problems in the Data Science space. I highly recommend the book. I have read the book 2 times (while doing Technical reviewing - I was the technical reviewer of the book) and again after it was published. I am hooked to reading it again. This book will not teach you Spark in terms of its basics, deployments, performance tuning.
Amazon verified review

Amanda, Jan 12, 2018
Rating: 3 out of 5
There is definitely a market for Data Science books that are aimed at intermediate/advanced users and there is certainly a wealth of information contained within these pages. The examples were interesting enough to keep me engaged. There is the usual poor Packt editing and there were a few spelling mistakes to annoy the pedants among us. A word of caution though - don't buy this book thinking it will teach you how to use Kafka, Avro, NiFi, Accumulo - you will need to be well versed in how to use these products and link them as well as the usual Hadoop, Spark and Scala if you want to code the examples.
Amazon verified review

FAQs

What is included in a Packt subscription?

A subscription provides you with full access to view all Packt and licensed content online, including exclusive access to Early Access titles. Depending on the tier chosen, you can also earn credits and discounts to use for owning content.

How can I cancel my subscription?

To cancel your subscription, go to the account page, found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription. From there, you will see the ‘cancel subscription’ button in the grey box containing your subscription information.

What are credits?

Credits can be earned by reading 40 sections of any title within the payment cycle (a month starting from the day of subscription payment). You also earn a credit every month if you subscribe to our annual or 18-month plans. Credits can be used to buy books DRM-free, the same way that you would pay for a book. Your credits can be found on the subscription homepage (subscription.packtpub.com) by clicking on the ‘my library’ dropdown and selecting ‘credits’.

What happens if an Early Access Course is cancelled?

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title?

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team. 

Can I download the code files for Early Access titles?

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date?

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the more accurate the delivery date will become.

How will I know when new chapters are ready?

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready and download or read them online.

I am a Packt subscriber, do I get Early Access?

Yes, all Early Access content is fully available through your subscription. You will need a paid-for or active trial subscription in order to access all titles.

How is Early Access delivered?

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content?

Early Access is a way of us getting our content to you quicker, but the method of buying the Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you’ll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access?

Keeping up to date with the latest technology is difficult: new versions, new frameworks, new techniques. This feature gives you a head start on our content as it's being created. With Early Access you'll receive each chapter as it's written, and get regular updates throughout the product's development, as well as the final course as soon as it's ready.

We created Early Access as a means of giving you the information you need, as soon as it's available. As we go through the process of developing a course, 99% of it can be ready but we can't publish until that last 1% falls into place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.