Big Data Analytics: Real time analytics using Apache Spark and Hadoop

By Venkat Ankam

Chapter 1. Big Data Analytics at a 10,000-Foot View

The goal of this book is to familiarize you with the tools and techniques of Apache Spark, with a focus on Hadoop deployments and the tools used on the Hadoop platform. Most production implementations of Spark run on Hadoop clusters, and users experience many integration challenges with the wide variety of tools used with Spark and Hadoop. This book addresses the integration challenges faced with the Hadoop Distributed File System (HDFS) and Yet Another Resource Negotiator (YARN), and explains the various tools used with Spark and Hadoop. It also discusses all the Spark components (Spark Core, Spark SQL, DataFrames, Datasets, Spark Streaming, Structured Streaming, MLlib, GraphX, and SparkR) and their integration with analytics components such as Jupyter, Zeppelin, Hive, and HBase, and with dataflow tools such as NiFi. A real-time example of a recommendation system using MLlib will help us understand data science techniques.

In this chapter, we will approach Big Data analytics from a broad perspective and try to understand what tools and techniques are used on the Apache Hadoop and Apache Spark platforms.

Big Data analytics is the process of analyzing Big Data to provide past, current, and future statistics and useful insights that can be used to make better business decisions.

Big Data analytics is broadly classified into two major categories, data analytics and data science, which are interconnected disciplines. This chapter will explain the differences between data analytics and data science. Current industry definitions for data analytics and data science vary according to their use cases, but let's try to understand what they accomplish.

Data analytics focuses on the collection and interpretation of data, typically with a focus on past and present statistics. Data science, on the other hand, focuses on the future by performing explorative analytics to provide recommendations based on models identified by past and present data.

Figure 1.1 explains the difference between data analytics and data science with respect to time and the value achieved. It also shows the typical questions asked and the tools and techniques used. Data analytics mainly comprises two types of analytics: descriptive and diagnostic. Data science comprises two further types: predictive and prescriptive. The following diagram explains data science and data analytics:

Figure 1.1: Data analytics versus data science

The following list contrasts the two disciplines with respect to perspective, nature of work, outputs, tools, techniques, and skill sets:

  • Perspective: looking backward (data analytics); looking forward (data science)
  • Nature of work: report and optimize (data analytics); explore, discover, investigate, and visualize (data science)
  • Output: reports and dashboards (data analytics); data products (data science)
  • Typical tools used: Hive, Impala, Spark SQL, and HBase (data analytics); MLlib and Mahout (data science)
  • Typical techniques used: ETL and exploratory analytics (data analytics); predictive analytics and sentiment analytics (data science)
  • Typical skill set necessary: data engineering, SQL, and programming (data analytics); statistics, machine learning, and programming (data science)

This chapter will cover the following topics:

  • Big Data analytics and the role of Hadoop and Spark
  • Big Data science and the role of Hadoop and Spark
  • Tools and techniques
  • Real-life use cases

Big Data analytics and the role of Hadoop and Spark

Conventional data analytics uses relational database management systems (RDBMS) to create data warehouses and data marts for analytics using business intelligence tools. RDBMS databases use the Schema-on-Write approach, which has many downsides.

Traditional data warehouses were designed to Extract, Transform, and Load (ETL) data in order to answer a set of predefined questions, which are directly related to user requirements. Predefined questions are answered using SQL queries. Once the data is transformed and loaded in a consumable format, it becomes easier for users to access it with a variety of tools and applications to generate reports and dashboards. However, creating data in a consumable format requires several steps, which are listed as follows:

  1. Deciding predefined questions.
  2. Identifying and collecting data from source systems.
  3. Creating ETL pipelines to load the data into the analytic database in a consumable format.

If new questions arise, systems need to identify and add new data sources and create new ETL pipelines. This involves schema changes in databases, and the implementation effort typically ranges from one to six months. This is a big constraint, and it forces data analysts to operate within predefined boundaries only.

Transforming data into a consumable format generally results in losing raw/atomic data that might have insights or clues to the answers that we are looking for.

Processing structured and unstructured data is another challenge in traditional data warehousing systems. Storing and processing large binary images or videos effectively is always a challenge.

Big Data analytics does not use relational databases; instead, it uses the Schema-on-Read (SOR) approach on the Hadoop platform, typically with Hive and HBase. This approach has many advantages. Figure 1.2 shows the Schema-on-Write and Schema-on-Read scenarios:

Figure 1.2: Schema-on-Write versus Schema-on-Read

The Schema-on-Read approach introduces flexibility and reusability to systems. This paradigm emphasizes storing data in a raw, unmodified format and applying a schema as needed, typically while the data is being read or processed. This allows considerably more flexibility in the amount and type of data that can be stored, and multiple schemas can be applied to the same raw data to ask a variety of questions. If new questions need to be answered, simply store the new data in a new HDFS directory and start answering them.

This approach also provides massive flexibility over how the data can be consumed, with multiple approaches and tools. For example, the same raw data can be analyzed using SQL analytics or complex Python or R scripts in Spark. Because data is not stored in the multiple layers needed for ETL, storage and data movement costs are reduced. Analytics can be performed on unstructured and semi-structured data sources along with structured data sources.
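
To make this concrete, here is a minimal PySpark sketch of Schema-on-Read; the HDFS path and column names are hypothetical. The raw files stay unmodified, and an explicit schema is applied only when the data is read:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Hypothetical schema for one question; a different schema can be applied
# to the same raw files later to answer a different question.
sales_schema = StructType([
    StructField("order_id", StringType()),
    StructField("customer", StringType()),
    StructField("amount", DoubleType()),
])

# The raw CSV files in HDFS stay untouched; the schema is applied on read.
sales = spark.read.csv("hdfs:///data/raw/sales/", schema=sales_schema)
sales.groupBy("customer").sum("amount").show()
```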

A typical Big Data analytics project life cycle

The life cycle of Big Data analytics on Big Data platforms such as Hadoop is similar to that of traditional data analytics projects. However, a major paradigm shift is the use of the Schema-on-Read approach for data analytics.

A Big Data analytics project involves the activities shown in Figure 1.3:

Figure 1.3: The Big Data analytics life cycle

Identifying the problem and outcomes

Identify the business problem and the desired outcome of the project clearly, so that you can scope what data is needed and what analytics can be performed. Some examples of business problems are declining company sales, customers visiting the website but not buying products, customers abandoning shopping carts, and a sudden rise in support call volume. Some examples of project outcomes are improving the buying rate by 10%, decreasing shopping cart abandonment by 50%, and reducing support call volume by 50% by the next quarter while keeping customers happy.

Identifying the necessary data

Identify the quality, quantity, format, and sources of data. Data sources can be data warehouses (OLAP), application databases (OLTP), log files from servers, documents from the Internet, and data generated from sensors and network hubs. Identify all the internal and external data source requirements. Also, identify the anonymization and re-identification requirements needed to remove or mask personally identifiable information (PII).

Data collection

Collect data from relational databases using the Sqoop tool, and stream data using Flume. Consider using Apache Kafka for reliable intermediate storage. Design and collect data with fault-tolerance scenarios in mind.
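
Sqoop and Flume are typically driven from the command line and configuration files, while the Kafka hand-off can be sketched in a few lines of Python. The following is a minimal, hypothetical sketch using the third-party kafka-python package; the broker address, topic name, and payload are all assumptions:

```python
from kafka import KafkaProducer  # third-party package: kafka-python

# Hypothetical broker and topic; in a real deployment these come from the
# cluster configuration.
producer = KafkaProducer(bootstrap_servers="broker1:9092")
producer.send("call-center-events", b'{"caller_id": "555-0100", "wait_s": 42}')
producer.flush()  # block until the event is handed off to Kafka
```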

Preprocessing data and ETL

Data comes in different formats, and there can be data quality issues. The preprocessing step converts the data to the needed format and cleanses inconsistent, invalid, or corrupt data. The performing-analytics phase is initiated once the data conforms to the needed format. Apache Hive, Apache Pig, and Spark SQL are great tools for preprocessing massive amounts of data.

This step may not be needed in some projects if the data is already in a clean format or analytics are performed directly on the source data with the Schema-on-Read approach.
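
As a rough sketch of what this step can look like in PySpark (the paths and column names are hypothetical), the following drops incomplete and duplicate records, normalizes a column, and writes the result in a columnar format for the analytics phase:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("preprocessing").getOrCreate()

# Read raw JSON events (hypothetical path and columns).
raw = spark.read.json("hdfs:///data/raw/events/")

clean = (raw
         .dropna(subset=["event_id", "event_time"])  # drop incomplete records
         .dropDuplicates(["event_id"])               # remove duplicates
         .withColumn("event_type", F.lower(F.col("event_type"))))  # normalize

# Write the cleansed data in a columnar format for the analytics phase.
clean.write.mode("overwrite").parquet("hdfs:///data/clean/events/")
```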

Performing analytics

Analytics are performed in order to answer business questions. This requires an understanding of the data and the relationships between data points. The types of analytics performed are descriptive and diagnostic analytics, which present past and current views of the data. These typically answer questions such as what happened and why it happened. In some cases, predictive analytics is performed to answer questions such as what would happen based on a hypothesis.

Apache Hive, Pig, Impala, Drill, Tez, Apache Spark, and HBase are great tools for data analytics in batch processing mode. Real-time analytics tools such as Impala, Tez, Drill, and Spark SQL can be integrated with traditional business intelligence tools (Tableau, QlikView, and others) for interactive analytics.
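
For instance, a descriptive, batch-mode question can be answered directly with Spark SQL. This sketch assumes the hypothetical cleansed dataset from the preprocessing step:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("descriptive-analytics").getOrCreate()

events = spark.read.parquet("hdfs:///data/clean/events/")
events.createOrReplaceTempView("events")

# A typical "what happened" question answered with plain SQL.
spark.sql("""
    SELECT event_type, COUNT(*) AS cnt
    FROM events
    GROUP BY event_type
    ORDER BY cnt DESC
""").show()
```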

Visualizing data

Data visualization is the presentation of analytics output in a pictorial or graphical format to understand the analysis better and make business decisions based on the data.

Typically, finished data is exported from Hadoop to RDBMS databases using Sqoop for integration into visualization systems, or visualization tools such as Tableau, QlikView, and Excel are integrated directly with Hadoop. Web-based notebooks such as Jupyter, Zeppelin, and Databricks Cloud are also used to visualize data by integrating with Hadoop and Spark components.
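
A common notebook pattern is to aggregate at scale in Spark and hand only the small result to pandas for plotting. This sketch assumes the hypothetical events dataset from earlier, and that pandas and matplotlib are available in the notebook:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("visualization").getOrCreate()

events = spark.read.parquet("hdfs:///data/clean/events/")

# Aggregate at scale in Spark, then collect only the small result.
daily = (events.groupBy(F.to_date("event_time").alias("day"))
               .count()
               .orderBy("day")
               .toPandas())

# In a notebook, pandas' matplotlib-backed plot renders inline.
daily.plot(x="day", y="count", kind="line")
```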

The role of Hadoop and Spark

Hadoop and Spark provide you with great flexibility in Big Data analytics:

  • Large-scale data preprocessing; massive datasets can be preprocessed with high performance
  • Exploring large and full datasets; the dataset size does not matter
  • Accelerating data-driven innovation by providing the Schema-on-Read approach
  • A variety of tools and APIs for data exploration

Big Data science and the role of Hadoop and Spark

Data science is all about the following two aspects:

  • Extracting deep meaning from the data
  • Creating data products

Extracting deep meaning from data means fetching the value using statistical algorithms. A data product is a software system whose core functionality depends on the application of statistical analysis and machine learning to the data. Google AdWords or Facebook's People You May Know are a couple of examples of data products.

A fundamental shift from data analytics to data science

The fundamental shift from data analytics to data science is due to the rising need for better predictions and better data products.

Let's consider an example use case that explains the difference between data analytics and data science.

Problem: A large telecoms company has multiple call centers that collect caller information and store it in databases and filesystems. The company has already implemented data analytics on the call center data, which provided the following insights:

  • Service availability
  • The average speed of answering, average hold time, average wait time, and average call time
  • The call abandon rate
  • The first call resolution rate and cost per call
  • Agent occupancy

Now, the telecoms company would like to reduce customer churn, improve the customer experience and service quality, and cross-sell and up-sell by understanding customers in near real time.

Solution: Analyze the customer's voice. The customer's voice holds deeper insights than any other information. Convert all calls to text using tools such as CMU Sphinx, and scale out on the Hadoop platform. Perform text analytics to derive insights from the data. To achieve high accuracy in call-to-text conversion, create language and acoustic models that are tailored to the company, and retrain the models frequently as conditions change. Also, create models for text analytics using machine learning and natural language processing (NLP) to derive the following metrics, combining them with the data analytics metrics:

  • Top reasons for customer churn
  • Customer sentiment analysis
  • Customer and problem segmentation
  • 360-degree view of the customer

Notice that the business requirement of this use case created a fundamental shift from data analytics to data science, requiring the implementation of machine learning and NLP algorithms. To implement this solution, new tools and techniques are used, and a new role, data scientist, is needed.

A data scientist has a combination of multiple skill sets—statistics, software programming, and business expertise. Data scientists create data products and extract value from the data. Let's see how data scientists differ from other roles. This will help us in understanding roles and tasks performed in data science and data analytics projects.

Data scientists versus software engineers

The difference between the data scientist and software engineer roles is as follows:

  • Software engineers develop general-purpose software for applications based on business requirements
  • Data scientists don't develop application software, but they develop software to help them solve problems
  • Typically, software engineers use Java, C++, and C# programming languages
  • Data scientists tend to focus more on scripting languages such as Python and R

Data scientists versus data analysts

The difference between the data scientist and data analyst roles is as follows:

  • Data analysts perform descriptive and diagnostic analytics using SQL and scripting languages to create reports and dashboards.
  • Data scientists perform predictive and prescriptive analytics using statistical techniques and machine learning algorithms to find answers. They typically use tools such as Python, R, SPSS, SAS, MLlib, and GraphX.

Data scientists versus business analysts

The difference between the data scientist and business analyst roles is as follows:

  • Both have a business focus, so they may ask similar questions
  • Data scientists have the technical skills to find answers

A typical data science project life cycle

Let's learn how to approach and execute a typical data science project.

The typical data science project life cycle shown in Figure 1.4 is iterative, unlike the data analytics project life cycle shown in Figure 1.3. The defining problems and outcomes phase and the communicating results phase sit outside the iterations that improve the outcomes of the project. However, the overall project life cycle is itself iterative, and it needs to be revisited from time to time after production implementation.

Figure 1.4: A data science project life cycle

The defining problems and outcomes phase and the data preprocessing phase are similar to those in the data analytics project life cycle explained in Figure 1.3. So, let's discuss the new steps required in data science projects.

Hypothesis and modeling

Given the problem, consider all the possible solutions that could match the desired outcome. This typically involves a hypothesis about the root cause of the problem. So, questions around the business problem arise, such as: Why are customers canceling the service? Why are support calls increasing significantly? Why are customers abandoning their shopping carts?

A hypothesis identifies the appropriate model, given a deeper understanding of the data. This involves understanding the attributes of the data and their relationships, and building the environment for modeling by defining datasets for training, testing, and production. Create the appropriate model using machine learning algorithms such as logistic regression, k-means clustering, decision trees, or Naive Bayes.
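
The following is a minimal sketch of this step using Spark MLlib's DataFrame-based API to fit a logistic regression churn model. The feature columns, the 0/1 label, and the input path are hypothetical placeholders for the datasets defined in this phase:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("churn-modeling").getOrCreate()

# Hypothetical customer features with a numeric 0/1 'churned' label.
data = spark.read.parquet("hdfs:///data/features/customers/")

assembler = VectorAssembler(
    inputCols=["avg_call_time", "support_calls", "tenure_months"],
    outputCol="features")

# Separate training and test datasets, as described above.
train, test = assembler.transform(data).randomSplit([0.8, 0.2], seed=42)

model = LogisticRegression(labelCol="churned", featuresCol="features").fit(train)
```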

Measuring the effectiveness

Execute the model by running it against the datasets. Measure its effectiveness by checking the results against the desired outcome. Use test data to verify the results, and compute metrics such as the Mean Squared Error (MSE) to measure effectiveness.
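
Continuing the hypothetical churn sketch above, the model's effectiveness can be measured on the held-out test data. Spark's evaluators can report MSE directly, alongside a classification metric that is more natural for a churn model:

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator, RegressionEvaluator

# 'model' and 'test' come from the modeling sketch above.
predictions = model.transform(test)

# MSE, as mentioned in the text (treating the 0/1 predictions as regression output)...
mse = RegressionEvaluator(labelCol="churned", predictionCol="prediction",
                          metricName="mse").evaluate(predictions)

# ...and area under the ROC curve, a common classification metric.
auc = BinaryClassificationEvaluator(labelCol="churned").evaluate(predictions)
print("MSE: %.4f, AUC: %.4f" % (mse, auc))
```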

Making improvements

Measurements will illustrate how much improvement is required. Consider what you might change by asking yourself the following questions:

  • Was the hypothesis around the root cause correct?
  • Would ingesting additional datasets provide better results?
  • Would other solutions provide better results?

Once you've implemented your improvements, test them again and compare them with the previous measurements in order to refine the solution further.

Communicating the results

Communication of the results is an important step in the data science project life cycle. The data scientist tells the story found within the data by correlating the story to business problems. Reports and dashboards are common tools to communicate the results.

The role of Hadoop and Spark

Apache Hadoop provides you with distributed storage and resource management, while Spark provides you with in-memory performance for data science applications. Hadoop and Spark have the following advantages for data science projects:

  • A wide range of applications and third-party packages
  • A machine learning algorithms library for easy usage
  • Spark integrations with deep learning libraries such as H2O and TensorFlow
  • Scala, Python, and R for interactive analytics using the shell
  • A unification feature—using SQL, machine learning, and streaming together

Tools and techniques

Let's take a look at different tools and techniques used in Hadoop and Spark for Big Data analytics.

While the Hadoop platform can be used for both storing and processing data, Spark can be used for processing only, by reading data into memory.
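
This division of labor can be seen in a few lines of PySpark: HDFS provides the storage, and Spark reads the data into memory to process it. The path here is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-vs-compute").getOrCreate()

# Data lives on HDFS (Hadoop's storage layer)...
logs = spark.read.text("hdfs:///data/raw/logs/")

# ...while Spark caches and processes it in memory.
logs.cache()
print(logs.filter(logs.value.contains("ERROR")).count())
```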

The following is a summary of the tools and techniques used in typical Big Data analytics projects, organized by project phase:

Data collection

Tools used:

  • Apache Flume for real-time data collection and aggregation
  • Apache Sqoop for data import and export from relational data stores and NoSQL databases
  • Apache Kafka for publish-subscribe messaging
  • General-purpose tools such as FTP/Copy

Techniques used: real-time data capture, export, import, message publishing, data APIs, and screen scraping.

Data storage and formats

Tools used:

  • HDFS: Primary storage for Hadoop
  • HBase: NoSQL database
  • Parquet: Columnar format
  • Avro: Serialization system on Hadoop
  • Sequence File: Binary key-value pairs
  • RC File: First columnar format in Hadoop
  • ORC File: Optimized RC File
  • XML and JSON: Standard data interchange formats
  • Compression formats: Gzip, Snappy, LZO, Bzip2, Deflate, and others
  • Unstructured formats: text, images, videos, and so on

Techniques used: data storage, data archival, data compression, data serialization, and schema evolution.

Data transformation and enrichment

Tools used:

  • MapReduce: Hadoop's processing framework
  • Spark: Compute engine
  • Hive: Data warehouse and querying
  • Pig: Data flow language
  • Python: Functional programming
  • Crunch, Cascading, Scalding, and Cascalog: Special MapReduce tools

Techniques used: data munging, filtering, joining, ETL, file format conversion, anonymization, and re-identification.

Data analytics

Tools used:

  • Hive: Data warehouse and querying
  • Pig: Data flow language
  • Tez: Alternative to MapReduce
  • Impala: Alternative to MapReduce
  • Drill: Alternative to MapReduce
  • Apache Storm: Real-time compute engine
  • Spark Core: Spark's core compute engine
  • Spark Streaming: Real-time compute engine
  • Spark SQL: SQL analytics
  • Solr: Search platform
  • Apache Zeppelin: Web-based notebook
  • Jupyter Notebooks
  • Databricks Cloud
  • Apache NiFi: Data flow
  • Spark-on-HBase connector
  • Programming languages: Java, Scala, and Python

Techniques used: Online Analytical Processing (OLAP), data mining, data visualization, complex event processing, real-time stream processing, full-text search, and interactive data analytics.

Data science

Tools used:

  • Python: Functional programming
  • R: Statistical computing language
  • Mahout: Hadoop's machine learning library
  • MLlib: Spark's machine learning library
  • GraphX and GraphFrames: Spark's graph processing framework and its DataFrame-based adaptation

Techniques used: predictive analytics, sentiment analytics, text and natural language processing, network analytics, and cluster analytics.

Real-life use cases

Let's take a look at different kinds of use cases for Big Data analytics. Broadly, Big Data analytics use cases are classified into the following five categories:

  • Customer analytics: Data-driven customer insights are necessary to deepen relationships and improve revenue.
  • Operational analytics: Performance and high service quality are the keys to maintaining customers in any industry, from manufacturing to health services.
  • Data-driven products and services: New products and services that align with growing business demands.
  • Enterprise Data Warehouse (EDW) optimization: Early data adopters have warehouse architectures that are 20 years old. Businesses modernize EDW architectures in order to handle the data deluge.
  • Domain-specific solutions: Domain-specific solutions provide businesses with an effective way to implement new features or adhere to industry compliance.

The following shows typical use cases of Big Data analytics, and whether each calls for data analytics, data science, or both:

Customer analytics:

  • A 360-degree view of the customer (data analytics and data science)
  • Call center analytics (data analytics and data science)
  • Sentiment analytics (data science)
  • Recommendation engine, for example, the next best action (data science)

Operational analytics:

  • Log analytics (data analytics)
  • Call center analytics (data analytics)
  • Unstructured data management (data analytics)
  • Document management (data analytics)
  • Network analytics (data analytics and data science)
  • Preventive maintenance (data science)
  • Geospatial data management (data analytics and data science)
  • IoT analytics (data analytics and data science)

Data-driven products and services:

  • Metadata management (data analytics)
  • Operational data services (data analytics)
  • Data/Big Data environments (data analytics)
  • Data marketplaces (data analytics)
  • Third-party data management (data analytics)

EDW optimization:

  • Data warehouse offload (data analytics)
  • Structured Big Data lake (data analytics)
  • Licensing cost mitigation (data analytics)
  • Cloud data architectures (data analytics)
  • Software assessments and migrations (data analytics)

Domain-specific solutions:

  • Fraud and compliance (data analytics and data science)
  • Industry-specific domain models (data analytics)
  • Data sourcing and integration (data analytics)
  • Metrics and reporting solutions (data analytics)
  • Turnkey warehousing solutions (data analytics)

Summary

Big Data analytics with Hadoop and Spark is broadly classified into two major categories: data analytics and data science. While data analytics focuses on past and present statistics, data science focuses on future statistics. While data science projects are iterative in nature, data analytics projects are not iterative.

Apache Hadoop provides you with distributed storage and resource management and Spark provides you with in-memory performance for Big Data analytics. A variety of tools and techniques are used in Big Data analytics depending on the type of use cases and their feasibility.

The next chapter will help you get started with Hadoop and Spark.

Key benefits

  • This book is based on the latest 2.0 version of Apache Spark and the 2.7 version of Hadoop, integrated with the most commonly used tools.
  • Learn all the Spark stack components, including the latest topics such as DataFrames, Datasets, GraphFrames, Structured Streaming, DataFrame-based ML Pipelines, and SparkR.
  • Integrate with frameworks such as HDFS and YARN, and tools such as Jupyter, Zeppelin, NiFi, Mahout, the HBase Spark connector, GraphFrames, H2O, and Hivemall.

Description

Big Data Analytics aims to provide the fundamentals of Apache Spark and Hadoop. All the Spark components (Spark Core, Spark SQL, DataFrames, Datasets, conventional Spark Streaming, Structured Streaming, MLlib, and GraphX) and the Hadoop core components (HDFS, MapReduce, and YARN) are explored in depth, with implementation examples on Spark and Hadoop clusters. The industry is moving away from MapReduce to Spark, so the advantages of Spark over MapReduce are explained in depth to help you reap the benefits of in-memory speeds. The DataFrames API, the Data Sources API, and the new Dataset API are explained for building Big Data analytical applications. Real-time data analytics using Spark Streaming with Apache Kafka and HBase is covered to help you build streaming applications. The new Structured Streaming concept is explained with an IoT (Internet of Things) use case. Machine learning techniques are covered using MLlib, ML Pipelines, and SparkR, and graph analytics is covered with the GraphX and GraphFrames components of Spark. You will also get an opportunity to get started with web-based notebooks such as Jupyter and Apache Zeppelin, and the dataflow tool Apache NiFi, to analyze and visualize data.

Who is this book for?

Though this book is primarily aimed at data analysts and data scientists, it will also help architects, programmers, and practitioners. Knowledge of either Spark or Hadoop would be beneficial. It is assumed that you have a basic programming background in Scala, Python, SQL, or R, along with basic Linux experience. Working experience within Big Data environments is not mandatory.

What you will learn

  • Find out about and implement the tools and techniques of Big Data analytics using Spark on Hadoop clusters, with a wide variety of tools used with Spark and Hadoop
  • Understand all the Hadoop and Spark ecosystem components
  • Get to know all the Spark components: Spark Core, Spark SQL, DataFrames, Datasets, conventional and Structured Streaming, MLlib, ML Pipelines, and GraphX
  • See batch and real-time data analytics using Spark Core, Spark SQL, and conventional and Structured Streaming
  • Get to grips with data science and machine learning using MLlib, ML Pipelines, H2O, Hivemall, GraphX, and SparkR

Product Details

Publication date: Sep 28, 2016
Length: 326 pages
Edition: 1st
Language: English
ISBN-13: 9781785884696

Table of Contents

1. Big Data Analytics at a 10,000-Foot View
2. Getting Started with Apache Hadoop and Apache Spark
3. Deep Dive into Apache Spark
4. Big Data Analytics with Spark SQL, DataFrames, and Datasets
5. Real-Time Analytics with Spark Streaming and Structured Streaming
6. Notebooks and Dataflows with Spark and Hadoop
7. Machine Learning with Spark and Hadoop
8. Building Recommendation Systems with Spark and Mahout
9. Graph Analytics with GraphX
10. Interactive Analytics with SparkR
Index
