Packt+ | Advance your knowledge in tech

You're reading from Scala and Spark for Big Data Analytics Explore the concepts of functional programming, data streaming, and machine learning

Product type Paperback

Published in Jul 2017

Publisher Packt

ISBN-13 9781785280849

Length 796 pages

Edition 1st Edition

Languages

Scala

Tools

Apache Spark

Concepts

Big Data

Authors (2):

Sridhar Alla

Md. Rezaul Karim

View More author details

Chapter 1, Introduction to Scala, will teach big data analytics using the Scala-based APIs of Spark. Spark itself is written with Scala and naturally, as a starting point, we will discuss a brief introduction to Scala, such as the basic aspects of its history, purposes, and how to install Scala on Windows, Linux, and Mac OS. After that, the Scala web framework will be discussed in brief. Then, we will provide a comparative analysis of Java and Scala. Finally, we will dive into Scala programming to get started with Scala.

Chapter 2, Object-Oriented Scala, says that the object-oriented programming (OOP) paradigm provides a whole new layer of abstraction. In short, this chapter discusses some of the greatest strengths of OOP languages: discoverability, modularity, and extensibility. In particular, we will see how to deal with variables in Scala; methods, classes, and objects in Scala; packages and package objects; traits and trait linearization; and Java interoperability.

Chapter 3, Functional Programming Concepts, showcases the functional programming concepts in Scala. More specifically, we will learn several topics, such as why Scala is an arsenal for the data scientist, why it is important to learn the Spark paradigm, pure functions, and higher-order functions (HOFs). A real-life use case using HOFs will be shown too. Then, we will see how to handle exceptions in higher-order functions outside of collections using the standard library of Scala. Finally, we will look at how functional Scala affects an object's mutability.

Chapter4, Collection APIs, introduces one of the features that attract most Scala users--the Collections API. It's very powerful and flexible, and has lots of operations coupled. We will also demonstrate the capabilities of the Scala Collection API and how it can be used in order to accommodate different types of data and solve a wide range of different problems. In this chapter, we will cover Scala collection APIs, types and hierarchy, some performance characteristics, Java interoperability, and Scala implicits.

Chapter 5, Tackle Big Data - Spark Comes to the Party, outlines data analysis and big data; we see the challenges that big data poses, how they are dealt with by distributed computing, and the approaches suggested by functional programming. We introduce Google's MapReduce, Apache Hadoop, and finally, Apache Spark, and see how they embraced this approach and these techniques. We will look into the evolution of Apache Spark: why Apache Spark was created in the first place and the value it can bring to the challenges of big data analytics and processing.

Chapter 6, Start Working with Spark - REPL and RDDs, covers how Spark works; then, we introduce RDDs, the basic abstractions behind Apache Spark, and see that they are simply distributed collections exposing Scala-like APIs. We will look at the deployment options for Apache Spark and run it locally as a Spark shell. We will learn the internals of Apache Spark, what RDDs are, DAGs and lineages of RDDs, Transformations, and Actions.

Chapter 7, Special RDD Operations, focuses on how RDDs can be tailored to meet different needs, and how these RDDs provide new functionalities (and dangers!) Moreover, we investigate other useful objects that Spark provides, such as broadcast variables and Accumulators. We will learn aggregation techniques, shuffling.

Chapter 8, Introduce a Little Structure - SparkSQL, teaches how to use Spark for the analysis of structured data as a higher-level abstraction of RDDs and how Spark SQL's APIs make querying structured data simple yet robust. Moreover, we introduce datasets and look at the differences between datasets, DataFrames, and RDDs. We will also learn to join operations and window functions to do complex data analysis using DataFrame APIs.

Chapter 9, Stream Me Up, Scotty - Spark Streaming, takes you through Spark Streaming and how we can take advantage of it to process streams of data using the Spark API. Moreover, in this chapter, the reader will learn various ways of processing real-time streams of data using a practical example to consume and process tweets from Twitter. We will look at integration with Apache Kafka to do real-time processing. We will also look at structured streaming, which can provide real-time queries to your applications.

Chapter 10, Everything is Connected - GraphX, in this chapter, we learn how many real-world problems can be modeled (and resolved) using graphs. We will look at graph theory using Facebook as an example, Apache Spark's graph processing library GraphX, VertexRDD and EdgeRDDs, graph operators, aggregateMessages, TriangleCounting, the Pregel API, and use cases such as the PageRank algorithm.

Chapter 11, Learning Machine Learning - Spark MLlib and ML, the purpose of this chapter is to provide a conceptual introduction to statistical machine learning. We will focus on Spark's machine learning APIs, called Spark MLlib and ML. We will then discuss how to solve classification tasks using decision trees and random forest algorithms and regression problem using linear regression algorithm. We will also show how we could benefit from using one-hot encoding and dimensionality reductions algorithms in feature extraction before training a classification model. In later sections, we will show a step-by-step example of developing a collaborative filtering-based movie recommendation system.

Chapter 12, My Name is Bayes, Naive Bayes, states that machine learning in big data is a radical combination that has created great impact in the field of research, in both academia and industry. Big data imposes great challenges on ML, data analytics tools, and algorithms to find the real value. However, making a future prediction based on these huge datasets has never been easy. Considering this challenge, in this chapter, we will dive deeper into ML and find out how to use a simple yet powerful method to build a scalable classification model and concepts such as multinomial classification, Bayesian inference, Naive Bayes, decision trees, and a comparative analysis of Naive Bayes versus decision trees.

Chapter 13, Time to Put Some Order - Cluster Your Data with Spark MLlib, gets you started on how Spark works in cluster mode with its underlying architecture. In previous chapters, we saw how to develop practical applications using different Spark APIs. Finally, we will see how to deploy a full Spark application on a cluster, be it with a pre-existing Hadoop installation or without.

Chapter 14, Text Analytics Using Spark ML, outlines the wonderful field of text analytics using Spark ML. Text analytics is a wide area in machine learning and is useful in many use cases, such as sentiment analysis, chat bots, email spam detection, natural language processing, and many many more. We will learn how to use Spark for text analysis with a focus on use cases of text classification using a 10,000 sample set of Twitter data. We will also look at LDA, a popular technique to generate topics from documents without knowing much about the actual text, and will implement text classification on Twitter data to see how it all comes together.

Chapter 15, Spark Tuning, digs deeper into Apache Spark internals and says that while Spark is great in making us feel as if we are using just another Scala collection, we shouldn't forget that Spark actually runs in a distributed system. Therefore, throughout this chapter, we will cover how to monitor Spark jobs, Spark configuration, common mistakes in Spark app development, and some optimization techniques.

Chapter 16, Time to Go to ClusterLand - Deploying Spark on a Cluster, explores how Spark works in cluster mode with its underlying architecture. We will see Spark architecture in a cluster, the Spark ecosystem and cluster management, and how to deploy Spark on standalone, Mesos, Yarn, and AWS clusters. We will also see how to deploy your app on a cloud-based AWS cluster.

Chapter 17, Testing and Debugging Spark, explains how difficult it can be to test an application if it is distributed; then, we see some ways to tackle this. We will cover how to do testing in a distributed environment, and testing and debugging Spark applications.

Chapter 18, PySpark & SparkR, covers the other two popular APIs for writing Spark code using R and Python, that is, PySpark and SparkR. In particular, we will cover how to get started with PySpark and interacting with DataFrame APIs and UDFs with PySpark, and then we will do some data analytics using PySpark. The second part of this chapter covers how to get started with SparkR. We will also see how to do data processing and manipulation, and how to work with RDD and DataFrames using SparkR, and finally, some data visualization using SparkR.

Chapter 19, Advanced Machine Learning Best Practices, provides theoretical and practical aspects of some advanced topics of machine learning with Spark. We will see how to tune machine learning models for optimized performance using grid search, cross-validation, and hyperparameter tuning. In a later section, we will cover how to develop a scalable recommendation system using ALS, which is an example of a model-based recommendation algorithm. Finally, a topic modelling application will be demonstrated as a text clustering technique

Appendix A, Accelerating Spark with Alluxio, shows how to use Alluxio with Spark to increase the speed of processing. Alluxio is an open source distributed memory storage system useful for increasing the speed of many applications across platforms, including Apache Spark. We will explore the possibilities of using Alluxio and how Alluxio integration will provide greater performance without the need to cache the data in memory every time we run a Spark job.

Appendix B, Interactive Data Analytics with Apache Zeppelin, says that from a data science perspective, interactive visualization of your data analysis is also important. Apache Zeppelin is a web-based notebook for interactive and large-scale data analytics with multiple backends and interpreters. In this chapter, we will discuss how to use Apache Zeppelin for large-scale data analytics using Spark as the interpreter in the backend.

Chapter 19 and Appendices are not present in the book but are available for download
at the following link: https://www.packtpub.com/sites/default/files/downloads/ScalaandSparkforBigDataAnalytics_OnlineChapter_Appendices.pdf.