What do you get with eBook?

Instant access to your Digital eBook purchase

Download this book in EPUB and PDF formats

Access this title in our online reader with advanced features

DRM FREE - Read whenever, wherever and however you want

Apache Spark SQL

In this chapter, we will examine Apache Spark SQL, SQL, DataFrames, and Datasets, all of which sit on top of Resilient Distributed Datasets (RDDs). DataFrames were introduced in Spark 1.3, replacing SchemaRDDs, and are columnar data storage structures roughly equivalent to relational database tables. Datasets were introduced as an experimental feature in Spark 1.6 and became an additional component in Spark 2.0.

We have tried to reduce the dependencies between individual chapters as much as possible so that you can work through them in any order you like. However, we do recommend that you read this chapter, because the other chapters depend on knowledge of DataFrames and Datasets.

This chapter will cover the following topics:

SparkSession
Importing and saving data
Processing the text files
Processing the JSON files
Processing the Parquet files
DataSource...

Key benefits

Master the art of real-time Big Data processing using Apache Spark 2.x

Perform machine learning, deep learning and streaming data analytics by extending the most up-to-date functionalities of Apache Spark

An advanced guide with a unique combination of tips, instructions and practical examples on using Apache Spark effectively

Description

Apache Spark is an in-memory, cluster-based Big Data processing system that provides a wide range of functionality, such as graph processing, machine learning, stream processing, and more. This book will take your knowledge of Apache Spark to the next level by teaching you how to expand Spark’s functionality and build your data flows and machine/deep learning programs on top of the platform. The book starts with a quick overview of the Apache Spark ecosystem and introduces you to the new features and capabilities in Apache Spark 2.x. You will then work with the different modules in Apache Spark, such as interactive querying with Spark SQL, using DataFrames and Datasets effectively, streaming analytics with Spark Streaming, and performing machine learning and deep learning on Spark using MLlib and external tools such as H2O and Deeplearning4j. The book also contains chapters on efficient graph processing, memory management, and using Apache Spark in the cloud. By the end of this book, you will have all the necessary information to master Apache Spark and use it efficiently for Big Data processing and analytics.

Who is this book for?

If you are an intermediate-level Spark developer looking to master the advanced capabilities and use-cases of Apache Spark 2.x, this book is for you. Big Data professionals who wish to know how to integrate and use the features of Apache Spark to build a strong Big Data pipeline will also find this book to be a useful resource. A fundamental knowledge of Apache Spark and the Scala programming language is assumed.

What you will learn

• Get to grips with the newly introduced features in Apache Spark 2.x

• Perform highly optimised unified batch and real-time data processing using SparkSQL and Structured Streaming

• Evaluate large-scale Graph Processing and Analysis using GraphX and GraphFrames

• Perform advanced machine learning and deep learning with Spark MLlib, SparkML, SystemML, H2O and DeepLearning4J

• Learn how specific parameter settings affect overall performance of an Apache Spark cluster

• Apply Apache Spark in Elastic deployments using Jupyter and Zeppelin Notebooks, Docker, Kubernetes and the IBM Cloud

M.R. Oct 14, 2017

Romeo Kienzler takes the reader on a big and detailed tour through significant Spark topics and exercises that occur in the practical usage of Spark in Big Data, Analytics, Data Science and Analytic Data Warehouse ("ADW") projects. The book covers topics such as the new Spark V2 ecosystem, Machine Learning, Spark Streaming, Graph Processing, Cluster Design and Management (YARN and Mesos), cloud-based deployments, performance topics around HDFS, data importing and handling, the Spark Data Source API, the Spark DataFrames and Datasets API, code generation for expression evaluation, Project Tungsten, Spark error handling and much more. If you have taken one or more of the well-done Spark courses from Databricks before, the topics might be familiar, but the book also covers some more advanced topics, so it can be used as a good overview or as in-depth notes. Additionally, the book focuses on very specific details and problems in parallel programming with Spark, derived from practical use cases. The book also contains links and references to papers, literature and web forums. To summarize, I would recommend this book as an excellent starting point and Spark reference guide.

Amazon Verified review

Dr. Raj Kamal, Aug 30, 2018

Good book to start Spark. Helped me greatly to finish my upcoming book from McGraw-Hill on Big Data Analytics. My students work on and analyze Big Data data sets using Spark.

Mastering Apache Spark 2.x: Advanced techniques in complex Big Data processing, streaming analytics and machine learning, Second Edition

Mastering Apache Spark 2.x

Apache Spark SQL

The SparkSession--your gateway to structured data processing

Importing and saving data

Processing the text files

Understanding the DataSource API

Implicit schema discovery

DataFrames

Using SQL

Defining schemas manually

Using Datasets
