Azure Databricks Cookbook

You're reading from Azure Databricks Cookbook: Accelerate and scale real-time analytics solutions using the Apache Spark-based analytics service

Product type: Paperback
Published in: Sep 2021
Publisher: Packt
ISBN-13: 9781789809718
Length: 452 pages
Edition: 1st Edition
Authors (2): Vinod Jaiswal, Phani Raj
Table of Contents (12 chapters)

Preface
Chapter 1: Creating an Azure Databricks Service
Chapter 2: Reading and Writing Data from and to Various Azure Services and File Formats (FREE CHAPTER)
Chapter 3: Understanding Spark Query Execution
Chapter 4: Working with Streaming Data
Chapter 5: Integrating with Azure Key Vault, App Configuration, and Log Analytics
Chapter 6: Exploring Delta Lake in Azure Databricks
Chapter 7: Implementing Near-Real-Time Analytics and Building a Modern Data Warehouse
Chapter 8: Databricks SQL
Chapter 9: DevOps Integrations and Implementing CI/CD for Azure Databricks
Chapter 10: Understanding Security and Monitoring in Azure Databricks
Other Books You May Enjoy

Reading and writing data from and to JSON, including nested JSON 

Spark SQL automatically detects the schema of a JSON dataset from the files and loads it as a DataFrame. It also provides options for reading and writing JSON data. Nested JSON can also be parsed, and its fields can be accessed directly without any explicit transformations.

Getting ready

You can follow along by running the 2_8.Reading and Writing data from and to Json including nested json.ipynb notebook in the Chapter02 folder of your locally cloned repository.

Upload the JsonData folder from the Chapter02/sensordata folder to your ADLS Gen-2 account, which has sensordata as its file system. We are mounting the ADLS Gen-2 storage account's sensordata file system to /mnt/SensorData.
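If the storage account is not mounted yet, the following is a minimal sketch of the mount command. The service principal details (application ID, secret scope and key, tenant ID) and the storage account name are placeholders you must replace with your own values; they are not part of the recipe's notebook.

    # Placeholder service principal credentials for mounting ADLS Gen-2 (replace with your own)
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<application-id>",
        "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<secret-name>"),
        "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
    }
    # Mount the sensordata file system to /mnt/SensorData
    dbutils.fs.mount(
        source="abfss://sensordata@<storage-account-name>.dfs.core.windows.net/",
        mount_point="/mnt/SensorData",
        extra_configs=configs
    )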

The JsonData folder contains two subfolders: SimpleJsonData, which holds files with a simple JSON structure, and JsonData, which holds files with a nested JSON structure.

Note

The code was tested on Databricks Runtime Version 7.3 LTS, which includes Spark 3.0.1.

In the upcoming sections, we will learn how to process both simple and complex JSON data files. We will use sensor data files with simple and nested schemas.

How to do it…

In this section, you will see how to read and write simple JSON data files:

  1. You can read JSON data files using the following code snippet. You need to set the multiline option to true when reading a JSON file that spans multiple lines; for a single-line JSON data file, this option can be skipped.
    df_json = spark.read.option("multiline","true").json("/mnt/SensorData/JsonData/SimpleJsonData/")
    display(df_json)
  2. After executing the preceding code, you can see the schema of the JSON data.
    Figure 2.21 – Simple JSON data

  3. Writing a JSON file is similar to writing a CSV file. You can use the following code snippet to write a JSON file using Azure Databricks. You can specify different mode options when writing JSON data, such as append, overwrite, ignore, and error or errorifexists. The error mode is the default, and it throws an exception if data already exists at the destination.
    df_json.write.format("json").mode("overwrite").save("/mnt/SensorData/JsonData/SimpleJsonData/")

Very often, you will come across scenarios where you need to process complex datatypes such as arrays, maps, and structs.

To encode a struct type as a string and to read the struct type as a complex type, Spark provides functions such as to_json() and from_json(). If you are receiving data from the streaming source in nested JSON format, then you can use the from_json() function to process the input data before sending it downstream.
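
As an illustration, here is a minimal, hypothetical sketch of using from_json() on such input. The input DataFrame raw_stream_df, its body column, and the schema below are assumptions made for the example, not part of the recipe's notebook.

    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, ArrayType

    # Hypothetical schema for a nested sensor payload arriving as a JSON string
    payload_schema = StructType([
        StructField("_id", StringType()),
        StructField("owners", ArrayType(StructType([
            StructField("name", StringType()),
            StructField("phone", StringType())
        ])))
    ])

    # raw_stream_df is assumed to have a body column holding the JSON payload
    parsed_df = raw_stream_df.withColumn(
        "payload", from_json(col("body").cast("string"), payload_schema)
    )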

Similarly, you can use the to_json() method to encode or convert columns in a DataFrame to a JSON string and send the dataset to various destinations, such as Event Hubs, Data Lake Storage, an Azure Cosmos DB database, or RDBMS systems such as SQL Server and Oracle. The following steps show how to process simple and nested JSON.

  1. Execute the following code to display the dataset from the mount location of the storage account.
    df_json = spark.read.option("multiline","true").json("dbfs:/mnt/SensorData/JsonData/")
  2. You can see that the vehicle sensor data has the Owners attribute in multiline JSON format.
    Figure 2.22 – Complex JSON data

  3. To convert the Owners attribute into row format for joining or transforming the data, you can use the explode() function. Execute the command that performs this operation (a sketch of one possible command is shown after these steps). Here you can see that, using the explode() function, the elements of the array have been converted into two columns named name and phone.
    Figure 2.23 – Explode function

  4. To encode or convert the columns in the DataFrame to a JSON string, the to_json() method can be used. In this example, we will create a new DataFrame with a JSON column from the existing DataFrame data_df from the preceding step. Execute the following code to create the new DataFrame with the JSON column.
    from pyspark.sql.functions import to_json, struct
    jsonDF = data_df.withColumn("jsonCol", to_json(struct([data_df[x] for x in data_df.columns]))).select("jsonCol")
    display(jsonDF)
  5. After executing the preceding command, you will get a new DataFrame with a column named jsonCol.
    Figure 2.24 – Encode to JSON

  6. Using the to_json() method, you have converted the _id, name, and phone columns into a single new JSON column.
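
For reference, the explode() operation mentioned in step 3 could look like the following minimal sketch. The column names (owners, _id, name, phone) are assumptions based on the figures, and naming the result data_df simply keeps it consistent with step 4.

    from pyspark.sql.functions import explode, col

    # Flatten the array of owner structs into one row per owner,
    # exposing the nested name and phone fields as top-level columns
    data_df = (df_json
        .select(col("_id"), explode(col("owners")).alias("owner"))
        .select("_id", col("owner.name").alias("name"), col("owner.phone").alias("phone")))
    display(data_df)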

How it works…

Spark provides you with options to process data from multiple sources and in multiple formats. You can process and enrich the data before sending it to downstream applications. Data can be sent downstream with low latency for streaming data or with high throughput for historical data.
