Business Intelligence with Databricks SQL

Concepts, tools, and techniques for scaling business intelligence on the data lakehouse

Product type: Paperback
Published: Sep 2022
Publisher: Packt
ISBN-13: 9781803235332
Length: 348 pages
Edition: 1st Edition
Author: Vihag Gupta

An overview of the Lakehouse architecture

If, at this point, you are a bit confused by terms such as Databricks, lakehouse, Databricks SQL, and more, worry not. We are just at the beginning of our learning journey, and we will unpack all of these terms throughout this book.

First, what is Databricks?

Databricks is a platform that enables enterprises to quickly build their data lakehouse infrastructure and allows all data personas in the organization – data engineers, data scientists, and business intelligence personnel – to extract and deliver insights from the data. The platform provides a curated experience for each data persona, tailored to their daily workflows. The foundational technologies behind these experiences are open source – Apache Spark, Delta Lake, MLflow, and more.

So, what is the Lakehouse architecture and why do we need it?

The Lakehouse architecture was formally presented in a paper at the Conference on Innovative Data Systems Research (CIDR) in January 2021. You can download the paper from https://databricks.com/research/lakehouse-a-new-generation-of-open-platforms-that-unify-data-warehousing-and-advanced-analytics. It is easily digestible, and I encourage you to read it for the full details. That said, I will now summarize its salient points.

Attribution, Where it is Due

The images in my summary are recreated from the ones originally provided in the research paper. They remain the intellectual property of the paper's authors.

According to the paper, most present-day data analytics infrastructures resemble a two-tier system, as shown in the following diagram:

Figure 1.1 – Two-tier data analytics infrastructures


In this two-tier system, data from source systems is first brought into a data lake. Examples of source systems include web or mobile applications, transactional databases, ERP systems, and social media feeds. The data lake is typically an on-premises HDFS system or cloud object storage. Data lakes allow you to store data in big data-optimized file formats such as Apache Parquet, ORC, and Avro. These open file formats, combined with schema-on-read semantics, make writing to the data lake flexible. This flexibility enables faster ingestion of data, which, in turn, gives end users faster access to it. It also enables more advanced analytics use cases in ML and AI.
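
To make schema-on-read concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular data lake works: JSON lines stand in for Parquet, ORC, or Avro files so that only the standard library is needed, and the names `raw_lines` and `read_with_schema` are hypothetical.

```python
import json

# Hypothetical raw events as they might land in a data lake. JSON lines stand in
# for Parquet/ORC/Avro files so the sketch needs only the standard library.
raw_lines = [
    '{"user_id": 1, "event": "click"}',
    # A later record carries a new field; nothing on the write side had to change.
    '{"user_id": 2, "event": "purchase", "amount": 19.99}',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: keep the requested fields, default missing ones to None."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({field: record.get(field) for field in schema})
    return rows

# Two consumers read the same raw data with different schemas.
clicks_view = read_with_schema(raw_lines, ["user_id", "event"])
revenue_view = read_with_schema(raw_lines, ["user_id", "amount"])
```

The writer never declared a schema up front. Each reader decides which fields it cares about, which is why ingestion can move quickly.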

Of course, this architecture still needs to support the traditional BI workloads and decision support systems. Hence, a second process, typically in the form of Extract, Transform, and Load (ETL), is built to copy data from the data lake to a dedicated data warehouse.

Close inspection of the two-tier architecture reveals several systemic problems:

  • Duplication of data: This architecture requires the same data to be present in two different systems, which increases the cost of storage. The two systems must also be constantly reconciled, which results in more ETL operations and their associated costs.
  • Security and governance: Data lakes and data warehouses have very different approaches to the security of data. This results in different security mechanisms for the same data, and these mechanisms must always be kept in sync to avoid data security violations.
  • Latency in data availability: In the two-tier architecture, the data is only moved to the warehouse by a secondary process, which introduces latency. This means analysts do not get access to fresh data. It also makes the architecture unsuitable for tactical decision support such as operations.
  • Total cost of ownership: Enterprises end up paying double for the same data. There are two storage systems, two ETL processes, two engineering debts, and more.
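
The duplication and latency problems can be seen in a toy model of the two tiers. This is a rough sketch only: the in-memory lists `lake` and `warehouse` and the functions `ingest` and `run_etl` are hypothetical stand-ins for real storage systems and ETL jobs.

```python
lake = []        # tier 1: raw storage (the data lake)
warehouse = []   # tier 2: a separate copy for BI (the data warehouse)

def ingest(record):
    """Source systems write only to the lake."""
    lake.append(record)

def run_etl():
    """Copy lake records that the warehouse has not seen yet; return how many were copied."""
    new_rows = lake[len(warehouse):]
    warehouse.extend(dict(row) for row in new_rows)  # a second physical copy
    return len(new_rows)

ingest({"order_id": 1, "total": 40})
run_etl()                                  # the scheduled batch job runs
ingest({"order_id": 2, "total": 15})       # arrives after the batch

stale_rows = len(lake) - len(warehouse)    # rows analysts cannot see yet
stored_copies = len(lake) + len(warehouse) # synced rows are stored twice
```

Until `run_etl` runs again, the second order is invisible to warehouse users, while the first order already occupies storage in both tiers.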

As you can see, this is unintuitive and unsustainable.

Hence, the paper presents the Lakehouse architecture as the way forward.

Simply put, the data lakehouse architecture is a data management system that implements all the features of data warehouses on data lakes. This makes the data lakehouse a single unified platform for business intelligence and advanced analytics.

This means that the lakehouse platform will implement data management features such as security controls, ACID transaction guarantees, data versioning, and auditing. It will also implement query performance features such as indexing, caching, and query optimizations. These features are table stakes for data warehouses. The Lakehouse architecture brings them to the flexible, open format data storage of data lakes. A Lakehouse is a platform that provides data warehousing capabilities and advanced analytics capabilities on the same platform, with cloud data lake economics.
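
As a rough illustration of two of these data management features, atomic commits and data versioning, consider the following toy table. It is a sketch under simplifying assumptions and does not reflect how Delta Lake or any other real storage layer is implemented; the class `VersionedTable` and its methods are hypothetical.

```python
class VersionedTable:
    """Toy table in which every successful commit creates a new immutable snapshot."""

    def __init__(self):
        self._versions = [()]  # version 0 is the empty table

    def commit(self, new_rows, validate=lambda row: True):
        """Append rows atomically: either every row passes validation or nothing is written."""
        rows = list(new_rows)
        if not all(validate(row) for row in rows):
            raise ValueError("commit rejected; table unchanged")
        self._versions.append(self._versions[-1] + tuple(rows))
        return len(self._versions) - 1  # the new version number

    def read(self, version=None):
        """Read the latest snapshot, or an earlier one by version number."""
        if version is None:
            version = len(self._versions) - 1
        return list(self._versions[version])


table = VersionedTable()
v1 = table.commit([{"id": 1}, {"id": 2}])

try:
    # One bad row causes the whole batch to be rejected.
    table.commit([{"id": 3}, {"id": None}], validate=lambda r: r["id"] is not None)
except ValueError:
    pass

latest = table.read()           # only the rows from the successful commit
first = table.read(version=0)   # the table as it was before any commit
```

Readers never observe a half-written batch, and earlier versions stay readable, which is the behavior that auditing and reproducible reporting depend on.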

What is the Formal Definition of the Lakehouse?

Section 3 in the CIDR paper officially defines the Lakehouse. Check it out.

The following is a visual depiction of the Lakehouse:

Figure 1.2 – Lakehouse architecture


The idea of the Lakehouse is deceptively simple – as all good things in life are! The Lakehouse architecture immediately solves the problems we highlighted about present-day two-tier architectures:

  • A single storage layer means no duplication of data and no extra effort to reconcile data. Reduced ETL requirements and ACID guarantees translate into a more stable and reliable system.
  • A single storage layer means a single model of security and governance for all data assets. This reduces the risk of security breaches.
  • A single storage layer means the availability of the freshest data possible for the consumers of the data.
  • Cheap cloud storage with elastic, on-demand cloud compute reduces the total cost of ownership.
  • Open source technologies in the storage layer reduce the chances of vendor lock-in and make it easy to integrate with other tools.

Of course, any implementation of the Lakehouse will have to ensure the following:

  • Reliable data management: The Lakehouse proposes to eliminate (or reduce) data warehouses. Hence, the Lakehouse implementation must efficiently implement data management and governance – features that are table stakes in data warehouses.
  • SQL performance: The Lakehouse will have to provide state-of-the-art SQL performance on top of the open-access filesystems and file formats typical in data lakes.
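
One common way to speed up SQL on open file formats is to keep per-file statistics and skip files that cannot contain matching rows. The sketch below illustrates that general idea only. It is not a description of the Databricks SQL engine, and the `DataFile` class, the `scan` function, and the sample data are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    name: str
    min_year: int   # smallest value of the "year" column in this file
    max_year: int   # largest value of the "year" column in this file
    rows: list

def scan(files, year):
    """Return the matching rows and the names of the files that were actually read."""
    read_files, matches = [], []
    for f in files:
        if year < f.min_year or year > f.max_year:
            continue  # skipped using the statistics alone; the rows are never touched
        read_files.append(f.name)
        matches.extend(row for row in f.rows if row["year"] == year)
    return matches, read_files

files = [
    DataFile("part-0", 2019, 2020, [{"year": 2019, "sales": 10}, {"year": 2020, "sales": 12}]),
    DataFile("part-1", 2021, 2022, [{"year": 2021, "sales": 7}, {"year": 2022, "sales": 9}]),
]

matches, read_files = scan(files, 2022)
```

A query for 2022 reads only `part-1`. On a real data lake with many large files, avoiding such reads is one of the levers behind warehouse-grade query performance.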

This is where the Databricks Lakehouse platform, and within it, the Databricks SQL product, comes in.
