You're reading from Modern Data Architectures with Python A practical guide to building and deploying data pipelines, data warehouses, and data lakes with Python

Product type Paperback

Published in Sep 2023

Publisher Packt

ISBN-13 9781801070492

Length 318 pages

Edition 1st Edition

Languages

Python

Tools

MLflow

Concepts

Data Science

Author (1):

Brian Lipp

View More author details

Table of Contents (19) Chapters

Preface

1. Part 1:Fundamental Data Knowledge

2. Chapter 1: Modern Data Processing Architecture FREE CHAPTER

3. Chapter 2: Understanding Data Analytics

4. Part 2: Data Engineering Toolset

5. Chapter 3: Apache Spark Deep Dive

6. Chapter 4: Batch and Stream Data Processing Using PySpark

7. Chapter 5: Streaming Data with Kafka

8. Part 3:Modernizing the Data Platform

9. Chapter 6: MLOps

10. Chapter 7: Data and Information Visualization

11. Chapter 8: Integrating Continous Integration into Your Workflow

12. Chapter 9: Orchestrating Your Data Workflows

13. Part 4:Hands-on Project

14. Chapter 10: Data Governance

15. Chapter 11: Building out the Groundwork

16. Chapter 12: Completing Our Project

17. Index

Why subscribe?

18. Other Books You May Enjoy

Databases, data warehouses, and data lakes

The history of data processing is long and has had several unique innovations. As a result, we can look around today and see several patterns in our offices – everything from data stored as a file on a network drive, local hard drive, or technology such as S3 to relational database management systems (RDBMSes) and data warehouses. We will now go through the major types of data storage systems and see some of the benefits and disadvantages of each.

OLTP

When people think of storing data, the first thing that always comes to mind is the traditional relational database. These databases are designed to process and store short data interactions reliably and consistently. On the other hand, online transaction processing (OLTP) systems are very good at handling small interactions, called create, read, update, and delete (CRUD). OLTP systems come in two primary flavors: relational and NoSQL. We will cover the details of each type later, but for now, we should simply understand that within these two types of data stores, we can use any data that has some amount of structure to it. A classic example of a use case for OLTP would be a web application for a grocery store. These types of data actions are small and quick. Therefore, we would typically not see a much-extended workload on an OLTP system in typical data processing usage. Some examples of OLTP systems are MongoDB, Microsoft SQL, PostgreSQL, and CockroachDB, among others. In other words, a data storage system that is used to run a business on a day-to-day basis is an OLTP system. These have frequent insert, update, and delete operations and we are more interested in the throughput of these systems compared to their response time from a performance perspective. Most of these OLTP systems will be ACID-compliant (that is, atomicity, consistency, isolation, and duration).

OLAP

On the other side of the spectrum, we have online analytical processing (OLAP) systems, which are better designed to handle intense data processing workloads. We will avoid being pedantic about what precisely an OLAP is and instead paint a broad picture – two examples are an OLAP data warehouse and a lakehouse. A data warehouse, at its core, is simply an OLAP system that stores and curates data using data warehousing techniques. Data warehouses are trusted sources of data for decision-making. More often than not, a data warehouse will only store structured data. On the other hand, a data lake is a form of storage that stores data in its most native format. Data lakes can also serve as entry points to data warehouses. We will cover lakehouses in more detail later, but they can be understood as hybrids of data warehouses and data lakes.

Data lakes

So, we spoke of data warehouses and databases, which use data that follows a basic structure or schema. Since the cost of disk storage has reduced over time, the desire to keep more and more data has become a reality. This data began to take on less structure, such as audio, video, and text documents. Our current OLTP and OLAP systems were not the ideal tools for enormous amounts of data or data with a wide variety of structures. The data lake emerged as a way to store all the data in systems such as HDFS and AWS S3. Data lakes typically have schema on read, whereas OLTP and OLAP systems generally are schema on read and write. This wide range of flexibility often leads to a data dumping ground or a data swamp. The conventional data warehouses were monolithic, on-premises systems with the compute and storage combined. With the advent of big data, we saw distributed computing, where data was spilt across multiple machines. However, each machine still had combined compute and storage. With the advent of the cloud, this paradigm of splitting both the compute and storage across machines enabled by distributed computing took effect. This is much more efficient. Moreover, on the cloud, the services we use and therefore this OPEX model worked better than the conventional CAPEX model, which was a dead-cost investment.

With the emergence of data lakes came large-scale data processing engines such as Apache Hadoop, Apache Spark, and Apache Flink, to name a few. The most crucial detail to understand about this type of technology is the separation of the storage layer and the compute layer. This pattern exists in all systems designed to handle big data. You may not even know that a system uses this pattern, as with Snowflake or Big Query. There are both significant benefits and negative considerations regarding this type of data processing.

There is one universal rule when understanding data processing – the costs. Moving data is very expensive. Moving data from processor to disk is expensive but moving it across the network is exponentially costly. This must be a design consideration when you’re picking your stack. There are situations where that cost is acceptable, and your application is okay with waiting a few extra seconds. This is one of the reasons you don’t typically see decoupled storage and compute patterns with CRUD web applications.

In the following diagram, the left shows a typical data warehouse or database that has everything built into one system – that is, the storage and compute engines are together. The right-hand side of the diagram shows that they are decoupled, meaning they’re separate systems:

Figure 1.1: Storage and compute

Event stores

Another trend in data storage is using systems such as Kafka, a distributed event store and streaming processing engine. Event stores can be considered data stores with several logs that can be watched or read from start to finish. Event stores are often associated with real-time processing. The term real-time is often used to describe data that is flowing in a relatively real-time-like process. Real-time data is used in many data platforms and can come with its own set of complexities and problems. We will provide a whole chapter on streaming data using both Spark and Kafka. For now, it’s enough to understand that real-time data attempts to store, process, and access data as soon as it’s recorded.

File formats

Whenever you’re working with data, you will inevitably want to save it in a file. The question is, which file should you choose? Some formats are ideal for short-term office work. Other formats, on the other hand, are designed for long-term storage. When you’re selecting what file format to use, first, consider the use case for the file and how long you will keep it.

CSV

Here is an example of a CSV file. In this case, I am using , for the column delimiter and \\n for the line delimiter:

id,first_name,last_name,email\\n
1,Alyce,Creeber,[email protected]\\n
2,Gladi,Fenney,[email protected]\\n
3,Mendy,Papen,[email protected]\\n
4,Gerri,Kernan,[email protected]\\n
5,Luca,Skeen,[email protected]\\n

Comma-separated values (CSV) files are the most common data files that are found in an office (outside of an Excel file). CSV is a text format that uses a set of characters to identify column cells and end-of-line characters. CSVs are used for structured data. Sometimes, you can create semi-structured scenarios, but it’s not very effective. CSVs are text files, so the file’s structure can often have varying characteristics such as headers, banners, or free-form text. CSVs are an excellent short-term office file format for doing work. However, CSVs don’t provide data types, which, combined with the other issues mentioned previously, makes them a terrible choice for long-term storage.

JSON

Here, we have an example JavaScript Object Notation (JSON) object with three objects inside the parent object:

[{
  "id": 1,
  "first_name": "Ermanno",
  "last_name": "Marconi",
  "email": "[email protected]"
}, {
  "id": 2,
  "first_name": "Lory",
  "last_name": "Maskew",
  "email": "[email protected]"
}, {
  "id": 3,
  "first_name": "Karee",
  "last_name": "Hubbucke",
  "email": "[email protected]"
}]

JSON is a plain text-based format that uses strict syntax to define semi-structured or structured data. In the preceding example, we are mimicking structured data, but we could have nested more objects or used an array. JSON format overcomes the issue of the poorly defined syntax of CSVs, which makes it a more ideal choice for most use cases. However, JSON is not my first choice for long-term storage because it lacks data types and contains no compression.

Avro

Here is an example schema file for the Avro format. Notice how we have two “columns” of data called Make and ID in the awesome_startup namespace:

{
   "type" : "record",
   "namespace" : "awesome_startup",
   "name" : "cars",
   "fields" : [
      { "name" : "Make" , "type" : "string" },
      { "name" : "ID" , "type" : "int" }
   ]
}

Avro is a popular open standard protocol often found in data lakes. Avro serializes data, making it small and compact compared to formats such as JSON. Avro supports both structured and semi-structured data like JSON. Avro has a secondary file that is in JSON format that defines the data types and structure of your data. With this schema file, you can evolve your schema by making changes but keep backward compatibility. This is a huge advantage when it comes to long-term data storage. Avro is designed to be accessed row by row or row storage. Row storage is ideal for cases when you look up a row and read the whole row. Although Avro is a significant step up from JSON, it still lacks in several ways. What happens when the schema file is lost? Typically, the data is unusable, which is less than ideal. Row storage is perfect for CRUD-style workflows, but many data-intense workflows will read a whole column at a time; this can be costly in Avro.

Parquet

Parquet has emerged as the best-of-breed file format in the majority of cases. Parquet supports semi-structured data and structured data like Avro. Parquet is an open standard format that, like Avro, serializes its data for small footprints. On the other hand, Parquet stores the schema within the file, which overcomes several shortcomings of Avro. Parquet, unlike row-oriented formats like Arvo, is column-oriented. This translates into faster data access and writing for most workflows.

Data platform architecture at a high level

What is data architecture, and why do I care? Data architecture is the process of designing and building complex data platforms. This involves taking a comprehensive view, which includes not only moving and storing data but also all aspects of the data platform. Building a well-designed data ecosystem can be transformative to a business.

What goes into architecting a data platform? In picking your solution, you will evaluate the price, cloud vendors, and multi-cloud options, among many other choices. Using a hosted option for a service might make sense, or it may be inappropriate. You might want to stick with one vendor for everything, or you may decide to get the best-of-breed technologies. Will you need streaming data? How will you manage your data? What is the long-term vision for the project? Is vendor and product lock-in something to consider? All these questions and more are answered by the data architect and the platform that gets designed.

I will now introduce a data platform reference architecture that is used to organize and better understand what exists in your data ecosystem.

Here is a data architecture reference diagram. I have put data governance, processing, and storage across all areas since they interact with everything:

Figure 1.2: Data platform architecture

Let’s break this architecture down.

Storage layer

In this layer, we include technologies that persist data in long-term storage, such as OLTP, OLAP, lakehouses, event stores, and data lakes. We also have file types such as Parquet, Avro, and CSV.

Ingestion layer

The ingestion layer focuses on taking your data from whatever source system it may live in, such as a social media site, and moving that data into the storage layer. This layer may include commercial products such as Fivetran or Stich, but often, it will also involve writing code to accomplish this task.

Analytics layer

In the analytics layer, we will see a variety of work that ranges from data science, machine learning, and artificial intelligence to graph analytics and statical analysis. The output from the analysis will be represented in a “view” that the consumption layer can access. You will also see data modeling in this layer. Data modeling is the process of building tables to better understand and represent your data from various perspectives.

Consumption layer

The consumption layer stores the output views created by the analytics layer. The technology chosen for this layer can vary, but one example might be a machine learning model that’s recorded and tracked in MLflow and stored in S3.

Serving layer

The serving layer consists of business intelligence (BI), dashboards, data visualizations, search engines, and other systems that use data products in the consumption layer.

Data governance layer

The data governance layer contains master data management, data quality enforcement, data security, data auditing, and metadata management. We will cover the fundamental concepts of this layer in a separate chapter.

Processing layer

The processing layer is the workhorse that handles moving data between all the other layers. Technologies such as Apache Spark, Flink, Snowflake, DBT, Dataflow, and Azure Data Factory handle this task.

Semantic view

The semantic view, also known as the virtual layer, is an optional abstraction in your data platform that allows you to decouple user access from data storage. In simple terms, users will have a view of the data stored in its “original” source, but without the need to access the source directly. Data could be stored in anything ranging from relational databases and data lakes to REST APIs. This virtual layer is then used to create new modeled data products. Users might access these data products from various sources, such as BI tooling, notebooks, or even internally developed applications using whatever access method is needed. These new data products are curated from different sources and are tailored to the needs of the users and the use cases. Often, standard terms are normalized across data products. In an ideal world, there is no storage in the semantic layer. However, you may need to store copies of the data for faster access, for example, if you’re integrating with another corporate application. There are several benefits to this added complexity, including central data governance and future proofing for any data storage solution changes.