You're reading from Data Observability for Data Engineering Proactive strategies for ensuring data accuracy and addressing broken data pipelines

Product type Paperback

Published in Dec 2023

Publisher Packt

ISBN-13 9781804616024

Length 228 pages

Edition 1st Edition

Languages

Python

Tools

SQL Server

Concepts

Data Engineering

Authors (2):

Michele Pinto

Sammy El Khammal

Alerting on data quality issues

Once indicators have been defined, it is important to set up systems to control and assess the quality of the data as per these indicators. An easy way to do so is by testing the quality with rules.

An indicator reflects the state of the system and is a proxy for one or several dimensions of data quality. Rules create a link between the indicators and the objective(s). For a producer, violating one of these rules is equivalent to failing an objective. An alerting system places the responsibility for detecting data quality issues in the producer’s hands.

A single indicator can feed several rules. Indicators can also be tracked over time to create rules based on their variation.

Using indicators to create rules

Collecting indicators is the first step toward monitoring. After that, to prevent data quality issues, you need to understand what normal looks like for your data.

Indicators reflect the current state of the dataset. An indicator is a way of measuring data quality, but it does not assess the quality per se. To do so, you need to decide whether the value of the indicator is acceptable or not.

Moreover, data quality indicators can also be used to prevent further issues and define other objectives linked to the agreement. Using lineage, you can determine whether a change in an indicator upstream in the data flow can have an impact on the SLA you want to support.

Rules can be established based on one indicator, several indicators, or even the observation of an indicator over time.

Rules using standalone indicators

A single indicator can be enough to detect a major issue. An easy way to create a rule on an indicator is to set an acceptable range for it. If the data item represents an age, you can set rules on the minimum and maximum distribution indicators. You probably don’t want the minimum age to be a negative number, and you don’t want the maximum to be implausibly high (let’s say more than 115 years old).
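As an illustration, a range rule on the minimum and maximum indicators could look like the following minimal sketch. The function name `check_age_range` and the sample values are hypothetical; only the bounds (0 and 115) come from the example above.

```python
def check_age_range(ages, min_allowed=0, max_allowed=115):
    """Evaluate a range rule on the min and max indicators of an age column.

    Returns a (passed, indicators) tuple. Null values are ignored here,
    since they belong to a separate completeness rule.
    """
    values = [age for age in ages if age is not None]
    if not values:
        # Nothing to measure: treat the rule as failed and say why.
        return False, {"reason": "no non-null values"}

    # Step 1: compute the indicators (they only describe the data).
    indicators = {"min": min(values), "max": max(values)}

    # Step 2: apply the rule (this is what assesses the quality).
    passed = indicators["min"] >= min_allowed and indicators["max"] <= max_allowed
    return passed, indicators


ok, indicators = check_age_range([23, 41, None, 67, 35])
bad, bad_indicators = check_age_range([23, -4, 130])
print(ok, indicators)
print(bad, bad_indicators)
```

Keeping the indicator computation separate from the rule makes it possible to reuse the same indicators in other rules.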

Rules using a combination of indicators

Several indicators can be used in a single rule to support an objective. The missing value indicator of a column can be influenced by missing values elsewhere. In a CRM, the Full Name column can be a combination of the First Name and Surname columns. If one row of the First Name column is empty, there is a high probability that Full Name will also be incomplete for that row. In that case, setting a rule on the First Name column also helps ensure that the completeness objective of Full Name is fulfilled.
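The CRM example can be sketched as a rule that combines the missing value indicators of two columns. The column names (`first_name`, `surname`), the helper names, and the sample rows are assumptions made for illustration only.

```python
def missing_ratio(rows, column):
    """Indicator: share of rows where `column` is None or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)


def full_name_completeness_rule(rows, max_missing=0.0):
    """Rule combining two indicators to protect the Full Name objective.

    Full Name is derived from First Name and Surname, so the rule passes
    only if both source columns stay within the allowed missing ratio.
    """
    indicators = {
        column: missing_ratio(rows, column)
        for column in ("first_name", "surname")
    }
    passed = all(ratio <= max_missing for ratio in indicators.values())
    return passed, indicators


crm_rows = [
    {"first_name": "Ada", "surname": "Lovelace"},
    {"first_name": "", "surname": "Hopper"},
    {"first_name": "Alan", "surname": "Turing"},
    {"first_name": "Edsger", "surname": "Dijkstra"},
]
passed, indicators = full_name_completeness_rule(crm_rows)
print(passed, indicators)
```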

Rules based on a time series of indicators

The variation of an indicator over time should also be taken into consideration. This can be valuable in the completeness dimension, for instance. The volume of data, which is the number of rows you process in the application, can vary over months. This variation can be monitored, alerting the producer if there is a drop of more than 20% in the number of rows.
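A rule on the variation of the volume indicator might be sketched as follows. The function name and the row counts are hypothetical; the 20% threshold is the one mentioned above, and comparing only against the previous observation is a simplifying assumption.

```python
def volume_drop_alert(previous_rows, current_rows, max_drop=0.20):
    """Rule on the variation of the row-count indicator between two runs.

    Returns (alert, drop), where drop is the relative decrease compared
    with the previous observation (negative if the volume increased).
    """
    if previous_rows <= 0:
        # No usable baseline, so the variation cannot be evaluated.
        return False, 0.0
    drop = (previous_rows - current_rows) / previous_rows
    return drop > max_drop, drop


alert, drop = volume_drop_alert(previous_rows=1000, current_rows=750)
print(alert, drop)
```

In practice, the baseline could also be an average over several past periods rather than a single previous value, which would make the rule less sensitive to one-off fluctuations.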

Rules should be the starting point of alerts. In turn, these alerts can be used both to detect and to prevent issues. When a rule detects an issue, it helps build a trust relationship with the consumer, as they will be able to assess the quality of the data before using it.

The data scorecard

Once rules are associated with the data source, indicators and rules can be used to create a non-subjective data scorecard. This scorecard is an easy way for the business team to assess the quality of the data comprehensively. The scorecard can help the consumer share their issues with the producers, avoiding the traditional "My data is broken, please fix it!" complaint. Instead, the consumer can state the reason for the failing job: "I’ve noticed a drop in the quality of my dataset: the percentage of null rows in the Age column exceeds 3%." It also helps the consumer understand the magnitude of the problem, and the producer to prioritize their work. You won’t react the same way to 2% of null rows as you would to a jump of more than 300% in the number of missing values.

The primary advantage of a scorecard is that it aims to increase the trustworthiness of the dataset. Even if the score is not the best one today, the user is reassured and knows that an issue will be detected by the producer themselves. As a result, the producer’s reputation improves. Creating a scorecard for the datasets you produce demonstrates your data maturity. It promotes a culture of continuous improvement within the team and organization.

Also, a scorecard helps in assessing data quality issues. By assigning weights to different dimensions of data quality, the scorecard allows you to prioritize aspects of data quality to ensure the most critical dimensions get the necessary focus.

This scorecard can be created per data usage, which means that a score can be associated with each SLA. We suggest that the scorecard mirror the data quality dimensions. Let’s look at some techniques for creating such a scorecard.

Creating a scorecard – the naïve way

To start with an easy implementation of the scorecard, you can use a percentage of the number of rules met (rules not being broken) over the total number of rules. This gives you a number between 0 and 100 and tells you how the data source behaved when it was last assessed.
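The naïve score can be computed in a few lines. In this sketch, the rule names and their outcomes are made up for illustration; `True` means the rule was met at the last assessment.

```python
def naive_score(rule_results):
    """Percentage of rules met over the total number of rules (0 to 100)."""
    if not rule_results:
        raise ValueError("at least one rule is required to compute a score")
    rules_met = sum(1 for met in rule_results.values() if met)
    return 100 * rules_met / len(rule_results)


# Hypothetical outcome of the last assessment of a data source.
last_assessment = {
    "age_within_range": True,
    "first_name_not_null": True,
    "volume_drop_below_20_percent": False,
    "surname_not_null": True,
}
score = naive_score(last_assessment)
print(score)  # 75.0
```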

Creating a scorecard – the extensive way

This scorecard is created based on one or several dimensions of data quality. For each dimension and each SLA, you can compute the score of the rules used to test those dimensions. To do so, follow these steps:

1. Identify the data quality dimensions relevant to the data source.
2. Assign a weight to each quality dimension based on its importance to the business objectives.
3. Define rules for each dimension you want to cover while considering the requirements of the SLAs.
4. Compute a score for each rule. You can use the naïve approach of counting the number of respected rules or you can use a more sophisticated approach.
5. Compute the weighted score by multiplying the score of each rule by the weight of its corresponding dimension and summing all the results.
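The steps above can be sketched as follows. This is one possible interpretation, not a definitive implementation: the score of each dimension is assumed to be the naïve percentage of its rules that are met, and the weights are normalized so that the final score stays between 0 and 100. The dimension names, weights, and rule outcomes are hypothetical.

```python
def weighted_scorecard(weights, rule_results):
    """Weighted data quality score across dimensions (0 to 100).

    weights:      {dimension: weight}, reflecting business importance
    rule_results: {dimension: [bool, ...]}, True when a rule is met
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    score = 0.0
    for dimension, weight in weights.items():
        results = rule_results[dimension]
        # Naïve per-dimension score: share of rules met, as a percentage.
        dimension_score = 100 * sum(results) / len(results)
        # Normalizing the weight keeps the total on a 0-100 scale.
        score += dimension_score * weight / total_weight
    return score


weights = {"completeness": 0.5, "accuracy": 0.25, "timeliness": 0.25}
rule_results = {
    "completeness": [True, True],
    "accuracy": [True, False],
    "timeliness": [False],
}
score = weighted_scorecard(weights, rule_results)
print(score)  # 62.5
```

Computing one such score per SLA gives each data usage its own scorecard, since both the relevant rules and the weights can differ from one agreement to another.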

By visualizing and tracking the resulting scores, you can easily share them with your stakeholders, compare data sources with each other, and detect trends and patterns.

Let’s summarize what we’ve learned in this chapter.