Creating data quality for ETL jobs in AWS Glue Studio notebooks
In the Applying a data quality check on Glue tables recipe in Chapter 6, Governing Your Platform, we learned how to set up a ruleset for a Glue pipeline. In this recipe, we will dive deeper into how to use Glue Studio notebooks to build a Data Quality template. A notebook is useful here because you can see the output of each check alongside the dataset that you are testing. We will also introduce caching and show how to produce row-level and rule-level outputs. The rule-level output summarizes the outcome of each rule across the dataset, while the row-level output is suited to identifying which data quality rules each individual record violates.
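To make the distinction between the two outputs concrete before we turn to Glue itself, here is a minimal plain-Python sketch. It does not call any AWS Glue API. The sample records, rule names, and output keys (rules_pass, rules_fail, result, failed_rows) are hypothetical and chosen only for illustration.

```python
# Plain-Python illustration only: this does not call AWS Glue.
# The sample records, rule names, and output keys are hypothetical.

records = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": -5.0},
    {"order_id": None, "amount": 12.5},
]

# Each rule is a name plus a predicate evaluated against a single record.
rules = {
    "order_id_is_complete": lambda r: r["order_id"] is not None,
    "amount_is_non_negative": lambda r: r["amount"] >= 0,
}

# Row-level output: every record annotated with the rules it passed and failed.
row_level = []
for record in records:
    passed = [name for name, check in rules.items() if check(record)]
    failed = [name for name in rules if name not in passed]
    row_level.append({
        **record,
        "rules_pass": passed,
        "rules_fail": failed,
        "result": "Passed" if not failed else "Failed",
    })

# Rule-level output: one summary entry per rule across the whole dataset.
rule_level = []
for name in rules:
    failures = sum(1 for row in row_level if name in row["rules_fail"])
    rule_level.append({
        "rule": name,
        "outcome": "Passed" if failures == 0 else "Failed",
        "failed_rows": failures,
    })

for row in row_level:
    print(row)
for summary in rule_level:
    print(summary)
```

The row-level list keeps one entry per input record, so failing records can be filtered out or routed elsewhere. The rule-level list has one entry per rule, which is the shape you would use to decide whether the dataset as a whole is acceptable.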
Getting ready
Before proceeding with this recipe, go through the Applying a data quality check on Glue tables recipe in Chapter 6, Governing Your Platform, and ensure that you have basic knowledge of how Glue works as covered in Chapter 3, Ingesting and Transforming Your Data with AWS Glue. In this recipe, we will provide the code to run the quality check scenarios...