Schema management for ETL pipelines
In this recipe, we will learn how to perform schema validation using Apache Spark and AWS Glue. Schema validation is essential to ensure the consistency and integrity of your data as it moves through various stages of the data pipeline. By validating schemas, you can prevent data quality issues and ensure that downstream applications receive data in the expected format.
Without schema validation, problems surface only once the data reaches Redshift or Athena, where issues such as duplicate columns or mismatched data types cause load and query errors. Schema-on-read is a feature of the modern data lake, in contrast to the schema-on-write approach traditionally used in on-premises data warehouses. As data moves through the layers of the data lake, you typically define the schema and store it either in a JSON config file or in a database, so that the ETL pipeline can later use it to verify the schema of each incoming file.
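A minimal sketch of the config-driven check described above: the expected schema is loaded from a JSON definition and compared field by field against the schema observed in an incoming file. The function name, error messages, and JSON layout here are illustrative assumptions, not a Spark or Glue API.

```python
import json


def validate_schema(expected_fields, observed_fields):
    """Compare expected vs. observed {column: dtype} mappings.

    Returns a list of human-readable problems; an empty list means
    the file matches the stored schema definition.
    """
    problems = []
    for col, dtype in expected_fields.items():
        if col not in observed_fields:
            problems.append(f"missing column: {col}")
        elif observed_fields[col] != dtype:
            problems.append(
                f"type mismatch for {col}: expected {dtype}, "
                f"got {observed_fields[col]}"
            )
    for col in observed_fields:
        if col not in expected_fields:
            problems.append(f"unexpected column: {col}")
    return problems


# Expected schema as it might be stored in a JSON config file
expected = json.loads(
    '{"order_id": "bigint", "amount": "double", "ts": "timestamp"}'
)
# Observed schema of an incoming file, e.g. built from dict(df.dtypes)
observed = {"order_id": "bigint", "amount": "string"}
print(validate_schema(expected, observed))
```

With a Spark DataFrame, `dict(df.dtypes)` yields exactly the `{column: dtype}` mapping this sketch expects, so the same comparison can run inside a Glue job before writing to the next layer.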