Handling job failures and reruns for partial results
When building ETL pipelines (with Glue or in general), it's important to consider the scenarios that could make a job fail and how to deal with them. Ideally, recovery should be automatic, at least for transient issues, but regardless of the recovery method, the most important requirement is that a failed job doesn't cause permanent data loss or data duplication. In traditional databases, this is solved using transactions, but in big data ETL, transactions are rarely an option or would add too much overhead.
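A common alternative to transactions is to make each run idempotent: a rerun deletes any partial output from a failed attempt and rewrites the affected partition in full, so repeating the job can't duplicate data. The following is a minimal sketch of that pattern using the local filesystem as a stand-in for S3; the function and path names are illustrative, not part of the recipe.

```python
import shutil
from pathlib import Path


def write_partition(output_root: Path, partition: str, rows: list[str]) -> None:
    """Idempotent partition write: discard any partial output from a failed
    run, build the new output in a temporary directory, and only then
    publish it under the final partition path. Rerunning the job produces
    the same result instead of duplicated rows."""
    part_dir = output_root / partition
    tmp_dir = part_dir.parent / (part_dir.name + ".tmp")

    # Remove leftovers from a previous (possibly failed) attempt.
    if part_dir.exists():
        shutil.rmtree(part_dir)
    if tmp_dir.exists():
        shutil.rmtree(tmp_dir)

    # Write the complete output to a staging directory first.
    tmp_dir.mkdir(parents=True)
    (tmp_dir / "part-00000.csv").write_text("\n".join(rows))

    # Publish only once the output is complete.
    tmp_dir.rename(part_dir)
```

Running the function twice for the same partition leaves exactly one copy of the data, which is the property we want when a scheduler retries a failed job.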
In this recipe, you will see how to deal with job failures and resulting partial results.
Getting ready
This recipe requires a bash shell with the AWS CLI installed and configured, and the GLUE_ROLE_ARN and GLUE_BUCKET environment variables set, as indicated in the Technical requirements section at the beginning of the chapter.
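Before running any of the commands below, it can help to confirm the prerequisites are in place. This is a small bash sketch (not part of the recipe) that checks the two variables named above:

```shell
# Sanity-check the environment this recipe assumes (variable names come
# from the chapter's Technical requirements; this helper is illustrative).
check_glue_env() {
  local var missing=0
  for var in GLUE_ROLE_ARN GLUE_BUCKET; do
    if [ -z "${!var}" ]; then
      echo "Missing required variable: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `check_glue_env` returns a non-zero status if either variable is unset, which makes it easy to abort early in a script with `check_glue_env || exit 1`.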
How to do it…
- Create a job script as...