Implementing the DAG
With the design in place, we can start implementing the system. Before building out the full DAG, there are a few basic capabilities we need to decide how to implement.
Determining whether data has changed
When a new dataset is uploaded to the remote site, a secondary file containing the MD5 hash of the dataset is published alongside it. That checksum changes only when the dataset itself changes, so our DAG will begin by downloading the checksum file and comparing its contents against a previously stored copy of the hash. If they don’t match, we’ll continue down the rest of the pipeline; otherwise, we’ll exit gracefully and check again next week:
from airflow.operators.python import BranchPythonOperator

def _data_is_new():
    # Compare the new MD5 hash against the stored copy and return
    # the task_id of the branch to follow.
    ...

data_is_new = BranchPythonOperator(
    task_id="data_is_new",
    python_callable=_data_is_new,
)