Processing data incrementally using bookmarks and bounded execution
Data pipelines often need to process data that is generated continuously, so the ETL jobs have to run on a regular schedule. When the use (and extra cost) of streaming is not justified, for instance because the data is uploaded only once a day, bookmarks are a simple way to keep track of which files have already been processed and which are new since the last run. With bookmarks, a scheduled job processes only the data added since its previous run.
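To make the idea concrete, the following is a minimal conceptual sketch in plain Python of what a bookmark does: it remembers which files earlier runs handled and returns only the new ones. The `FileBookmark` class, the `run_job` function, and the file names are hypothetical and exist only for illustration; Glue manages its own bookmark state inside the service, and this is not how that state is implemented.

```python
from typing import Iterable, List, Set


class FileBookmark:
    """Toy stand-in for a job bookmark (hypothetical, not a Glue class)."""

    def __init__(self) -> None:
        self._processed: Set[str] = set()

    def new_files(self, available: Iterable[str]) -> List[str]:
        # Keep the input order and skip anything an earlier run committed.
        return [path for path in available if path not in self._processed]

    def commit(self, files: Iterable[str]) -> None:
        # Record the state only once the run has finished its work.
        self._processed.update(files)


def run_job(bookmark: FileBookmark, available: Iterable[str]) -> List[str]:
    """Simulate one scheduled run and return the files it processed."""
    batch = bookmark.new_files(available)
    # ... transform and write `batch` here ...
    bookmark.commit(batch)
    return batch


bookmark = FileBookmark()
first_run = run_job(bookmark, ["day1.csv", "day2.csv"])
second_run = run_job(bookmark, ["day1.csv", "day2.csv", "day3.csv"])
print(first_run)   # ['day1.csv', 'day2.csv']
print(second_run)  # ['day3.csv']
```

The second run sees all three files but processes only the one that arrived after the first run, which is the behavior a scheduled, bookmarked job relies on.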
In addition, Glue provides an optional feature called bounded execution, which limits the amount of data (by total size or by number of files) handled in each bookmarked run. This lets the job finish in a timely and predictable fashion, working on a volume of data that has already been tested and is known not to cause problems with memory, disk, or latency. It is useful when you are backloading a large amount of data or when new data arrives in bursts.
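As a rough sketch of how such a limit might be expressed, the helper below builds the options dictionary for a bounded read. The keys `boundedFiles` and `boundedSize` are the option names used in the Glue documentation for bounded execution; the helper `bounded_read_options`, its parameters, and the value 500 are hypothetical, and accepting exactly one limit is a simplification made for this example rather than a documented Glue rule.

```python
from typing import Dict, Optional


def bounded_read_options(max_files: Optional[int] = None,
                         max_bytes: Optional[int] = None) -> Dict[str, str]:
    """Build read options that cap one bookmarked run (hypothetical helper)."""
    # This sketch accepts exactly one limit to keep the example simple.
    if (max_files is None) == (max_bytes is None):
        raise ValueError("specify exactly one of max_files or max_bytes")
    limit = max_files if max_files is not None else max_bytes
    if limit <= 0:
        raise ValueError("the limit must be a positive integer")
    key = "boundedFiles" if max_files is not None else "boundedSize"
    # The limit is passed as a string, as in the Glue documentation examples.
    return {key: str(limit)}


options = bounded_read_options(max_files=500)
print(options)  # {'boundedFiles': '500'}

# In a Glue script, a dictionary like this would be passed as
# `additional_options` when reading the source, together with a
# `transformation_ctx` so that the bookmark can track the read.
```

Choosing the limit is a tuning decision: it should match a volume the job has already been tested with, so that each run stays within known memory, disk, and latency bounds.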