Handling a high quantity of small files in your job
Frequently, the data you ingest is not optimized at the source and arrives as many tiny files, for example because it was produced at short intervals or by many independent sources, such as individual sensors each sending their own reports. Apache Spark was designed as a big data tool and struggles with this pattern: processing an excessive number of partitions is inefficient, and the driver can run into memory issues just building the execution plan.
To handle data efficiently, we want to consolidate small files so that reading is more efficient, especially when using a columnar format such as Parquet; as a rule of thumb, aim for at least 100 MB per file. The simple way to control this is to repartition or coalesce the data down to the target number of output files, but repartitioning requires a costly shuffle, and coalescing, while it avoids the shuffle, can reduce the parallelism of the stages that came before it.
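For reference, this is roughly what that plain Spark approach looks like (a minimal sketch; the S3 paths and the target of 10 output files are placeholder values):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

# Read the many small source files (placeholder path).
df = spark.read.parquet("s3://my-bucket/raw/sensor-data/")

# Option 1: repartition to exactly 10 output files; rows are redistributed
# evenly, but this triggers a full shuffle.
df.repartition(10).write.mode("overwrite").parquet("s3://my-bucket/compacted/")

# Option 2: coalesce merges existing partitions without a shuffle,
# but it can reduce the parallelism of the stages that produce df.
df.coalesce(10).write.mode("overwrite").parquet("s3://my-bucket/compacted/")
```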
In this recipe, you will see a simple and effective mechanism provided by AWS Glue to group small files together at read time, avoiding that extra shuffle entirely.
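As a preview of what the recipe walks through, Glue exposes this grouping through reader options on the S3 source of a DynamicFrame. The following is a minimal sketch only, assuming an existing Glue job environment; the bucket path and group size are illustrative placeholders:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read many small JSON files as one DynamicFrame, letting Glue group them
# into ~100 MB chunks per task instead of one task per tiny file.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/raw/sensor-data/"],  # placeholder path
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "104857600",  # target group size in bytes (~100 MB)
    },
    format="json",
)

# Fewer, larger partitions than the raw file count.
print(dyf.toDF().rdd.getNumPartitions())
```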
Getting ready
This recipe...