Learning about output partitions
Saving data partitioned on the right column(s) can significantly boost performance when you later read and retrieve it for further processing.
Reading only the required partitions limits the number of files and directories that Spark has to scan while querying the data. A partitioned layout also enables dynamic partition pruning.
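As a minimal PySpark sketch of this pattern, the snippet below writes data partitioned by a date column and then reads it back with a filter on that column; the paths and the events/event_date names are hypothetical, not taken from the text:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-pruning-demo").getOrCreate()

events = spark.read.parquet("/data/raw/events")  # hypothetical source path

# Write the data partitioned by a column commonly used in query filters.
(events.write
    .partitionBy("event_date")
    .mode("overwrite")
    .parquet("/data/partitioned/events"))

# Filtering on the partition column lets Spark skip every directory that
# doesn't match, instead of scanning all files.
one_day = (spark.read.parquet("/data/partitioned/events")
    .filter("event_date = '2023-01-15'"))
one_day.explain()  # the plan should list event_date under PartitionFilters
```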
But sometimes, too much optimization can make things worse. If you create too many partitions, the data is scattered across many small files, so searching for a particular condition in the initial query can take longer. Memory utilization also increases while the table's metadata is processed, because the metadata grows with the number of partitions.
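One way to spot over-partitioning before it hurts is to inspect how the DataFrame is split in memory. This sketch assumes the events DataFrame from the example above:

```python
from pyspark.sql import functions as F

# Roughly one output file is produced per in-memory partition
# (per output directory), so a high count here means many small files.
print(events.rdd.getNumPartitions())

# Row counts per partition; many near-empty partitions suggest
# the data is more fragmented than it needs to be.
(events.groupBy(F.spark_partition_id().alias("partition_id"))
    .count()
    .orderBy("partition_id")
    .show())
```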
While saving in-memory data to disk, you must consider the partition sizes, because Spark produces a file for each task. Let's consider a scenario: if the cluster configuration has more memory for processing the DataFrame and saving it as larger partition sizes, then processing the same data...
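As an illustration of controlling partition sizes before a write, the sketch below uses coalesce and repartition; the partition counts (8 and 32) are assumptions to tune against your data and cluster, not values from the text:

```python
# coalesce() reduces the partition count without a shuffle -- useful for
# merging many small partitions into fewer, larger output files.
events.coalesce(8).write.mode("overwrite").parquet("/data/out/coalesced")

# repartition() performs a shuffle, but distributes rows evenly and can
# co-locate rows by column so each task writes to fewer directories.
(events.repartition(32, "event_date")
    .write
    .partitionBy("event_date")
    .mode("overwrite")
    .parquet("/data/out/repartitioned"))
```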