Tuning Apache Spark resource usage
Apache Spark is probably the most widely used framework on EMR. Spark on EMR runs on top of YARN: each executor runs in its own YARN container, and one additional container hosts the YARN ApplicationMaster (AM), which in cluster deploy mode also runs the Spark driver.
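For illustration, here is a minimal spark-submit sketch of that layout; the S3 path and the resource values are assumptions chosen only to show the shape of the command, not sizing recommendations. With --deploy-mode cluster, the driver runs inside the AM's container, while each executor gets its own container.

# A minimal sketch, not a sizing recommendation: submitting a Spark
# application on EMR over YARN. The S3 path and resource values are
# placeholders. In cluster deploy mode, the Spark driver runs inside the
# ApplicationMaster container; each executor runs in its own YARN container.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  s3://my-bucket/jobs/etl_job.py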
In a traditional on-premises cluster, resources are shared among many users, so each job should use as few resources as possible. On EMR, it is simple to spin up a purpose-built cluster on demand and shut it down when the job is complete. That way, the cluster size, node types, and configuration can be optimized for the specific job, and negative interactions between users, such as resource starvation or saturation, are avoided.
In such a dedicated cluster, you want your application, in this case Apache Spark, to make the most of the hardware provided, since it doesn't have to share it with other users. That's why EMR added the maximizeResourceAllocation
configuration option for...
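The option can be enabled when the cluster is created through the spark configuration classification. The sketch below uses the AWS CLI; the cluster name, release label, instance type, and instance count are illustrative assumptions.

# A sketch, assuming the AWS CLI is configured: enabling
# maximizeResourceAllocation through the "spark" classification at
# cluster-creation time. Name, release label, instance type, and count
# are placeholder values.
aws emr create-cluster \
  --name "spark-tuning-example" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations '[{"Classification":"spark","Properties":{"maximizeResourceAllocation":"true"}}]'

When set to true, EMR derives executor and driver memory and core defaults from the instance types in the cluster, which suits a cluster dedicated to a single Spark application.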