Making your cluster highly available
Historically, AWS EMR clusters were used for batch workloads and torn down afterward; failures in worker nodes were absorbed by the built-in resiliency of YARN and HDFS. The primary node (previously called the master node) remained a single point of failure, but in the unlikely event that it failed, the whole batch process could simply be retried.
Hadoop has supported full High Availability (HA) since its early days, as it was designed to run on long-lived on-premises clusters shared by many users and teams.
Since release 5.23, EMR supports running a cluster with multiple primary nodes, and it takes care of the tedious process of correctly configuring Hadoop for HA. Over time, AWS has also improved the automatic replacement of a failed primary node and the reconfiguration of the system, so the cluster can gracefully survive the failure of a single primary node with minimal or no disruption to its users.
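As a minimal sketch of what this looks like in practice, an HA cluster is requested simply by asking for three primary (MASTER) instances at creation time; EMR then wires up the HA configuration itself. The instance types, counts, release label, and key name below are placeholder assumptions, not a recommendation:

```shell
# Sketch: create an EMR cluster with 3 primary nodes (HA mode).
# Instance types/counts, release label, and key name are illustrative.
aws emr create-cluster \
  --name "ha-cluster" \
  --release-label emr-5.36.0 \
  --applications Name=Hadoop Name=Spark \
  --ec2-attributes KeyName=my-key-pair \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=3,InstanceType=m5.xlarge \
    InstanceGroupType=CORE,InstanceCount=4,InstanceType=m5.xlarge \
  --use-default-roles
```

Note that `InstanceCount=3` for the MASTER group is what opts the cluster into HA: with one primary node you get the classic single-point-of-failure layout, and EMR only supports exactly one or exactly three primary nodes.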
HA is important in cases where delays have a business impact...