Introducing Hadoop
Hadoop was released in 2006 and was a revolutionary concept of using hundreds, thousands, or even more compute nodes to solve and crunch big datasets. This framework was a great fit to apply to commodity resources such as old aging servers; companies that may have had a fleet of aging hardware could repurpose them with this framework to be powerful data processing clusters. This is also a great fit for AWS as there are massive amounts of unused capacity that have to be available for demand spikes. Thus, EMR was born and is commonly used with EC2 Spot Instances, which is discounted compute resources that are older and less used capacity. Spot instances can have savings of up to 90% off their on-demand prices but can be revoked by Amazon at any time, whereby you are given a 1-hour notification. This makes them a great option for Hadoop, which can restart tasks if a node is removed.
Introduction to EMR
Let’s dive into EMR and why it is such a powerful service...