Running Big Data Workloads with Amazon EMR
Amazon EMR is a managed service for running big data frameworks from the Apache Hadoop ecosystem, such as Apache Spark and Apache Hive. It provisions clusters on which data applications can process large volumes of data in a distributed, scalable way.
EMR removes the complexity of deploying, configuring, and coordinating these open source frameworks and tools so that they work together, letting you start using them right away. Each EMR release documents the frameworks it includes and the exact version of each.
Unlike most other AWS managed services, EMR gives you full control over and visibility into your cluster: which hardware it runs on, which EC2 image it uses, what software is installed, and even root access to the cluster nodes (except when you run on EMR Serverless). The recipes in this chapter will help you learn EMR's capabilities and how to make the best use of them.
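To make those control points concrete, here is a minimal sketch of how they might map onto a cluster definition. It only builds a dictionary in the shape accepted by the boto3 EMR `run_job_flow` call; it does not contact AWS. The helper name `build_cluster_request`, the cluster name, release label, instance type, AMI ID, and S3 path are all illustrative assumptions, not values taken from this chapter.

```python
from typing import Any, Dict, List, Optional


def build_cluster_request(
    name: str,
    release_label: str,
    applications: List[str],
    instance_type: str,
    core_count: int,
    custom_ami_id: Optional[str] = None,
    bootstrap_script: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble keyword arguments for run_job_flow."""
    request: Dict[str, Any] = {
        "Name": name,
        # The EMR release determines which framework versions are available.
        "ReleaseLabel": release_label,
        # Which frameworks to install on the cluster.
        "Applications": [{"Name": app} for app in applications],
        # Which hardware to run: one primary node plus a group of core nodes.
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Primary",
                    "InstanceRole": "MASTER",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": instance_type,
                    "InstanceCount": core_count,
                },
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Default role names; assumed to already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
    if custom_ami_id is not None:
        # Which EC2 image to use instead of the default EMR image.
        request["CustomAmiId"] = custom_ami_id
    if bootstrap_script is not None:
        # Extra software to install, via a script run on every node at startup.
        request["BootstrapActions"] = [
            {
                "Name": "Install extras",
                "ScriptBootstrapAction": {"Path": bootstrap_script},
            }
        ]
    return request


# Placeholder values for illustration only.
request = build_cluster_request(
    name="example-cluster",
    release_label="emr-6.15.0",
    applications=["Spark", "Hive"],
    instance_type="m5.xlarge",
    core_count=2,
    custom_ami_id="ami-0123456789abcdef0",
    bootstrap_script="s3://example-bucket/bootstrap/install.sh",
)
```

With valid credentials and the roles in place, a dictionary like this could be passed as `boto3.client("emr").run_job_flow(**request)` to launch the cluster. Keeping the definition separate from the call makes it easy to inspect or version the configuration before anything is provisioned.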
This chapter includes the following recipes:
- Running jobs using AWS...