You're reading from Data Engineering with AWS Cookbook: A recipe-based approach to help you tackle data engineering problems with AWS services

Product type Paperback

Published in Nov 2024

Publisher Packt

ISBN-13 9781805127284

Length 528 pages

Edition 1st Edition

Languages

Python

Tools

AWS Glue

Concepts

Data Engineering

Authors (4):

Viquar Khan

Gonzalo Herreros González

Huda Nofal

Trâm Ngọc Phạm

Scaling your cluster based on workload

A main benefit of running in the cloud rather than on-premises is access to virtually unlimited capacity. When running EMR workloads, you want resources to be available when you need them, but to stay cost-effective you also want to pay for them only while they are in use.

In this recipe, you will see how EMR lets you scale your cluster capacity based on the workload.

Getting ready

This recipe assumes that you have set up the SUBNET environment variable as indicated in the Technical requirements section at the beginning of this chapter.
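
Before creating the cluster, it can help to confirm that the variable is actually set in the current shell. The following is a minimal sketch, not part of the original recipe: the function name check_subnet is hypothetical, and the only assumption about the value is that EC2 subnet IDs start with subnet-.

```shell
# Hypothetical helper (not from the book): verifies that SUBNET is set
# and looks like an EC2 subnet ID before it is passed to the AWS CLI.
check_subnet() {
  # Fail if the variable is unset or empty.
  if [ -z "${SUBNET:-}" ]; then
    echo "SUBNET is not set; see the Technical requirements section" >&2
    return 1
  fi
  # Subnet IDs start with the "subnet-" prefix.
  case "$SUBNET" in
    subnet-*) ;;
    *)
      echo "SUBNET='$SUBNET' does not look like a subnet ID" >&2
      return 1
      ;;
  esac
  echo "Using subnet: $SUBNET"
}
```

Running check_subnet before the commands in this recipe prints the subnet in use, or returns a non-zero status with a message on standard error if the variable is missing or malformed. It does not check that the subnet exists in your account.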

How to do it...

Create a cluster with autoscaling and a 15-minute idle timeout (make sure you use \ only at the end of the lines indicated; the second command will print the cluster ID):

CLUSTER_ID=$(aws emr create-cluster --name AutoScale \
 --release-label emr-7.1.0 --use-default-roles \
 --ec2-attributes SubnetId=${SUBNET} \
 --auto-termination-policy IdleTimeout=900 \
 --applications Name=Spark --instance...