Data Engineering with AWS Cookbook

A recipe-based approach to help you tackle data engineering problems with AWS services

Product type: Paperback
Published: Nov 2024
Publisher: Packt
ISBN-13: 9781805127284
Length: 528 pages
Edition: 1st Edition
Authors (4): Viquar Khan, Gonzalo Herreros González, Huda Nofal, Trâm Ngọc Phạm
Table of Contents (16 chapters)

Preface
Chapter 1: Managing Data Lake Storage
Chapter 2: Sharing Your Data Across Environments and Accounts
Chapter 3: Ingesting and Transforming Your Data with AWS Glue
Chapter 4: A Deep Dive into AWS Orchestration Frameworks
Chapter 5: Running Big Data Workloads with Amazon EMR
Chapter 6: Governing Your Platform
Chapter 7: Data Quality Management
Chapter 8: DevOps – Defining IaC and Building CI/CD Pipelines
Chapter 9: Monitoring Data Lake Cloud Infrastructure
Chapter 10: Building a Serving Layer with AWS Analytics Services
Chapter 11: Migrating to AWS – Steps, Strategies, and Best Practices for Modernizing Your Analytics and Big Data Workloads
Chapter 12: Harnessing the Power of AWS for Seamless Data Warehouse Migration
Chapter 13: Strategizing Hadoop Migrations – Cost, Data, and Workflow Modernization with AWS
Index
Other Books You May Enjoy

What this book covers

Chapter 1, Managing Data Lake Storage, covers the fundamentals of managing S3 buckets. We’ll focus on implementing robust security measures through data encryption and access control, managing costs by optimizing storage tiers and applying retention policies, and utilizing monitoring techniques to ensure timely issue resolution. Additionally, we’ll cover other essential aspects of S3 bucket management.

Chapter 2, Sharing Your Data Across Environments and Accounts, presents methods for securely and efficiently sharing data across different environments and accounts. We will explore strategies for load distribution and collaborative analysis using Redshift data sharing and RDS replicas. We will implement fine-grained access control with Lake Formation and manage Glue data sharing through both Lake Formation and Resource Access Manager (RAM). Additionally, we will discuss real-time sharing via event-driven services, temporary data sharing with S3, and sharing operational data from CloudWatch.

Chapter 3, Ingesting and Transforming Your Data with AWS Glue, explores different features of AWS Glue when building data pipelines and data lakes. It covers the multiple tools and engines provided for the different kinds of users, from visual jobs with little or no code to managed notebooks and jobs using the different data handling APIs provided.

Chapter 4, A Deep Dive into AWS Orchestration Frameworks, explores the essential services and techniques for managing data workflows and pipelines on AWS. You’ll learn how to define a simple workflow using AWS Glue Workflows, set up event-driven orchestration with Amazon EventBridge, and create data workflows with AWS Step Functions. We also cover managing data pipelines using Amazon MWAA, monitoring their health, and setting up a data ingestion pipeline with AWS Glue to bring data from a JDBC database into a catalog table.

Chapter 5, Running Big Data Workloads with Amazon EMR, teaches you how to make the most of your Amazon EMR clusters and explores the service features that make them customizable, efficient, scalable, and robust.

Chapter 6, Governing Your Platform, presents the key aspects of data governance within AWS. This includes data protection techniques such as data masking in Redshift and classifying sensitive information using Amazon Macie. We will also cover ensuring data quality with Glue quality checks. Additionally, we will discuss resource governance to enforce best practices and maintain a secure, compliant infrastructure using AWS Config and resource tagging.

Chapter 7, Data Quality Management, covers how to use AWS Glue Deequ and AWS DataBrew to automate data quality checks and maintain high standards across your datasets. You will learn how to define and enforce data quality rules and monitor data quality metrics. This chapter also provides practical examples and recipes for integrating these tools into your data workflows, ensuring that your data is accurate, complete, and reliable for analysis.

Chapter 8, DevOps – Defining IaC and Building CI/CD Pipelines, explores multiple ways to automate AWS services and CI/CD deployment pipelines, the pros and cons of each tool, and examples of common data product deployments to illustrate DevOps best practices.

Chapter 9, Monitoring Data Lake Cloud Infrastructure, provides a comprehensive guide to the day-to-day operations of a cloud-based data platform. It covers key topics such as monitoring, logging, and alerting using AWS services such as CloudWatch, CloudTrail, and X-Ray. You will learn how to set up dashboards to monitor the health and performance of your data platform, troubleshoot issues, and ensure high availability and reliability. This chapter also discusses best practices for cost management and scaling operations to meet changing demands, making it an essential resource for anyone responsible for the ongoing maintenance and optimization of a data platform.

Chapter 10, Building a Serving Layer with AWS Analytics Services, guides you through the process of building an efficient serving layer using AWS Redshift, Athena, and QuickSight. The serving layer is where your data becomes accessible to end-users for analysis and reporting. In this chapter, you will learn how to load data from your data lake into Redshift, query it using Redshift Spectrum and Athena, and visualize it using QuickSight. This chapter also covers best practices for managing different QuickSight environments and migrating assets between them. By the end of this chapter, you will have the knowledge to create a powerful and user-friendly analytics layer that meets the needs of your organization.

Chapter 11, Migrating to AWS – Steps, Strategies, and Best Practices for Modernizing Your Analytics and Big Data Workloads, presents a theoretical framework for migrating data and workloads to AWS. It explores key concepts, strategies, and best practices for planning and executing a successful migration. You’ll learn about various migration approaches—rehosting, replatforming, and refactoring—and how to choose the best option for your organization’s needs. The chapter also addresses critical challenges and considerations, such as data security, compliance, and minimizing downtime, preparing you to navigate the complexities of cloud migration with confidence.

Chapter 12, Harnessing the Power of AWS for Seamless Data Warehouse Migration, explores the key strategies for efficiently migrating data warehouses to AWS. You’ll learn how to generate a migration assessment report using the AWS Schema Conversion Tool (SCT), extract and transfer data with AWS Database Migration Service (DMS), and handle large-scale migrations with the AWS Snow Family. You’ll also learn how to streamline your data migration, ensuring minimal disruption and maximum efficiency while transitioning to the cloud.

Chapter 13, Strategizing Hadoop Migrations – Cost, Data, and Workflow Modernization with AWS, guides you through essential recipes for migrating your on-premises Hadoop ecosystem to AWS, covering a range of critical tasks. You’ll learn about cost analysis using the AWS Total Cost of Ownership (TCO) calculators and the Hadoop Migration Assessment tool. You’ll also learn how to choose the right storage solution, migrate HDFS data using AWS DataSync, and transition key components such as the Hive Metastore and Apache Oozie workflows to AWS EMR. We also cover setting up a secure network connection to your EMR cluster, seamless HBase migration to AWS, and transitioning HBase to DynamoDB.
