What this book covers
Chapter 1, Managing Data Lake Storage, covers the fundamentals of managing S3 buckets. We’ll focus on implementing robust security measures through data encryption and access control, managing costs by optimizing storage tiers and applying retention policies, and utilizing monitoring techniques to ensure timely issue resolution. Additionally, we’ll cover other essential aspects of S3 bucket management.
Chapter 2, Sharing Your Data Across Environments and Accounts, presents methods for securely and efficiently sharing data across different environments and accounts. We will explore strategies for load distribution and collaborative analysis using Redshift data sharing and RDS replicas. We will implement fine-grained access control with Lake Formation and manage Glue data sharing through both Lake Formation and Resource Access Manager (RAM). Additionally, we will discuss real-time sharing via event-driven services, temporary data sharing with S3, and sharing operational data from CloudWatch.
Chapter 3, Ingesting and Transforming Your Data with AWS Glue, explores the features AWS Glue offers for building data pipelines and data lakes. It covers the tools and engines aimed at different kinds of users, from visual jobs requiring little or no code to managed notebooks and jobs built with Glue's data handling APIs.
Chapter 4, A Deep Dive into AWS Orchestration Frameworks, explores the essential services and techniques for managing data workflows and pipelines on AWS. You’ll learn how to define a simple workflow using AWS Glue Workflows, set up event-driven orchestration with Amazon EventBridge, and create data workflows with AWS Step Functions. We also cover managing data pipelines using Amazon Managed Workflows for Apache Airflow (MWAA), monitoring their health, and setting up a data ingestion pipeline with AWS Glue to bring data from a JDBC database into a catalog table.
Chapter 5, Running Big Data Workloads with Amazon EMR, teaches you how to make the most of your Amazon EMR clusters and explores the service features that make them customizable, efficient, scalable, and robust.
Chapter 6, Governing Your Platform, presents the key aspects of data governance within AWS. This includes data protection techniques such as data masking in Redshift and classifying sensitive information using Amazon Macie. We will also cover ensuring data quality with AWS Glue Data Quality checks. Additionally, we will discuss resource governance to enforce best practices and maintain a secure, compliant infrastructure using AWS Config and resource tagging.
Chapter 7, Data Quality Management, covers how to use AWS Glue Data Quality (powered by the open source Deequ library) and AWS Glue DataBrew to automate data quality checks and maintain high standards across your datasets. You will learn how to define and enforce data quality rules and monitor data quality metrics. This chapter also provides practical examples and recipes for integrating these tools into your data workflows, ensuring that your data is accurate, complete, and reliable for analysis.
Chapter 8, DevOps – Defining IaC and Building CI/CD Pipelines, explores multiple ways to automate AWS services and CI/CD deployment pipelines, weighs the pros and cons of each tool, and walks through common data product deployments to illustrate DevOps best practices.
Chapter 9, Monitoring Data Lake Cloud Infrastructure, provides a comprehensive guide to the day-to-day operations of a cloud-based data platform. It covers key topics, including monitoring, logging, and alerting, using AWS services such as CloudWatch, CloudTrail, and X-Ray. You will learn how to set up dashboards to monitor the health and performance of your data platform, troubleshoot issues, and ensure high availability and reliability. This chapter also discusses best practices for cost management and scaling operations to meet changing demands, making it an essential resource for anyone responsible for the ongoing maintenance and optimization of a data platform.
Chapter 10, Building a Serving Layer with AWS Analytics Services, guides you through the process of building an efficient serving layer using Amazon Redshift, Athena, and QuickSight. The serving layer is where your data becomes accessible to end users for analysis and reporting. In this chapter, you will learn how to load data from your data lake into Redshift, query it using Redshift Spectrum and Athena, and visualize it using QuickSight. This chapter also covers best practices for managing different QuickSight environments and migrating assets between them. By the end of this chapter, you will have the knowledge to create a powerful and user-friendly analytics layer that meets the needs of your organization.
Chapter 11, Migrating to AWS – Steps, Strategies, and Best Practices for Modernizing Your Analytics and Big Data Workloads, presents a theoretical framework for migrating data and workloads to AWS. It explores key concepts, strategies, and best practices for planning and executing a successful migration. You’ll learn about various migration approaches—rehosting, replatforming, and refactoring—and how to choose the best option for your organization’s needs. The chapter also addresses critical challenges and considerations, such as data security, compliance, and minimizing downtime, preparing you to navigate the complexities of cloud migration with confidence.
Chapter 12, Harnessing the Power of AWS for Seamless Data Warehouse Migration, explores the key strategies for efficiently migrating data warehouses to AWS. You’ll learn how to generate a migration assessment report using the AWS Schema Conversion Tool (SCT), extract and transfer data with AWS Database Migration Service (DMS), and handle large-scale migrations with the AWS Snow Family. You’ll also learn how to streamline your data migration, ensuring minimal disruption and maximum efficiency while transitioning to the cloud.
Chapter 13, Strategizing Hadoop Migrations – Cost, Data, and Workflow Modernization with AWS, guides you through essential recipes for migrating your on-premises Hadoop ecosystem to AWS, covering a range of critical tasks. You’ll learn about cost analysis using the AWS Total Cost of Ownership (TCO) calculators and the Hadoop Migration Assessment tool. You’ll also learn how to choose the right storage solution, migrate HDFS data using AWS DataSync, and transition key components such as the Hive Metastore and Apache Oozie workflows to Amazon EMR. We also cover setting up a secure network connection to your EMR cluster, migrating HBase to AWS seamlessly, and transitioning HBase workloads to DynamoDB.