You're reading from Data Engineering with AWS Cookbook A recipe-based approach to help you tackle data engineering problems with AWS services

Product type Paperback

Published in Nov 2024

Publisher Packt

ISBN-13 9781805127284

Length 528 pages

Edition 1st Edition

Languages

Python

Tools

AWS Glue

Concepts

Data Engineering

Authors (4):

Viquar Khan

Gonzalo Herreros González

Huda Nofal

Trâm Ngọc Phạm

Replicating your data

Amazon S3 replication is an automatic, asynchronous process that copies objects from a source bucket to one or more destination buckets. Replication can be configured between buckets in the same AWS region with Same-Region Replication (SRR), which is useful for scenarios such as isolating workloads, segregating data for different teams, or meeting compliance requirements. It can also be configured between buckets in different AWS regions with Cross-Region Replication (CRR), which reduces data access latency, especially for enterprises with many locations, by maintaining copies of the objects in different geographies. CRR also supports compliance and provides data redundancy for improved performance, availability, and disaster recovery.

In this recipe, we’ll learn how to set up replication between two buckets that are in different AWS regions but belong to the same AWS account.

Getting ready

You need to have an S3 bucket in the destination AWS region to act as a target for the replication. Also, S3 versioning must be enabled for both the source and destination buckets.
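
If you would rather script this prerequisite than click through the console, versioning can be switched on with boto3's `put_bucket_versioning` call. The following is a minimal sketch rather than part of the original recipe: the helper names are made up for illustration, the bucket names and regions in the trailing comment are placeholders, and it assumes boto3 is installed with credentials that are allowed to change bucket versioning.

```python
def versioning_request(bucket_name):
    """Build the keyword arguments for the PutBucketVersioning call."""
    return {
        "Bucket": bucket_name,
        "VersioningConfiguration": {"Status": "Enabled"},
    }


def enable_versioning(s3_client, bucket_name):
    """Enable versioning on one bucket using an existing boto3 S3 client."""
    return s3_client.put_bucket_versioning(**versioning_request(bucket_name))


# Example usage (hypothetical bucket names and regions; needs boto3 and
# AWS credentials, so it is left as a comment here):
#
#   import boto3
#   enable_versioning(boto3.client("s3", region_name="us-east-1"),
#                     "my-source-bucket")
#   enable_versioning(boto3.client("s3", region_name="eu-west-1"),
#                     "my-destination-bucket")
```

Keeping the request builder separate from the call makes it easy to inspect what would be sent before touching a real bucket.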

How to do it…

Sign in to the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the S3 service.
In the Buckets list, choose the source bucket you want to replicate.
Go to the Management tab and select Create replication rule under Replication rules.
Under Replication rule name in the Replication rule configuration section, give your rule a unique name.
Under Status, either keep it Enabled for the rule to take effect once you save it or change it to Disabled to enable it later as required:

Figure 1.10 – Replication rule configuration

If this is the first replication rule for the bucket, Priority will be set to 0, and each rule you add afterward is assigned a higher priority number. When more than one rule with the same destination applies to an object, the rule with the highest priority number takes precedence, which is typically the one created last. If you wish to control the priority of each rule yourself, you can do so by defining the rules in XML. For guidance on how to configure this, refer to the See also section.
In the Source bucket section, you have the option to replicate all objects in the bucket by selecting Apply to all objects in the bucket or you can narrow it down to specific objects by selecting Limit the scope of this rule using one or more filters and specifying a Prefix value (for example, logs_ or logs/) to filter objects. Additionally, you have the option to replicate objects based on their tags. Simply choose Add tag and input key-value pairs. This process can be repeated so that you can include multiple tags:

Figure 1.11 – Source bucket configuration

Under Destination, select Choose a bucket in this account and enter or browse for the destination bucket name.
Under IAM role, select Choose from existing IAM roles, then choose Create new role from the drop-down list.
Under Destination storage class, you can select Change the storage class for the replicated objects and choose one of the storage classes to be set for the replicated objects in the destination bucket.
Click on Save to save your changes.
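
The console steps above map onto a single `put_bucket_replication` request. The sketch below shows roughly what an equivalent boto3 configuration could look like; it is an illustration under assumptions, not the book's own code. The helper names, bucket names, and role ARN are hypothetical. Unlike the console, the API does not create an IAM role for you, so the sketch assumes a suitable replication role already exists.

```python
def build_filter(prefix=None, tags=None):
    """Translate the console's prefix/tag scope into a rule Filter."""
    tag_list = [{"Key": k, "Value": v} for k, v in (tags or {}).items()]
    if prefix and tag_list:
        return {"And": {"Prefix": prefix, "Tags": tag_list}}
    if len(tag_list) > 1:
        return {"And": {"Tags": tag_list}}
    if tag_list:
        return {"Tag": tag_list[0]}
    # An empty prefix matches every object in the bucket.
    return {"Prefix": prefix or ""}


def build_rule(rule_id, destination_bucket, priority=0, prefix=None,
               tags=None, storage_class=None, enabled=True):
    """Build one replication rule mirroring the console fields."""
    destination = {"Bucket": f"arn:aws:s3:::{destination_bucket}"}
    if storage_class:
        # Optional: store replicas in a different storage class.
        destination["StorageClass"] = storage_class
    return {
        "ID": rule_id,
        "Priority": priority,
        "Status": "Enabled" if enabled else "Disabled",
        "Filter": build_filter(prefix, tags),
        # Rules that use Filter must state the delete marker setting.
        "DeleteMarkerReplication": {"Status": "Disabled"},
        "Destination": destination,
    }


def put_replication(s3_client, source_bucket, role_arn, rules):
    """Attach the rules to the source bucket with an existing S3 client."""
    return s3_client.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration={"Role": role_arn, "Rules": rules},
    )


# Example usage (hypothetical names; needs boto3 and AWS credentials):
#
#   import boto3
#   rule = build_rule("replicate-logs", "my-destination-bucket",
#                     prefix="logs/", storage_class="STANDARD_IA")
#   put_replication(boto3.client("s3"), "my-source-bucket",
#                   "arn:aws:iam::111122223333:role/my-replication-role",
#                   [rule])
```

Note that `put_bucket_replication` replaces the bucket's whole replication configuration, so any existing rules you want to keep need to be included in the `rules` list.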

How it works…

By adding this replication rule, you grant Amazon S3, through the rule's IAM role, permission to replicate objects from the source bucket to the destination bucket in the other region. Once the rule is enabled, several background processes keep the two buckets in step. S3 continuously monitors changes to objects in your source bucket. When a change is detected, S3 generates a replication request for the corresponding objects and transfers the data from the source to the destination bucket. Because the rule acts on objects written after it takes effect, objects that already existed in the source bucket are not copied by the rule alone. Once replication is complete, the destination bucket contains a copy of the replicated objects with the same metadata and, by default, the same ownership and permissions as the source objects.
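
Because replication is asynchronous, it can be handy to check where a given object stands. S3 reports this in the `ReplicationStatus` field of a `head_object` response. The sketch below interprets that field; the function names and the wording of the descriptions are illustrative assumptions, and the commented usage relies on placeholder bucket and key names.

```python
# Human-readable meanings for the ReplicationStatus values S3 can report.
STATUS_DESCRIPTIONS = {
    "PENDING": "replication is queued or in progress",
    "COMPLETED": "replicated to the destination",
    "COMPLETE": "replicated to the destination",
    "FAILED": "replication failed",
    "REPLICA": "this object is itself a replica",
}


def describe_replication(head_response):
    """Interpret the ReplicationStatus field of a HeadObject response."""
    status = head_response.get("ReplicationStatus")
    if status is None:
        # S3 omits the field when the object is not eligible for replication.
        return "no replication status reported"
    return STATUS_DESCRIPTIONS.get(status, f"unrecognized status: {status}")


def replication_state(s3_client, bucket, key):
    """Fetch an object's metadata and describe its replication state."""
    return describe_replication(s3_client.head_object(Bucket=bucket, Key=key))


# Example usage (hypothetical names; needs boto3 and AWS credentials):
#
#   import boto3
#   print(replication_state(boto3.client("s3"),
#                           "my-source-bucket", "logs/2024/app.log"))
```

Run against the source bucket, this shows whether an object has been replicated yet; run against the destination bucket, replicated objects report themselves as replicas.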

There’s more…

You can enable additional options for the rule under Additional replication options. The Replication metrics option lets you monitor replication progress with S3 Replication metrics, which track bytes pending, operations pending, and replication latency. The Replication Time Control (RTC) option is useful if you have a strict service-level agreement (SLA) for data replication, as it is designed to replicate 99.99% of new objects within 15 minutes. It also enables replication metrics, so you can be notified when an object takes longer than that to replicate. The Delete marker replication option replicates the delete markers that are created when versioned objects are deleted from the source bucket. Finally, the Replica modification sync option replicates metadata changes made to replica objects back to the source objects.
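
In the replication configuration that boto3 sends, each of these options corresponds to a field on the rule. The following sketch shows one way they could be switched on for an existing rule dictionary; the helper name and the sample rule are hypothetical, and the sketch only builds the dictionary, leaving the actual `put_bucket_replication` call to you.

```python
import copy


def with_additional_options(rule, metrics=False, rtc=False,
                            delete_markers=False, replica_sync=False):
    """Return a copy of a replication rule with extra options enabled."""
    updated = copy.deepcopy(rule)
    destination = updated.setdefault("Destination", {})
    if rtc:
        # Replication Time Control uses a fixed 15-minute window.
        destination["ReplicationTime"] = {
            "Status": "Enabled",
            "Time": {"Minutes": 15},
        }
    if metrics or rtc:
        # RTC needs metrics, so they are enabled together with it.
        destination["Metrics"] = {
            "Status": "Enabled",
            "EventThreshold": {"Minutes": 15},
        }
    if delete_markers:
        updated["DeleteMarkerReplication"] = {"Status": "Enabled"}
    if replica_sync:
        criteria = updated.setdefault("SourceSelectionCriteria", {})
        criteria["ReplicaModifications"] = {"Status": "Enabled"}
    return updated


# Hypothetical starting rule, in the shape put_bucket_replication expects.
base_rule = {
    "ID": "replicate-logs",
    "Priority": 0,
    "Status": "Enabled",
    "Filter": {"Prefix": "logs/"},
    "DeleteMarkerReplication": {"Status": "Disabled"},
    "Destination": {"Bucket": "arn:aws:s3:::my-destination-bucket"},
}

rtc_rule = with_additional_options(base_rule, rtc=True, delete_markers=True)
```

Working on a deep copy leaves `base_rule` unchanged, so the same starting rule can be reused to compare variants before one of them is applied.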