You're reading from Data Engineering with AWS Cookbook A recipe-based approach to help you tackle data engineering problems with AWS services

Product type Paperback

Published in Nov 2024

Publisher Packt

ISBN-13 9781805127284

Length 528 pages

Edition 1st Edition

Languages

Python

Tools

AWS Glue

Concepts

Data Engineering

Authors (4):

Viquar Khan

Gonzalo Herreros González

Huda Nofal

Trâm Ngọc Phạm

Table of Contents (16) Chapters

Preface

1. Chapter 1: Managing Data Lake Storage

2. Chapter 2: Sharing Your Data Across Environments and Accounts

3. Chapter 3: Ingesting and Transforming Your Data with AWS Glue

4. Chapter 4: A Deep Dive into AWS Orchestration Frameworks

5. Chapter 5: Running Big Data Workloads with Amazon EMR

6. Chapter 6: Governing Your Platform

7. Chapter 7: Data Quality Management

8. Chapter 8: DevOps – Defining IaC and Building CI/CD Pipelines

9. Chapter 9: Monitoring Data Lake Cloud Infrastructure

10. Chapter 10: Building a Serving Layer with AWS Analytics Services

11. Chapter 11: Migrating to AWS – Steps, Strategies, and Best Practices for Modernizing Your Analytics and Big Data Workloads

12. Chapter 12: Harnessing the Power of AWS for Seamless Data Warehouse Migration

13. Chapter 13: Strategizing Hadoop Migrations – Cost, Data, and Workflow Modernization with AWS

14. Index

15. Other Books You May Enjoy

Setting up retention policies for your objects

Amazon S3 Lifecycle lets you manage the objects in an S3 bucket through predefined rules. A lifecycle rule supports two kinds of actions: transitions and expiration. A transition automatically moves objects to a different storage class once they reach a defined age, which reduces costs by keeping less frequently accessed data in a cheaper storage class. An expiration automatically deletes objects from the bucket once they reach a defined age. You can also combine both actions in one rule, so that objects are first moved to a cheaper class and deleted later. Together, these actions help organizations keep storage costs in line with the relevance and access patterns of their data.

In this recipe, we will learn how to set up a lifecycle policy to archive objects in S3 Glacier after a certain period and then expire them.

Getting ready

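To complete this recipe, you need an S3 bucket that contains some objects. Lifecycle transitions to the S3 Glacier storage classes are handled by S3 itself, and the archived objects stay in the bucket. A Glacier vault, by contrast, is a separate storage container that can be used to store archives, independent from S3, and lifecycle rules do not reference it. If you also want to work with a vault directly, you can create one by following these steps: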

1. Open the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the Glacier service.
2. Click on Create vault to start creating a new Glacier vault.
3. Provide a unique and descriptive name for your vault in the Vault name field.
4. Optionally, choose to receive notifications for events by clicking Turn on notifications under the Event notifications section.
5. Click on Create to create the vault.
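
The vault creation steps above can also be scripted. The following is a minimal sketch rather than the book's own code: it assumes you pass in a boto3 Glacier client (`boto3.client("glacier")`) with credentials already configured, and the vault name `my-archive-vault` used below is a hypothetical placeholder. The name check reflects the documented vault naming rules (1 to 255 characters; letters, digits, underscore, hyphen, and period).

```python
import re

# Vault names: 1-255 characters from a-z, A-Z, 0-9, '_', '-', and '.'
_VAULT_NAME = re.compile(r"[A-Za-z0-9_.\-]{1,255}")


def create_vault(glacier_client, vault_name):
    """Create a Glacier vault and return the location path from the response.

    `glacier_client` is assumed to be a boto3 Glacier client (or any object
    exposing a compatible `create_vault` method).
    """
    if not _VAULT_NAME.fullmatch(vault_name):
        raise ValueError(f"invalid vault name: {vault_name!r}")
    # accountId="-" tells the service to use the account of the caller
    response = glacier_client.create_vault(accountId="-", vaultName=vault_name)
    return response["location"]
```

With boto3 installed, a call would look like `create_vault(boto3.client("glacier"), "my-archive-vault")`. Event notifications are not configured by this sketch.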

How to do it…

1. Open the AWS Management Console (https://console.aws.amazon.com/console/home?nc2=h_ct&src=header-signin) and navigate to the S3 service.
2. Select the bucket for which you want to configure the lifecycle policy and navigate to the Management tab.
3. In the left panel, select Lifecycle and click on Create lifecycle rule.
4. Under Rule name, enter a name that identifies the lifecycle rule.
5. Under Choose a rule scope, choose either Apply to all objects in the bucket or Limit the scope of this rule using one or more filters. With the second option, you specify the objects to which the rule applies by using one of the following filters or a combination of them:
- Filter objects based on prefixes (for example, logs)
- Filter objects based on tags; you can add multiple key-value pair tags to filter on
- Filter objects based on object size by setting Specify minimum object size and/or Specify maximum object size and specifying the size value and unit
The following screenshot shows a rule that’s been restricted to a set of objects based on a prefix:

Figure 1.4 – Lifecycle rule configuration

6. Under Lifecycle rule actions, select the following options:
- Move current versions of objects between storage classes. Then, choose one of the Glacier classes and, under Days after object creation, set the number of days after which the object will be transitioned (for example, 60).
- Expire current versions of objects. Then, under Days after object creation, set the number of days after which the object will expire. Choose a value higher than the one you set for transitioning the object to Glacier (for example, 100).
7. Review the transition and expiration actions you have set and click on Create rule to apply the lifecycle policy to the bucket:

Figure 1.5 – Reviewing the lifecycle rule
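
The console steps above can also be expressed programmatically. The following is a minimal sketch, not the book's own code, assuming the boto3 SDK is installed and credentials are configured. The rule ID, the `logs/` prefix, and the bucket name `my-example-bucket` are hypothetical placeholders, and the helper names are made up for this illustration. It builds a rule with a scope filter, a 60-day transition to the `GLACIER` storage class, and a 100-day expiration, in the dictionary shape that `put_bucket_lifecycle_configuration` accepts.

```python
def build_filter(prefix=None, tags=None, min_size=None, max_size=None):
    """Build the rule scope: one condition alone, or several wrapped in 'And'."""
    conditions = {}
    if prefix is not None:
        conditions["Prefix"] = prefix
    if min_size is not None:
        conditions["ObjectSizeGreaterThan"] = min_size  # size in bytes
    if max_size is not None:
        conditions["ObjectSizeLessThan"] = max_size  # size in bytes
    tag_list = [{"Key": k, "Value": v} for k, v in (tags or {}).items()]

    total = len(conditions) + len(tag_list)
    if total == 0:
        return {"Prefix": ""}  # an empty prefix matches every object
    if total == 1:
        return {"Tag": tag_list[0]} if tag_list else conditions
    if tag_list:
        conditions["Tags"] = tag_list
    return {"And": conditions}


def build_rule(rule_id, scope, transition_days=60, expiration_days=100,
               storage_class="GLACIER"):
    """Return one lifecycle rule that archives objects and later expires them."""
    if expiration_days <= transition_days:
        raise ValueError("expiration_days must be greater than transition_days")
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": scope,
        "Transitions": [{"Days": transition_days, "StorageClass": storage_class}],
        "Expiration": {"Days": expiration_days},
    }


def apply_rules(bucket, rules):
    """Send the rules to S3; this replaces the bucket's existing configuration."""
    import boto3  # imported here so the builders above work without the SDK

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": rules},
    )


rule = build_rule("archive-then-expire-logs", build_filter(prefix="logs/"))
print(rule)
```

To apply the rule, you would call `apply_rules("my-example-bucket", [rule])`. Because the call replaces the whole lifecycle configuration of the bucket, any existing rules you want to keep need to be included in the list.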

Note

It may take some time for the lifecycle rule to be applied to all the selected objects, depending on the size of the bucket and the number of objects. The rule affects existing objects, not just new ones, so make sure that no applications depend on reading objects that will be archived or deleted. Depending on the Glacier class you chose, archived objects may need to be restored before they can be read, and expired objects are removed from the bucket.
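
Because the rule also applies to existing objects, it can help to preview which objects would be affected before creating it. The following is a rough sketch under stated assumptions: the function name and sample keys are hypothetical, ages are counted in whole days (which only approximates how S3 evaluates rules), and in practice the `(key, last_modified)` pairs would come from an object listing such as the `Key` and `LastModified` fields returned by boto3's `list_objects_v2`.

```python
from datetime import datetime, timedelta, timezone


def preview_rule(objects, prefix, transition_days=60, expiration_days=100, now=None):
    """Group object keys by what a prefix-scoped rule would do to them today.

    `objects` is an iterable of (key, last_modified) pairs, where
    last_modified is a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    preview = {"expire": [], "transition": [], "no_action_yet": []}
    for key, last_modified in objects:
        age_days = (now - last_modified).days
        if not key.startswith(prefix) or age_days < transition_days:
            action = "no_action_yet"
        elif age_days < expiration_days:
            action = "transition"
        else:
            action = "expire"
        preview[action].append(key)
    return preview


# Hypothetical sample listing, relative to a fixed "now" for repeatability
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
sample = [
    ("logs/january.gz", now - timedelta(days=150)),
    ("logs/march.gz", now - timedelta(days=70)),
    ("logs/may.gz", now - timedelta(days=5)),
    ("data/january.csv", now - timedelta(days=150)),
]
print(preview_rule(sample, "logs/", now=now))
```

Any key listed under `expire` or `transition` is worth checking against the applications that read from the bucket before the rule is created.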

How it works…

After you save the lifecycle rule, Amazon S3 periodically evaluates it to find objects that meet the criteria it specifies. In this recipe, each object remains in its original storage class for the specified period (for example, 60 days), after which it is automatically moved to the Glacier storage class you selected. This transition is handled transparently, and the object’s metadata and properties remain unchanged. Transitioned objects remain S3 objects in the same bucket and are managed through S3; they are not placed in a Glacier vault that you created. Objects then stay in the Glacier storage class for the rest of the expiration period (for example, 40 days, the difference between 100 and 60), after which they expire and are permanently deleted from your S3 bucket.

Please note that expired objects are queued for deletion rather than removed immediately, so it might take a few days after an object reaches the end of its lifetime for it to be deleted.
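
The timeline described above can be worked out for a single object. The following is a minimal sketch, assuming the rounding behavior described in the S3 documentation (the number of days is added to the creation time and the result is rounded up to the next midnight UTC). The function name and the sample creation time are hypothetical, and the actual moment S3 acts on an object may be later than the computed one.

```python
from datetime import datetime, time, timedelta, timezone


def eligible_at(created, days):
    """Estimate when an action set to `days` after creation becomes applicable.

    `created` must be a timezone-aware datetime.
    """
    due = created.astimezone(timezone.utc) + timedelta(days=days)
    next_day = due.date() + timedelta(days=1)
    return datetime.combine(next_day, time.min, tzinfo=timezone.utc)


created = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
transition_at = eligible_at(created, 60)    # moved to the Glacier class
expiration_at = eligible_at(created, 100)   # expired and queued for deletion

print("transition:", transition_at.isoformat())
print("expiration:", expiration_at.isoformat())
print("days spent in Glacier:", (expiration_at - transition_at).days)
```

With the recipe's example values, the gap between the two dates is the 40 days the object spends in the Glacier storage class.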

There’s more…

Lifecycle configuration can also be specified as XML when using the S3 API (the AWS CLI accepts the equivalent JSON), which can be helpful if you are planning on using the same lifecycle rules on multiple buckets. You can read more on setting this up at https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html.
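
As an illustration of the XML form, the following sketch generates a lifecycle configuration document equivalent to the rule created in this recipe. It uses only Python's standard library; the function name, rule ID, and `logs/` prefix are hypothetical, and the element names follow the lifecycle configuration elements described in the S3 user guide.

```python
import xml.etree.ElementTree as ET


def lifecycle_xml(rule_id, prefix, transition_days=60, expiration_days=100,
                  storage_class="GLACIER"):
    """Return a lifecycle configuration with one archive-then-expire rule."""
    root = ET.Element("LifecycleConfiguration")
    rule = ET.SubElement(root, "Rule")
    ET.SubElement(rule, "ID").text = rule_id
    scope = ET.SubElement(rule, "Filter")
    ET.SubElement(scope, "Prefix").text = prefix
    ET.SubElement(rule, "Status").text = "Enabled"
    transition = ET.SubElement(rule, "Transition")
    ET.SubElement(transition, "Days").text = str(transition_days)
    ET.SubElement(transition, "StorageClass").text = storage_class
    expiration = ET.SubElement(rule, "Expiration")
    ET.SubElement(expiration, "Days").text = str(expiration_days)
    return ET.tostring(root, encoding="unicode")


print(lifecycle_xml("archive-then-expire-logs", "logs/"))
```

Keeping the configuration as a document like this makes it straightforward to store it in version control and send the same rules to each bucket that should share them.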
