Deep diving into Amazon Athena
As mentioned previously, Amazon Athena is quite flexible and can handle both simple and complex queries using standard SQL, including joins and arrays. It can read a wide variety of file formats, including these:
- CSV
- JSON
- ORC
- Avro
- Parquet
It also supports other formats, but these are the most common. In some cases, the files you are querying already exist, and you may have little flexibility regarding their format. But when you can choose the file format, it's important to understand the advantages and disadvantages of each. In some cases, it may even make sense to convert existing files to another format before querying them with Amazon Athena. Let's take a quick look at these formats and understand when to use them.
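Before we do, here is a minimal sketch of the kind of standard SQL Athena can run, combining a join with an array expansion via UNNEST. The customers and orders tables and their columns are hypothetical:

SELECT c.customer_name, o.order_id, item
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
CROSS JOIN UNNEST(o.items) AS t (item)
WHERE o.order_total > 100;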
CSV files
A Comma-Separated Value (CSV) file is a file in which a separator character, typically a comma, delineates each value and a newline character delineates each record or row. Remember that the separator does not necessarily have to be a comma; other common delimiters are tabs and the pipe character (|).
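As a sketch, a pipe-delimited file stored in S3 might be exposed to Athena with DDL along these lines (the table name, columns, and bucket location are hypothetical):

CREATE EXTERNAL TABLE sales_csv (
  order_id string,
  order_date string,
  amount double
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://my-example-bucket/sales/csv/';

Changing the FIELDS TERMINATED BY character is all it takes to switch between commas, tabs, and pipes.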
JSON files
JavaScript Object Notation (JSON) is an open-standard file format. One of its advantages is that it's fairly simple to read, especially when it's indented and formatted. It is often used as a replacement for the Extensible Markup Language (XML) file format, which, while similar in purpose, is more verbose and harder to read. A JSON document consists of a series of potentially nested attribute-value pairs.
JSON is a language-agnostic data format. It was initially used with JavaScript, but most programming languages now either support it natively or provide libraries to create and parse JSON-formatted data.
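To give a rough idea of how this looks in Athena, JSON records (one object per line) can be mapped to table columns using a JSON SerDe. The following sketch uses the OpenX JSON SerDe with hypothetical table, column, and bucket names:

CREATE EXTERNAL TABLE sales_json (
  order_id string,
  customer struct<name:string, city:string>,
  amount double
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-example-bucket/sales/json/';

Nested attributes such as customer.name can then be queried directly using dot notation.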
IMPORTANT NOTE
The first two formats we mentioned are not compressed and are not optimized for use with Athena or for speeding up queries. The remaining formats we will analyze are all optimized for fast retrieval and querying when used with Amazon Athena and other file-querying technologies.
ORC files
The Optimized Row Columnar (ORC) file format provides an efficient way to store data in columns. It was initially designed under the Apache Hive and Hadoop projects to overcome the issues and limitations of other file formats. ORC files provide better performance than uncompressed formats for reading, writing, and processing data.
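One common way to get ORC-formatted data in front of Athena is a CREATE TABLE AS SELECT (CTAS) statement that converts an existing table. The following is a sketch with hypothetical table names and bucket location:

CREATE TABLE sales_orc
WITH (
  format = 'ORC',
  external_location = 's3://my-example-bucket/sales/orc/'
) AS
SELECT *
FROM sales_csv;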
Apache Avro files
Apache Avro is an open-source file format used to serialize data. It was originally designed for the Apache Hadoop project.
Apache Avro stores its schema in JSON format, which makes it easy for users to read and interpret, while the data itself is persisted in a compact and efficient binary format. An Avro file can use markers to split big datasets into smaller blocks, which simplifies parallel processing. Some serialization formats require a code generator that processes the file schema to produce access code; Apache Avro doesn't need this step, making it suitable for scripting languages.
An essential Avro characteristic is its support for dynamic data schemas that can be modified over time. Avro can handle schema changes such as missing, new, and modified fields. Because of this, old scripts can process new data, and new scripts can process old data. Avro has APIs for the following languages, among others:
- Python
- Go
- Ruby
- Java
- C
- C++
Avro-formatted data can flow from one program to another even if the programs are written in different languages.
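In Athena, an Avro-backed table is typically declared with the Avro SerDe and the schema supplied as a JSON literal. The sketch below uses hypothetical names and a deliberately tiny schema:

CREATE EXTERNAL TABLE sales_avro (
  order_id string,
  amount double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal' = '
{
  "type": "record",
  "name": "sale",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}
')
STORED AS AVRO
LOCATION 's3://my-example-bucket/sales/avro/';

The JSON schema is also where Avro's schema evolution shows up: fields can be added or modified over time while readers of older and newer data continue to work.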
Apache Parquet files
Just because we are covering Parquet files last, don't assume they are an afterthought. Parquet is an immensely popular format to use in combination with Amazon Athena.
Apache Parquet is an open-source file format with an efficient and performant design. It stores file contents in a flat columnar storage format, in contrast to the row-based approach used by comma- and tab-delimited files such as CSV and TSV.
Parquet is powered by a record shredding and assembly algorithm that is more efficient than simply flattening nested namespaces. Apache Parquet is well suited to operating on complex data at scale because it uses efficient data compression. This layout is ideal for queries that need to read only a few columns from a table with many columns: Apache Parquet can locate and scan just those columns, significantly reducing the amount of data that must be retrieved.
In general, columnar storage and Apache Parquet deliver higher efficiency than a row-based approach such as CSV. When performing reads, a columnar storage method efficiently skips over non-relevant columns and rows, so aggregation queries take less time than they do against row-oriented storage. This results in lower billing and higher performance for data access.
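As an illustration of why this affects billing, Athena charges by the amount of data scanned, so an aggregation that touches only two columns of a Parquet table reads only those column chunks. The table and column names here are hypothetical:

SELECT order_date, SUM(amount) AS total_sales
FROM sales_parquet
GROUP BY order_date;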
Apache Parquet supports complex nested data structures. Parquet files are ideal for queries retrieving large amounts of data and can handle files that contain gigabytes of data without much difficulty.
Apache Parquet is built to support a variety of encoding and compression algorithms. Parquet is well suited to situations where columns contain similar data types, which can make accessing and scanning files quite efficient. Apache Parquet works with various compression codecs, enabling files to be compressed in different ways.
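Putting compression and the columnar layout together, a CTAS statement can convert existing data to Snappy-compressed Parquet. This is a sketch with hypothetical names, using the parquet_compression table property:

CREATE TABLE sales_parquet
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://my-example-bucket/sales/parquet/'
) AS
SELECT *
FROM sales_csv;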
In addition to Amazon Athena, Apache Parquet works with other query technologies such as Google BigQuery, Google Dataproc, and Amazon Redshift Spectrum.