AWS for Solutions Architects

You're reading from AWS for Solutions Architects, Second Edition: The definitive guide to AWS Solutions Architecture for migrating to, building, scaling, and succeeding in the cloud.

Product type: Paperback | Published: Apr 2023 | Publisher: Packt | ISBN-13: 9781803238951 | Length: 692 pages | Edition: 2nd Edition

Authors (4): Neelanjali Srivastav, Saurabh Shrivastava, Alberto Artasanchez, Imtiaz Sayed

Deep diving into Amazon Athena

As mentioned previously, Amazon Athena is quite flexible and can handle both simple and complex queries using standard SQL, including joins and arrays. It can query data in a wide variety of file formats, including these:

  • CSV
  • JSON
  • ORC
  • Avro
  • Parquet

It also supports other formats, but these are the most common. In some cases, the files you are querying have already been created, and you may have little flexibility regarding their format. But when you can specify the file format, it's important to understand the advantages and disadvantages of each. In some cases, it may even make sense to convert the files to another format before using Amazon Athena. Let's take a quick look at these formats and understand when to use them.
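Before we look at the individual formats, here is a minimal sketch of how a standard SQL query can be submitted to Athena programmatically with boto3, the AWS SDK for Python. The database, table, and S3 bucket names are placeholders for illustration, not values from this chapter.

import time

import boto3  # AWS SDK for Python

athena = boto3.client("athena", region_name="us-east-1")

# Placeholder database, table, and bucket names -- replace with your own.
QUERY = "SELECT customer_id, SUM(amount) AS total FROM sales GROUP BY customer_id LIMIT 10"

# Start the query; Athena writes its results to the S3 location you specify.
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "my_analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])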

CSV files

A Comma-Separated Values (CSV) file is a file in which a separator character, typically a comma, delineates each value, and a newline character delineates each record, or row. Remember that the separator does not necessarily have to be a comma; other common delimiters are tabs and the pipe character (|).
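As a minimal sketch (the file name and column values are made up for illustration), Python's built-in csv module can write and read delimited files with any separator:

import csv

# Write a pipe-delimited file (the "CSV" layout with a custom separator).
rows = [["id", "name", "city"], ["1", "Alice", "Seattle"], ["2", "Bob", "Austin"]]
with open("customers.psv", "w", newline="") as f:
    csv.writer(f, delimiter="|").writerows(rows)

# Read it back, telling the reader which delimiter to expect.
with open("customers.psv", newline="") as f:
    for record in csv.reader(f, delimiter="|"):
        print(record)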

JSON files

JavaScript Object Notation (JSON) is an open-standard file format. One of its advantages is that it's fairly easy to read, especially when it's indented and formatted. It is often used as a replacement for the Extensible Markup Language (XML) file format, which serves a similar purpose but is more difficult to read. A JSON document consists of a series of potentially nested attribute-value pairs.

JSON is a language-agnostic data format. It was initially used with JavaScript, but quite a few programming languages now provide native support for it or provide libraries to create and parse JSON-formatted data.
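The following short sketch shows nested attribute-value pairs being serialized and parsed with Python's standard json library; the record structure is just an example:

import json

# A nested attribute-value structure, serialized to human-readable JSON.
order = {
    "order_id": 1001,
    "customer": {"name": "Alice", "city": "Seattle"},
    "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}],
}

text = json.dumps(order, indent=2)   # indentation makes it easy to read
print(text)

parsed = json.loads(text)            # parse it back into native objects
print(parsed["customer"]["name"])    # -> Alice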

IMPORTANT NOTE

The first two formats we mentioned are not compressed and are not optimized for use with Athena, so they do nothing to speed up queries. The remaining formats we will analyze are all optimized for fast retrieval and querying when used with Amazon Athena and other file-querying technologies.

ORC files

The Optimized Row Columnar (ORC) file format provides an efficient way to store data. It was originally developed within the Apache Hive and Hadoop projects to overcome the issues and limitations of other file formats. ORC files provide better performance than uncompressed formats for reading, writing, and processing data.
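As a brief sketch, a reasonably recent pyarrow installation (an assumption on our part; the column names are illustrative) can write and read ORC files directly:

import pyarrow as pa
import pyarrow.orc as orc

# Build a small in-memory table and persist it as ORC.
table = pa.table({"event_id": [1, 2, 3], "latency_ms": [12.5, 9.8, 30.1]})
orc.write_table(table, "events.orc")

# Read it back; ORC stores data column by column, much like Parquet.
restored = orc.read_table("events.orc")
print(restored.to_pydict())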

Apache Avro files

Apache Avro is an open-source file format used to serialize data. It was originally designed for the Apache Hadoop project.

Apache Avro stores its schema in JSON format, allowing users of the files to read and interpret it easily, while the data itself is persisted in a compact, efficient binary format. An Avro file can use markers to divide big datasets into smaller chunks to simplify parallel processing. Some serialization formats require a code generator that processes the file schema to generate code that enables access; Apache Avro doesn't need this, making it suitable for scripting languages.

An essential Avro characteristic is its support for dynamic data schemas that can change over time. Avro can handle schema changes such as missing, new, and modified fields. Because of this, old scripts can process new data, and new scripts can process old data. Avro has APIs for the following languages, among others:

  • Python
  • Go
  • Ruby
  • Java
  • C
  • C++

Avro-formatted data can flow from one program to another even if the programs are written in different languages.
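To make the schema-plus-binary-data idea concrete, here is a small sketch using the fastavro library (one of several Avro libraries for Python; the record layout is made up), including a field declared with a default value so that data written before the field existed can still be read:

from fastavro import parse_schema, reader, writer

# The schema is declared in JSON; the records themselves are stored as binary.
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": "string"},
        # Added later with a default, so files written before this field
        # existed can still be read against the new schema.
        {"name": "plan", "type": "string", "default": "free"},
    ],
})

records = [{"id": 1, "email": "a@example.com", "plan": "pro"},
           {"id": 2, "email": "b@example.com", "plan": "free"}]

with open("users.avro", "wb") as out:
    writer(out, schema, records)

with open("users.avro", "rb") as inp:
    for rec in reader(inp):
        print(rec)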

Apache Parquet files

Just because we are listing Parquet files last, don't assume they are an afterthought. Parquet is an immensely popular format to use in combination with Amazon Athena.

Apache Parquet is another popular open-source file format, with an efficient and performant design. It stores file contents in a flat columnar storage format. Contrast this storage method with the row-based approach used by comma- and tab-delimited files such as CSV and TSV.

Parquet is powered by an elegant record shredding and assembly algorithm that is more efficient than simply flattening nested namespaces. Apache Parquet is well suited to operating on complex data at scale because it uses efficient data compression. This storage method is ideal for queries that need to read only a few columns from a table with many columns: Parquet can locate and scan just those columns, significantly reducing the amount of data that must be retrieved.

In general, columnar storage formats such as Apache Parquet deliver higher efficiency than row-based formats such as CSV. When performing reads, a columnar storage method efficiently skips over non-relevant columns and rows, so aggregation queries take less time than they would against row-oriented storage. This results in lower costs and higher performance for data access.

Apache Parquet supports complex nested data structures. Parquet files are ideal for queries retrieving large amounts of data and can handle files that contain gigabytes of data without much difficulty.

Apache Parquet is built to support a variety of encoding and compression algorithms. Parquet is particularly well suited to situations where a column's values share the same data type, which makes encoding, accessing, and scanning files quite efficient. Apache Parquet works with various compression codecs, enabling files to be compressed in different ways.
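A brief sketch with pyarrow (the file and column names are illustrative) shows the two properties discussed above: per-file compression and reading only the columns a query needs:

import pyarrow as pa
import pyarrow.parquet as pq

# Write a columnar Parquet file with Snappy compression.
table = pa.table({
    "customer_id": [1, 2, 3, 4],
    "region": ["us-east-1", "us-west-2", "us-east-1", "eu-west-1"],
    "amount": [19.99, 5.00, 42.50, 7.25],
})
pq.write_table(table, "sales.parquet", compression="snappy")

# Read back only the columns we care about; the others are never scanned.
subset = pq.read_table("sales.parquet", columns=["customer_id", "amount"])
print(subset.to_pydict())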

In addition to Amazon Athena, Apache Parquet works with serverless technologies such as Google BigQuery, Google Dataproc, and Amazon Redshift Spectrum.
