Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Actionable Insights with Amazon QuickSight
Actionable Insights with Amazon QuickSight

Actionable Insights with Amazon QuickSight: Develop stunning data visualizations and machine learning-driven insights with Amazon QuickSight

eBook
€8.99 €29.99
Paperback
€36.99
Subscription
Free Trial
Renews at €18.99p/m

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Table of content icon View table of contents Preview book icon Preview Book

Actionable Insights with Amazon QuickSight

Chapter 1: Introducing the AWS Analytics Ecosystem

As data increases in both volume and variety, organizations from all verticals are adopting cloud analytics services for their data analytics. AWS offers a number of analytics services covering data lakes, data warehousing, big data processing, extract, transform, load (ETL), and data visualization. In this chapter, we will introduce the AWS analytics ecosystem. Some of the services we discuss here will be mentioned again later in the book.

First, we will map the AWS services into categories. Then, we will discuss how Amazon QuickSight fits into the wider AWS analytics ecosystem. We will look more closely at a modern Lake House architecture and we will discuss its benefits and its components. Finally, we will provide a step-by-step guide to set up a data Lake House architecture on AWS and load and query a demo data sample. Some of this information may already be familiar to you, but let's go back over the basics.

In this chapter, we will cover the following topics:

  • Discovering the AWS analytics ecosystem
  • Exploring the data Lake House architecture on AWS
  • Creating a basic Lake House architecture

Technical requirements

To follow along with this chapter, you will need the following pre-requisites:

  • An AWS account with console access
  • AWS CLI access

The code sample for this chapter can be accessed on the GitHub repository for this book at https://github.com/PacktPublishing/Actionable-Insights-with-Amazon-QuickSight/tree/main/chapter_1.

Discovering the AWS analytics ecosystem

AWS provides a large number of analytics services. In addition to that, AWS has a number of partners who specialize in data analytics and offer analytics solutions that run on the AWS infrastructure. Partner solutions are not in the scope of this section, however. This section focuses on the AWS fully managed analytics services. In order to list the services, we will first define the specific categories related to analytics functions. Machine learning and predictive analytics are also out of the scope of this chapter. For every service category, we will then list the AWS services available, and for each service, we will provide a high-level description. Figure 1.1 depicts the commonly used AWS analytics services.

Figure 1.1 – AWS analytics services

Figure 1.1 – AWS analytics services

Business intelligence

More and more organizations aspire to be data-driven and use data to drive their strategic decisions. Business intelligence (BI) tools help organizations to transform data into actionable insights. With the use of BI tools, users can analyze data and then present their findings in reports or dashboards. These reports or dashboards can then be consumed by business users who are interested in getting a picture of the state of the business.

In 2015, AWS launched Amazon QuickSight, a cloud-native BI tool. Since then, AWS has added new features to QuickSight, enriching the standard dashboard functionality with machine learning capabilities and offering embedded dashboard functionality. Amazon QuickSight is the main technology we will be covering in this book. Over the next few chapters, we will start with the basic functionality of Amazon QuickSight, and then we will explore more advanced features. Where possible, we will use practical examples that can be repeated in your own development environment, to give you hands-on experience with Amazon QuickSight.

Data warehousing

Data warehouses are repositories of data; they are important components of the BI process. Data stored in data warehouses is typically structured. Traditionally, data is ingested and centralized into data warehouses from different operational data stores. Data warehouses are optimized to run analytical queries over large amounts of data. The results of analytical queries are usually calculated after an aggregation over multiple rows from one or more tables. BI applications use analytical queries to aggregate data and visualize it. It is a common architectural approach to use a data warehouse to serve data to a BI application.

Back in 2012, AWS launched Amazon Redshift, a cloud-native, fully managed data warehouse service. Today, Redshift is one of the most popular cloud data warehouses with thousands of organizations from different verticals using it to analyze their data. Other popular cloud data warehouses include Snowflake and Google BigQuery. Amazon Redshift integrates with most BI tools and it integrates natively with Amazon QuickSight. We will discuss this topic in more detail in Chapter 3, Preparing Data with Amazon QuickSight, when we look more closely into Amazon QuickSight-supported data sources.

Data lake storage and governance

A data lake is a repository of data where organizations can easily centralize all of their data and apply it in different use cases such as reporting, visualization, big data analytics, and predictive analytics. Data stored in data lakes can be structured or semi-structured. Usually, data is ingested into the data lake in its raw format, and is then transformed and stored back into the data lake for further processing and analysis. A cloud data lake typically uses a cloud object store to store data. AWS introduced Amazon Simple Storage Service (S3) in March 2006, offering developers a highly scalable, reliable, and low-latency data storage infrastructure at very low cost. Amazon S3 can store an unlimited amount of data, a particularly useful feature for data lakes. Organizations have one less thing to worry about because they don't need to think about scaling their storage as the amount of data stored grows.

While scaling data lake storage is something that organizations and CIOs don't need to worry about much anymore, data lake governance needs to be considered carefully. Data lakes do not enforce data schemas or data formats and, without any governance, data lakes can degrade into unusable data repositories, often referred to as data swamps. AWS offers a number of services for data governance.

The AWS Glue Catalog is part of the AWS Glue service. It is a fully managed Apache Hive metastore-compatible data catalog. Big data applications (for example, Apache Spark, Apache Hive, Presto, and so on) use the metadata in the catalog to locate and parse data. The AWS Glue Catalog is a technical metadata repository and can catalog data in Amazon S3, and a number of relational or non-relational data stores including Redshift, Aurora, and DynamoDB, among others.

AWS Lake Formation runs on top of AWS Glue and Amazon S3 and provides a governance layer and access layer for data lakes on Amazon S3. It also provides a set of reusable ETL jobs, called blueprints, that can be used to perform common ETL tasks (for example, loading data from a relational data store into an S3 data lake). Lake Formation allows users to manage access permissions, using a familiar GRANT REVOKE syntax that you might have seen in relational database management systems (RDBMSes).

Amazon Macie is an AWS service for data protection. It provides an inventory of Amazon S3 buckets and it uses machine learning to identify and alert its users about sensitive data, such as personally identifiable information (PII).

Finally, and perhaps most importantly, AWS Identity and Access Management (IAM) is a fundamental AWS service that allows users to assign permissions to principals (for example, users, groups, or roles) and explicitly allow or deny access to AWS resources including data lake locations or tables in the data catalog.

Ad hoc analytics

Ad hoc analytics refers to getting answers from the data on an as-needed basis. Contrary with what happens with scheduled reports, ad hoc querying is initiated by a user when they need to get specific answers from their data. The user typically uses SQL via a workbench type of application or other analytics frameworks (for instance, Apache Spark) using notebook environments or other BI applications. AWS has a number of analytics services that can be used for ad hoc analytics.

Amazon Redshift can be used for ad hoc analysis of data. For ad hoc querying, users will typically connect to Amazon Redshift using a query editor application with the Redshift JDBC/ODBC drivers. Notebook integrations or BI tool integrations are also possible for ad hoc analysis. AWS offers a number of managed notebook environments such as EMR notebooks and SageMaker notebooks. Amazon Redshift also allows its users to query data that is stored outside the data warehouse. Amazon Redshift Spectrum allows Redshift users to query data stored in Amazon S3, eliminating the need to load the data first before querying. Redshift's federated querying capability allows users to query live data in operational data stores such as PostgreSQL and MySQL.

For big data and data lakes, Presto is a popular choice for ad hoc analysis. Presto provides a high-performance parallel SQL query engine. Amazon Athena lets users run Presto queries in a scalable serverless environment. Amazon QuickSight natively supports Amazon Athena. We will talk more about this native integration in Chapter 3, Preparing Data with Amazon QuickSight. Amazon EMR is a fully managed Hadoop cluster, and it comes with a range of applications from the open source big data ecosystem. Presto has two community projects, PrestoSQL and PrestoDB, both of which are part of the Amazon EMR service. Other options included with EMR are Hive on EMR and Spark on EMR.

Extract, transform, load

ETL is a term used to describe a set of processes to extract, transform, and load data usually for analytical purposes. Organizations gather data from different data sources and centralize them in a central data repository. Data from different sources typically has different schemas and different conventions and standards, and therefore it can be challenging to combine them to get the required answers. For that reason, data needs to transformed so that it can work together. For example, cleaning the data, applying certain data quality thresholds, and standardizing to a specific standard (for instance, date and time formats used) are all important tasks to ensure the data is useable. A visual representation of the ETL process is shown in the following figure.

Figure 1.2 – The ETL process

Figure 1.2 – The ETL process

AWS Glue is a fully managed ETL service offered by AWS. When it was first introduced in 2017, Glue ETL offered an Apache Spark environment optimized for ETL. Now, Glue ETL offers a wider range of options:

  • PySpark – Apache Spark using Python
  • Spark with Scala – Apache Spark with Scala
  • Python shell – For smaller ETL jobs that don't need a Spark cluster
  • Glue Studio and Glue Databrew – Visual approach to ETL without the need to write code

Amazon EMR transient clusters, with applications such as Spark or Hive, can be leveraged for ETL workloads. ETL workloads can be bulk or streaming: streaming ETL workloads usually need to be up and running constantly, or at least for as long as the source stream is on; batch ETL workloads don't need to run at all times and they can stop once the data is loaded into the target system. This type of workload fits nicely with the flexibility of the cloud. With the cloud, data architects don't need to think of Hadoop clusters as big monolithic clusters. Instead, users prefer purpose-built transient clusters, optimized and sized to handle specific workloads and data loads.

Now that we've had our overview of the AWS analytics ecosystem, let's learn about data Lake Houses and how they are built.

Exploring the data Lake House architecture on AWS

The data Lake House architecture is a modern data analytics architecture: as the name suggests, it combines the data lake and the data warehouse into a seamless system. This approach extends the traditional data warehouse approach and opens up new possibilities for data analytics. For this reason, it is important to understand this architecture, which can be used as a data backend for Amazon QuickSight or other BI applications. To understand the architecture better, let's first start by understanding the differences between a data lake and data warehouse.

Data lakes versus data warehouses

Data lakes and data warehouses are designed to consume large amounts of data for analytics purposes. Data warehouses are traditional database systems, used by organizations and enterprises for years. Data lakes, on the other side, are relatively young implementations that emerged from the big data and Hadoop ecosystems. Tables stored in data warehouses need to have clearly defined schemas. The schema needs to be defined upfront, before any data is added. This approach is called schema on write, and it ensures that data conforms to a specific structure before being ingested into the data warehouse. However, it can be less flexible, and it may introduce complexity when dealing with evolving schemas. Evolving schemas are an increasingly common scenario because organizations need to capture more and more data points from their customer interactions to drive data-driven decisions.

On the other side, data lakes don't enforce a schema upfront. Instead, applications that have the required permissions can write data to a data lake. Structure and data formats aren't enforced by the data lake: it is a responsibility of the writing application.

Data stored in a data lake has few to no limitations regarding its format: it can be structured, semi-structured, or completely unstructured. For many datasets, a schema can be inferred, either because the data is semi-structured (CSV, JSON, and others), or they follow patterns that can be identified after applying regular expressions and extracting specific columns. In data lakes, the schema is inferred when the data is read by the querying application. This approach is called schema on read, and it gives an organization flexibility regarding the data type stored. However, it also introduces challenges with data complexity and enforcing data quality.

For that reason, it is common that data that lands into the data lake goes through a series of transformations to get to a stage where it is useable. The first stage, often referred to as the raw layer, is where the data first lands, and it is stored as is.

After the data has landed, the first series of transformations is applied and the data is stored at the processed layer. Since the data can be of any format, the types of possible transformations are limitless. To give just some examples, data quality functions can be applied at this stage to remove incomplete rows and standardize the data in line with a specific datetime or time zone format. Other data engineering activities can also be performed at this stage, such as converting data into different file data formats optimized for analytics, or organizing them into folders using specific information (usually temporal) that can be later used as a partition column by the querying application.

Finally, data can then be converted for specific use cases and landed into the target layer. As an example, data can be transformed in a way that is relevant for a specific machine learning algorithm to work with the data. Another use case could be BI applications, such as Amazon QuickSight, where data can be pre-joined or aggregated and therefore reduced from a large dataset into a smaller dataset that is easier to visualize. Additional data engineering can be applied at this stage to optimize for performance.

Figure 1.3 – Data lake layers

Figure 1.3 – Data lake layers

The data warehouse and data lake architectures are now being challenged by a new, hybrid type of storage: the data Lake House.

Lake House architecture on AWS

This section will look more closely at an example Lake House architecture on AWS using AWS managed services. Let's start by defining the key components of the Lake House architecture:

  • Amazon Redshift is the data warehouse service.
  • Amazon S3 is the object store that can be used for cloud data lake storage.
  • AWS Glue is the data lake catalog to store technical metadata.

    Note

    AWS Glue Catalog tables can be stored in Amazon Redshift, providing a unified metadata catalog across both the data warehouse and the S3 data lake.

Amazon Redshift supports functionality that allows it to interact with the data warehouse. Let's look at those features in more detail.

Ability to query the data lake from the data warehouse

Redshift Spectrum is a feature of Redshift that allows you to perform SQL queries against data in the S3 data lake. The queries are triggered directly from the data warehouse, and therefore you don't need to connect to a different environment to submit your queries. You need to define the Spectrum tables as external tables on their data warehouse. The Redshift cluster also needs to have permission to access the data lake S3 location(s). The Redshift cluster will need to be assigned an IAM role, which needs to have access to the desired S3 locations.

Another key characteristic of Redshift Spectrum is that the Spectrum queries are running in the Spectrum nodes that are outside of the Redshift cluster. This effectively extends the Redshift cluster with additional compute capacity when data lake data needs to be queried.

Finally, Spectrum tables and Redshift tables can be combined and joined. Without this feature, you would have to move data and collocate it before joining it.

Ability to load data from the data lake

Redshift can efficiently load data from the S3 data lake. Specifically, Redshift's COPY command can load data in parallel from Amazon S3. You (at a minimum) need to define a table name, the data location (commonly S3), and the authorization to access the data in the source location. When loading multiple files from S3, Redshift parallelizes the loading by allocating each file to a Redshift slice (the unit of processing in Redshift).

Ability to unload data to the data lake

Redshift also comes with the ability to unload data from the data warehouse back to the data lake. Specifically, the UNLOAD command unloads the result of the query onto Amazon S3. You (as a minimum) need to specify the S3 location and the authorization. There are more options, such as defining the file format (using the FORMAT AS option) or applying partitioning (using the PARTITION BY option), and others.

In the following diagram, we see an example data pipeline that is using both a data warehouse and a data lake on AWS. Data is loaded from the operational data stores into the Amazon S3 object store in the raw layer of the data lake. Then, with a set of ETL jobs, the data reaches a stage that can be loaded into the data warehouse for BI purposes. For cost-effectiveness, you might not want to load all the data into the warehouse. Instead, you might want to leave the data in the data lake but have the ability to query the data when needed. This architecture considers the temperature of the data (how frequently the data is accessed) to determine the best storage. Hot data that needs to be accessed frequently is loaded into the data warehouse, while colder data remains in the data lake, a cheaper long-term storage option.

Figure 1.4 – Example data pipeline on AWS

Figure 1.4 – Example data pipeline on AWS

Now that we have had an overview of the Lake House architecture on AWS, let's build a basic data Lake House architecture on AWS.

Creating a basic Lake House architecture

In this section, we will go through a hands-on example to create a basic data Lake House architecture. This tutorial will use the AWS CLI and the AWS console. By the end of this section, we will have spun up a working data lake and a data warehouse environment with demo data loaded.

Important note

The resources for this tutorial might introduce charges to your AWS account. Once you finish with the exercise, make sure you clean up the resources to prevent incurring further charges.

Creating the data lake storage

In this step, we will add the data lake storage. Then we will upload a demo dataset and will discover its schema automatically.

Step 1 – creating the S3 bucket

Let's begin:

  1. If you haven't installed it already, follow the AWS documentation to install and configure the AWS CLI. To complete this tutorial, you will need to use a role that has access to the S3, Glue, Redshift, and IAM services: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html.
  2. First, let's create the S3 bucket as the data lake storage. Your S3 bucket needs to have a globally unique name. For that reason, we should introduce some randomness to it. Let's pick a random set of 10 characters and numbers. For this tutorial, you should choose your own random set of characters for this string; for example, SF482XHS7M.

    We will use the random string in the data lake name, to ensure it is globally unique.

    Let's use my-data-lake-<random string> as the bucket name.

  3. To create a bucket, we can type the following command into the CLI. Replace the following random string with your random string:
    % aws s3api create-bucket --bucket data-lake-xxxxxxxxxx --region us-east-1

    And the response should look like this:

    {
        "Location": "/data-lake-xxxxxxxxxx"
    }

Step 2 – adding data into the data lake

Now let's add some data. For this exercise, we will use a subset of the New York City Taxi and Limousine Commission (TLC) Trip Record Data:

  1. Let's first have a look at the dataset:
    % aws s3 ls 's3://nyc-tlc/trip data/' --no-sign-request

    This command will return all the files in the open S3 location:

    2016-08-11 15:32:21   85733063 fhv_tripdata_2015-01.csv
    2016-08-11 15:33:04   97863482 fhv_tripdata_2015-02.csv
    2016-08-11 15:33:40  102220197 fhv_tripdata_2015-03.csv
    …
    2021-02-26 16:54:00  138989555 yellow_tripdata_2020-11.csv
    2021-02-26 16:54:00  134481400 yellow_tripdata_2020-12.csv 

    We don't need to download all of them. For this tutorial, we will copy only the files for 2020.

  2. We can use the S3 CLI exclude and include parameters to apply a pattern to match the desired filenames. The command to copy is as follows:
    % aws s3 cp "s3://nyc-tlc/trip data/" s3://data-lake-xxxxxxxxxx/yellowtrips/ --recursive --exclude "*" --include "yellow_tripdata_2020*" 
  3. Once completed, we can then verify that the files exist in our environment with the aws s3 ls command, which lists the files under a specific S3 location:
    % aws s3 ls s3://data-lake-xxxxxxxxxx/yellowtrips/
    2021-03-27 16:53:41  593610736 yellow_tripdata_2020-01.csv
    2021-03-27 16:53:41  584190585 yellow_tripdata_2020-02.csv
    2021-03-27 16:53:42  278288608 yellow_tripdata_2020-03.csv
    2021-03-27 16:53:41   21662261 yellow_tripdata_2020-04.csv
    2021-03-27 16:53:43   31641590 yellow_tripdata_2020-05.csv
    2021-03-27 16:53:42   50277193 yellow_tripdata_2020-06.csv
    2021-03-27 16:53:44   73326707 yellow_tripdata_2020-07.csv
    2021-03-27 16:53:46   92411545 yellow_tripdata_2020-08.csv
    2021-03-27 16:53:50  123394595 yellow_tripdata_2020-09.csv
    2021-03-27 16:53:54  154917592 yellow_tripdata_2020-10.csv
    2021-03-27 16:53:57  138989555 yellow_tripdata_2020-11.csv
    2021-03-27 16:53:58  134481400 yellow_tripdata_2020-12.csv

    Note

    You can use data in a shared data lake as part of your data lake without the need to actually copy it across to your data lake.

Step 3 – identifying the schema

The next step is to identify the schema of the dataset. For this purpose, we will use the AWS Glue crawlers. AWS Glue crawlers crawl through the data to detect the schema. If a schema can be determined (remember there is no guarantee that the data has a specific schema) then Glue crawlers will populate the Glue Catalog with the schemas identified after crawling the data. Glue tables always belong to a Glue database. A database in Glue is just a logical repository of tables in the Glue Catalog:

  1. Let's start by creating a database using the create-database command:
    % aws glue create-database --database-input "{\"Name\":\"my-data-lake-db\"}" --region us-east-1
  2. We can verify the successful database creation using the get-databases command:
    % aws glue get-databases --region us-east-1
    {
        "DatabaseList": [
            {
                "Name": "default", 
                "CreateTime": 1553517157.0
            }, 
         
            {
                "Name": "my-data-lake-db", 
                "CreateTime": 1616865129.0
            }
        ]
    }
  3. Before we create the Glue Crawler, we need to create an IAM role that will be assigned to the Crawler and allow it to access the data in the data lake. The crawler doesn't need to write to the data lake location, therefore only the read access permission is needed. To give the required permissions to a role, we need to attach policies that define the permissions. Let's define a policy document that allows read access to our data lake:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::data-lake-xxxxxxxxxx",
                    "arn:aws:s3:::data-lake-xxxxxxxxxx/*"
                ]
            }
            ]
    }

    The preceding policy document allows the policy holder to use the S3 ListBucket and the GetObject API. The crawler will use ListBucket to list the objects in our data lake bucket and getObject to read objects as it crawls data. This policy restricts access to the data lake bucket only.

  4. Now, let's create a file and copy the policy text. Replace the random string in the data lake name with the random string in your environment. I used vim, but you can use any text editor:
    % vim policy
  5. Then, let's create the IAM policy using the create-policy CLI command:
    % aws iam create-policy --policy-name DataLakeReadAccess --policy-document file://policy 

    The preceding command created the policy and we should get a confirmation JSON object back. Note the policy ARN, as we will use it in a later step.

  6. Next, let's create the IAM role that the Glue crawler will assume. First, let's define the role policy document:
    {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Action": "sts:AssumeRole",
                        "Effect": "Allow",
                        "Principal": {
                            "Service": "glue.amazonaws.com"
                        }
                    }
                ]
            }
  7. Then create a file called role-policy and copy in the preceding JSON document:
    % vim role-policy

    This role policy document allows the Glue service to assume the role we will create.

  8. To create the role, we will use the iam create-role CLI command:
    % aws iam create-role --role-name GlueCrawlerRole --assume-role-policy-document file://role-policy

    We should get a confirmation JSON message after running the command.

  9. Capture the role ARN, as it will be used later when defining the crawler.
  10. Then, let's attach the required policies to this role. For this role, we want to allocate two policies: the AWSGlueServiceRole policy (this is managed by AWS) and the DataLakeReadAccess policy we created earlier. To attach policies to the IAM role we will use the iam attach-role-policy command. Let's start with the AWSGlueServiceRole policy:
    % aws iam attach-role-policy --role-name GlueCrawlerRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole 
  11. Then we will attach the DataLakeReadAccess policy. We will need the policy ARN that we captured earlier. The policy ARN should look like the following line:
    arn:aws:iam::<accountid>:policy/DataLakeReadAccess

    And the command should look like the following:

    % aws iam attach-role-policy --role-name GlueCrawlerRole --policy-arn arn:aws:iam::<ACCOUNT-ID>:policy/DataLakeReadAccess
  12. Now, let's create the AWS Glue crawler. For this purpose, we will use the glue create-crawler CLI command. Make sure you replace the role ARN and the data lake location with the values for your environment:
    % aws glue create-crawler --name qs-book-crawler --role arn:aws:iam::xxxxxxxxxxxx:role/GlueCrawlerRole --database-name my-data-lake-db --targets "{\"S3Targets\":[{\"Path\":\"s3://data-lake-xxxxxxxxxx/yellowtrips\"}]}" --region us-east-1
  13. Then, just start the crawler using the glue start-crawler command:
    % aws glue start-crawler --name qs-book-crawler --region us-east-1

    After 1-2 minutes, the Glue crawler should populate the database.

  14. We can confirm this by calling the glue get-tables cli command:
    % aws glue get-tables --database-name my-data-lake-db
  15. You can view the Catalog from the AWS Console. Log in to the AWS Console and navigate to AWS Glue.
  16. Then on the left-hand side menu, under Data Catalog, choose Databases and then find my-data-lake-db. Then click on View tables under my-data-lake-db. It should look like the following screenshot:
Figure 1.5 – Glue console

Figure 1.5 – Glue console

Tip

You can click the checkbox to select the table and then, under Action, you can choose Preview Data. This will open the Amazon Athena console and run an Athena query that returns 10 values from the table.

Step 4 – creating the data warehouse

Let's create our data warehouse next.

To create the data warehouse, we will use the redshift create-cluster CLI command, or you can use the AWS Console:

%aws redshift create-cluster --node-type dc2.large --number-of-nodes 2 --master-username admin --master-user-password R3dsh1ft --cluster-identifier mycluster --region us-east-1

This command should give a response with the cluster metadata. After a few minutes, our cluster will be up and running.

Note

This command will create a Redshift cluster with a public IP address. This is something that should be avoided in real-world scenarios. The instructions provided are oversimplified for the purposes of this tutorial as this book is not focused on Amazon Redshift.

Step 5 – loading the data into the data warehouse

First, let's create an IAM role that we will assign to the Redshift cluster. We will use this role when using the Redshift Spectrum feature to query data in S3. We want the cluster to be able to write and read to our S3 location. We also want the cluster to be able to have read access to the Glue Catalog:

  1. Similarly to what we did earlier, we will create the following role policy document to allow the role to be assumed by the Redshift service:
    {
                "Version": "2012-10-17", 
                "Statement": [
                    {
                        "Action": "sts:AssumeRole", 
                        "Effect": "Allow", 
                        "Principal": {
                            "Service": "redshift.amazonaws.com"
                        }
                    }
                ]
            }
  2. Then copy this JSON object into a policy document:
    % vim role-policy-redshift
  3. Now, let's create the role using the iam create-role command:
    % aws iam create-role --role-name RedshiftSpectrumRole --assume-role-policy-document file://role-policy-redshift

    Note the role ARN, as we will use it later to attach it to the cluster.

  4. Next, we need to give the desired permissions by attaching the correct policies. This time, for simplicity, we will just attach two AWS managed policies. These policies are overly permissive, and normally we would attach policies with narrower permissions, as we did for the Glue crawlers in Step 3. Let's attach AWSFullS3Access and AWSFullGlueAccess:
    % aws iam attach-role-policy --role-name RedshiftSpectrumRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess 
    % aws iam attach-role-policy --role-name RedshiftSpectrumRole --policy-arn arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess
  5. Next, we will attach this role to our cluster using the redshift modify-cluster-iam-roles CLI command. Note that you need to update the role ARN with the value from your environment:
    % aws redshift modify-cluster-iam-roles --cluster-identifier mycluster --add-iam-roles arn:aws:iam::<ACCOUNT-ID>:role/RedshiftSpectrumRole --region us-east-1
  6. The cluster change will take a few minutes to be applied. After the change is applied, the cluster will be ready to fetch data from the S3 data lake. To connect to the cluster, we will use the built-in query editor found in the AWS Management Console. To find the editor, navigate to the Redshift console, and see the left-hand side menu. The editor will need to establish a connection. Make sure you select the cluster we created earlier, and type dev as the database name and admin as the username.

    Note

    We didn't set a database name earlier. Redshift uses dev as the default value.

  7. In the editor page, we will need to create a table to store the data. Let's name the table yellowtrips_3mo, as we will only store 3 months' worth of data:
    create table yellowtrips_3mo
    (vendorid varchar(10),
    tpep_pickup_datetime datetime,
    tpep_dropoff_datetime datetime,
    passenger_count int,
    trip_distance float,
    ratecodeid varchar(10),
    store_and_fwd_flag char(1),
    pulocationid varchar(10),
    dolocationid varchar(10),
    payment_type varchar(10),
    fare_amount float,
    extra float,
    mta_tax float,
    tip_amount float,
    tolls_amount float,
    improvement_surcharge float,
    total_amount float,
    congestion_surcharge float);
  8. Then, let's copy 3 months' worth of data into the data warehouse. Let's use the COPY command, as follows:
    copy yellowtrips_3mo from 
    's3://data-lake-afo59dkg84/yellowtrips/yellow_tripdata_2020-10.csv'
    iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/ RedshiftSpectrumRole' FORMAT AS CSV dateformat 'auto' ignoreheader 1;
    copy yellowtrips_3mo from 
    's3://data-lake-afo59dkg84/yellowtrips/yellow_tripdata_2020-11.csv'
    iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/ RedshiftSpectrumRole' FORMAT AS CSV dateformat 'auto' ignoreheader 1;
    copy yellowtrips_3mo from 
    's3://data-lake-afo59dkg84/yellowtrips/yellow_tripdata_2020-12.csv'
    iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/ RedshiftSpectrumRole' FORMAT AS CSV dateformat 'auto' ignoreheader 1;
  9. At this stage, we have a data lake with 12 months' worth of data and a data warehouse that contains only the most recent data (3 months). One of the characteristics of the Lake House architecture is that it allows its users to query the data lake from the data warehouse. Feel free to query the data and start getting an understanding of the dataset. Let's create the external schema so that we can enable the Spectrum feature. Use the following command in your Redshift editor. Replace the role ARN with the values from your environment:
    create external schema spectrum_schema from data catalog 
    database 'my-data-lake-db' 
    iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/RedshiftSpectrumRole'
    create external database if not exists;
  10. Let's just compare the size of the two tables using a simple count(*) query:
    select count(*) from public.yellowtrips_3mo;
    select count(*) from spectrum_schema.yellowtrips;

    The first query will run against the recent data in the data warehouse. The second will run against the first query using the Spectrum nodes using the data in the data lake. As expected, the number of records in the data lake should be much higher than the number of records in the data warehouse. Specifically, the query result was 24,648,499 for the year 2020 and 4,652,013 records for the last 3 months of 2020.

    Note

    The Spectrum queries use the Spectrum nodes and are charged separately from the Redshift cluster. Every query incurs an added cost based on the data it needs to scan. Refer to the AWS pricing for details.

Feel free to experiment with the data and trigger a few queries to understand the dataset. When you finish with the Redshift cluster, you can pause the cluster so that you stop the on-demand billing. Once the cluster is paused you will only pay for the cluster storage.

Summary

Congratulations, you have reached the end of the first chapter! By now, you should have a good understanding of the AWS analytics ecosystem and its data lake and data warehousing options. In this chapter, we discussed in detail the key differences between data warehouses and data lakes. We also discussed the Lake House architecture on AWS, and we looked at its main components in more detail. Finally, during the step-by-step section in this chapter, you had a chance to create a data lake and a data warehouse from scratch, and you loaded an open dataset for further analysis later on. We also defined Spectrum tables and queried the data lake directly for the data warehouse.

In the next chapter, we will discuss the basic concepts of Amazon QuickSight, understand its main benefits, and learn how to set up a QuickSight account.

Questions

  1. What is the difference between data lakes and data warehouses?
  2. What is schema on read and what is schema on write?
  3. How can we identify the schema of new data in a data lake on AWS?
  4. Which AWS storage service is ideal for a data lake on AWS?
  5. What data is better served from the data lake in the Lake House architecture?
  6. How do Redshift Spectrum tables differ from Redshift tables?

Further reading

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Explore Amazon QuickSight, manage data sources, and build and share dashboards
  • Learn best practices from an AWS certified big data solutions architect
  • Manage and monitor dashboards using the QuickSight API and other AWS services such as Amazon CloudTrail

Description

Amazon Quicksight is an exciting new visualization that rivals PowerBI and Tableau, bringing several exciting features to the table – but sadly, there aren’t many resources out there that can help you learn the ropes. This book seeks to remedy that with the help of an AWS-certified expert who will help you leverage its full capabilities. After learning QuickSight’s fundamental concepts and how to configure data sources, you’ll be introduced to the main analysis-building functionality of QuickSight to develop visuals and dashboards, and explore how to develop and share interactive dashboards with parameters and on-screen controls. You’ll dive into advanced filtering options with URL actions before learning how to set up alerts and scheduled reports. Next, you’ll familiarize yourself with the types of insights before getting to grips with adding ML insights such as forecasting capabilities, analyzing time series data, adding narratives, and outlier detection to your dashboards. You’ll also explore patterns to automate operations and look closer into the API actions that allow us to control settings. Finally, you’ll learn advanced topics such as embedded dashboards and multitenancy. By the end of this book, you’ll be well-versed with QuickSight’s BI and analytics functionalities that will help you create BI apps with ML capabilities.

Who is this book for?

This book is for business intelligence (BI) developers and data analysts who are looking to create interactive dashboards using data from Lake House on AWS with Amazon QuickSight. It will also be useful for anyone who wants to learn Amazon QuickSight in depth using practical, up-to-date examples. You will need to be familiar with general data visualization concepts before you get started with this book, however, no prior experience with Amazon QuickSight is required.

What you will learn

  • Understand the wider AWS analytics ecosystem and how QuickSight fits within it
  • Set up and configure data sources with Amazon QuickSight
  • Include custom controls and add interactivity to your BI application using parameters
  • Add ML insights such as forecasting, anomaly detection, and narratives
  • Explore patterns to automate operations using QuickSight APIs
  • Create interactive dashboards and storytelling with Amazon QuickSight
  • Design an embedded multi-tenant analytics architecture
  • Focus on data permissions and how to manage Amazon QuickSight operations
Estimated delivery fee Deliver to Slovakia

Premium delivery 7 - 10 business days

€25.95
(Includes tracking information)

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Jan 28, 2022
Length: 242 pages
Edition : 1st
Language : English
ISBN-13 : 9781801079297
Vendor :
Amazon
Category :
Languages :

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
OR
Modal Close icon
Payment Processing...
tick Completed

Shipping Address

Billing Address

Shipping Methods
Estimated delivery fee Deliver to Slovakia

Premium delivery 7 - 10 business days

€25.95
(Includes tracking information)

Product Details

Publication date : Jan 28, 2022
Length: 242 pages
Edition : 1st
Language : English
ISBN-13 : 9781801079297
Vendor :
Amazon
Category :
Languages :

Packt Subscriptions

See our plans and pricing
Modal Close icon
€18.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
€189.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just €5 each
Feature tick icon Exclusive print discounts
€264.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just €5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total 120.97
Serverless Analytics with Amazon Athena
€41.99
Machine Learning with Amazon SageMaker Cookbook
€41.99
Actionable Insights with Amazon QuickSight
€36.99
Total 120.97 Stars icon
Banner background image

Table of Contents

14 Chapters
Section 1: Introduction to Amazon QuickSight and the AWS Analytics Ecosystem Chevron down icon Chevron up icon
Chapter 1: Introducing the AWS Analytics Ecosystem Chevron down icon Chevron up icon
Chapter 2: Introduction to Amazon QuickSight Chevron down icon Chevron up icon
Chapter 3: Preparing Data with Amazon QuickSight Chevron down icon Chevron up icon
Chapter 4: Developing Visuals and Dashboards Chevron down icon Chevron up icon
Section 2: Advanced Dashboarding and Insights Chevron down icon Chevron up icon
Chapter 5: Building Interactive Dashboards Chevron down icon Chevron up icon
Chapter 6: Working with ML Capabilities and Insights Chevron down icon Chevron up icon
Chapter 7: Understanding Embedded Analytics Chevron down icon Chevron up icon
Section 3: Advanced Topics and Management Chevron down icon Chevron up icon
Chapter 8: Understanding the QuickSight API Chevron down icon Chevron up icon
Chapter 9: Managing QuickSight Permissions and Usage Chevron down icon Chevron up icon
Chapter 10: Multitenancy in Amazon QuickSight Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Full star icon Full star icon Half star icon 4.8
(5 Ratings)
5 star 80%
4 star 20%
3 star 0%
2 star 0%
1 star 0%
Apostolos Grevenitis Feb 01, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Pros:- Clear and concise messaging.- Builds knowledge block-by-block without rushing into concepts.- Plenty of hands-on examples and labs to supplement theoretical concepts (best way to learn is to get your hands dirty imo).- Provides holistic view of the AWS Analytics ecosystem + relevant use cases and makes it easy to understand where QuickSight falls and how it integrates with the rest of the AWS Analytics services.- Caters to AWS novices and more advanced users alike.Cons:- Black & white charts and images take from the paperback reader experience, however they are available through the publisher's website.
Amazon Verified review Amazon
Poonam Nair Jan 31, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
True to this books title, the book helped a novice like myself gain more insight in understanding QuickSight and to be able to see its use & importance as a comprehensive guide that aids with building detailed and specific data analytics solutions. This book is definitely a thorough guide that will aid me with future projects at work.
Amazon Verified review Amazon
MuahAmazon Customer Dec 17, 2021
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Actionable Insights with Amazon QuickSight is truly a workbook that takes you through the entire life cycle management of data analytics with a QuickSight lens. If you are looking for a comprehensive study of QuickSight, this text is your one-stop text.This book is NOT about how you create visualizations (although there are lots of visualizations!). It provides answers to the "WHYs" if going deep is your cup of tea; and provides prescriptions about the "WHENs".If you don't have any analytics experience, no problem. I was able to learn about data lakes and other AWS services that support analytics through many hands-on examples which I was able to immediately leverage in my work.
Amazon Verified review Amazon
Abhi Jan 28, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This book by Manos Samatas is an excellent resource to gain knowledge on Amazon QuickSight. It is quite extensive and covers a wide range of features that are available, including several advanced BI topics for enterprise scale implementations. A nice read and useful for everyone who are in BI analytics and want to understand power of Amazon QuickSight.
Amazon Verified review Amazon
P. Jan 31, 2022
Full star icon Full star icon Full star icon Full star icon Empty star icon 4
Actionable Insights with Amazon Quicksight is an A to Z guide for Quicksight tool. Highly recommended for developers and analysts.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order? Chevron down icon Chevron up icon

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact [email protected] with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at [email protected] using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except for the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or material defect in book), Packt Publishing will not accept returns.

What is your returns and refunds policy? Chevron down icon Chevron up icon

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on [email protected] with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on [email protected] within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on [email protected] who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items.(damaged, defective or incorrect)
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged, with book material defect, contact our Customer Relation Team on [email protected] within 14 days of receipt of the book with appropriate evidence of damage and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner which is on a print-on-demand basis.

What tax is charged? Chevron down icon Chevron up icon

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use? Chevron down icon Chevron up icon

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela