Data Wrangling on AWS: Clean and organize complex data for analysis


Getting Started with Data Wrangling

In the introductory section of this book, we listed use cases showing how organizations use data to bring value to customers. Beyond those, organizations collect a lot of other data as well, such as financial data about customers that they can share with stakeholders, log data for security and system health checks, and customer data that is required for use cases such as Customer 360.

We talked about all these use cases and how collecting data from different data sources is required to solve them. However, between collecting the data and solving these business use cases lies one very important step: cleaning the data. That is where data wrangling comes into the picture.

In this chapter, we are going to learn the basics of data wrangling and cover the following topics:

  • Introducing data wrangling
  • The steps involved in data wrangling
  • Best practices for data wrangling
  • Options available within Amazon Web Services (AWS) to perform data wrangling

Introducing data wrangling

For organizations to become data-driven, whether to provide value to customers or to make more informed business decisions, they need to collect a lot of data from different data sources, such as clickstreams, log data, transactional systems, and flat files, and store it as raw data in different data stores, such as data lakes, databases, and data warehouses. Once this data is stored, it needs to be cleansed, transformed, organized, and joined across data sources to provide more meaningful information to downstream applications, such as machine learning models that provide product recommendations or monitor traffic conditions. Alternatively, it can be used by business or data analysts to extract meaningful business information:

Figure 1.1: Data pipeline

The 80-20 rule of data analysis

When organizations collect data from different data sources, it is not of much use initially. It is estimated that data scientists spend about 80% of their time cleaning data, which means that only 20% of their time is left for analyzing the data and creating insights:

Figure 1.2: Work distribution of a data scientist

Now that we understand the basic concept of data wrangling, we’ll learn why it is essential, and the various benefits we get from it.

Advantages of data wrangling

If we go back to the analogy of oil, when it is first extracted, it is in the form of crude oil, which is not of much use. To make it useful, it has to go through a refinery, where the crude oil is put in a distillation unit. In this distillation process, liquids and vapors are separated into petroleum components called fractions according to their boiling points. Heavy fractions settle at the bottom while light fractions rise to the top, as seen here:

Figure 1.3: Crude oil processing

The following figure showcases how oil processing correlates to the data wrangling process:

Figure 1.4: The data wrangling process

Data wrangling brings many advantages:

  • Enhanced data quality: Data wrangling helps improve the overall quality of the data. It involves identifying and handling missing values, outliers, inconsistencies, and errors. By addressing these issues, data wrangling ensures that the data used for analysis is accurate and reliable, leading to more robust and trustworthy results.
  • Improved data consistency: Raw data often comes from various sources or in different formats, resulting in inconsistencies in naming conventions, units of measurement, or data structure. Data wrangling allows you to standardize and harmonize the data, ensuring consistency across the dataset. Consistent data enables easier integration and comparison of information, facilitating effective analysis and interpretation.
  • Increased data completeness: Incomplete data can pose challenges during analysis and modeling. Data wrangling methods allow you to handle missing data by applying techniques such as imputation, where missing values are estimated or filled in based on existing information (a short sketch follows this list). By dealing with missing data appropriately, data wrangling helps ensure a more complete dataset, reducing potential biases and improving the accuracy of analyses.
  • Facilitates data integration: Organizations often have data spread across multiple systems and sources, making integration a complex task. Data wrangling helps in merging and integrating data from various sources, allowing analysts to work with a unified dataset. This integration facilitates a holistic view of the data, enabling comprehensive analyses and insights that might not be possible when working with fragmented data.
  • Streamlined data transformation: Data wrangling provides the tools and techniques to transform raw data into a format suitable for analysis. This transformation includes tasks such as data normalization, aggregation, filtering, and reformatting. By streamlining these processes, data wrangling simplifies the data preparation stage, saving time and effort for analysts and enabling them to focus more on the actual analysis and on interpreting the results.
  • Enables effective feature engineering: Feature engineering involves creating new derived variables or transforming existing variables to improve the performance of machine learning models. Data wrangling provides a foundation for feature engineering by preparing the data in a way that allows for meaningful transformations. By performing tasks such as scaling, encoding categorical variables, or creating interaction terms, data wrangling helps derive informative features that enhance the predictive power of models.
  • Supports data exploration and visualization: Data wrangling often involves exploratory data analysis (EDA), where analysts gain insights and understand patterns in the data before formal modeling. By cleaning and preparing the data, data wrangling enables effective data exploration, helping analysts uncover relationships, identify trends, and visualize the data using charts, graphs, or other visual representations. These exploratory steps are crucial for forming hypotheses, making data-driven decisions, and communicating insights effectively.
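
To make the data completeness and feature engineering points above concrete, here is a minimal pandas sketch of median/mode imputation followed by one-hot encoding; the dataset and column names are made up purely for illustration:

import pandas as pd

# Hypothetical customer data with gaps and a categorical column
df = pd.DataFrame({
    "age": [34, None, 45, 29, None],
    "region": ["south", "north", None, "south", "west"],
    "spend": [120.0, 80.5, None, 60.0, 95.0],
})

# Impute numeric gaps with the column median and categorical gaps with the mode
df["age"] = df["age"].fillna(df["age"].median())
df["spend"] = df["spend"].fillna(df["spend"].median())
df["region"] = df["region"].fillna(df["region"].mode()[0])

# One-hot encode the categorical column so it can be used as a model feature
df = pd.get_dummies(df, columns=["region"], prefix="region")
print(df)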

Now that we have learned about the advantages of data wrangling, let’s understand the steps involved in the data wrangling process.

The steps involved in data wrangling

Similar to crude oil, raw data has to go through multiple data wrangling steps to become meaningful. In this section, we are going to learn the six-step process involved in data wrangling:

  1. Data discovery
  2. Data structuring
  3. Data cleaning
  4. Data enrichment
  5. Data validation
  6. Data publishing

Before we begin, it’s important to understand that these steps do not always need to be followed sequentially, and in some cases, you may skip some of them.

Also, keep in mind that these steps are iterative and differ for different personas, such as data analysts, data scientists, and data engineers.

As an example, data discovery for data engineers may vary from what data discovery means for a data analyst or data scientist:

Figure 1.5: The steps of the data-wrangling process

Let’s start learning about these steps in detail.

Data discovery

The first step of the data wrangling process is data discovery, and it is one of the most important. In data discovery, we familiarize ourselves with the kind of raw data we have, the use case we are looking to solve with that data, the relationships that exist within the raw data, the format the data will arrive in (such as CSV or Parquet), the tools available for storing, transforming, and querying this data, and how we wish to organize it, such as by folder structure, file size, and partitions, to make it easy to access.
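
In practice, a first pass at data discovery often means profiling a sample of the raw data. The following is a minimal sketch using pandas, assuming a local CSV sample of the data; the file name and its columns are hypothetical:

import pandas as pd

# Load a small sample of the raw extract (the file name is hypothetical)
sample = pd.read_csv("sales_sample.csv", nrows=10000)

# Get a feel for the size, columns, and data types
print(sample.shape)
print(sample.dtypes)
print(sample.head())

# Check how complete each column is before deciding what to keep
print(sample.isna().mean().sort_values(ascending=False))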

Let’s understand this by looking at an example.

In this example, we will try to understand how data discovery varies based on the persona. Let’s assume we have two colleagues, James and Jean. James is a data engineer while Jean is a data analyst, and they both work for a car-selling company.

Jean is new to the organization, and she is required to analyze car sales numbers for Southern California. She has reached out to James and asked him for data from the sales table in the production system.

Here is the data discovery process for Jean (a data analyst):

  1. Jean has to identify the data she needs to generate the sales report (for example, sales transaction data, vehicle details data, customer data, and so on).
  2. Jean has to find where the sales data resides (a database, file share, CRM, and so on).
  3. Jean has to identify how much data she needs (from the last 12 months, the last month, and so on).
  4. Jean has to identify what kind of tool she is going to use (Amazon QuickSight, Power BI, and so on).
  5. Jean has to identify the format she needs the data to be in so that it works with the tools she has.
  6. Jean has to identify where she is looking to store this data – in a data lake (Amazon S3), on her desktop, in a file share, in a sandbox environment, and so on.

Here is the data discovery process for James (a data engineer):

  1. Which system holds the requested data? For example, Amazon RDS, Salesforce CRM, a production SFTP location, and so on.
  2. How will the data be extracted? For example, using services such as AWS DMS or AWS Glue, or by writing a script.
  3. What will the schedule look like? Daily, weekly, or monthly?
  4. What will the file format be? For example, CSV, Parquet, ORC, and so on.
  5. How will the data be stored in the target data store?

Data structuring

To support existing and future business use cases and serve its customers better, an organization must collect unprecedented amounts of data from different data sources and in different varieties. In modern data architecture, the data is most often stored in data lakes, since a data lake allows you to store all kinds of data files, whether structured data, unstructured data, images, audio, video, or something else, and the data will come in different shapes and sizes in its raw form. When data is in its raw form, it lacks the definitive structure required for it to be stored in databases or data warehouses or used to build analytics or machine learning models, and it is not optimized for cost and performance.

In addition, when you work with streaming data such as clickstreams and log analytics, not all the data fields (columns) are used in analytics.

At this stage of data wrangling, we try to optimize the raw dataset for cost and performance benefits by performing partitioning and converting file types (for example, CSV into Parquet).

Once again, let’s consider our friends James and Jean to understand this.

For Jean, the data analyst, data structuring means deciding whether she will run direct queries or store the data in the in-memory store of a BI tool (in the case of Amazon QuickSight, this is called the SPICE layer), which provides faster access to data.

For James, the data engineer, when he is extracting data from a production system and looking to store it in a data lake such as Amazon S3, he must consider what the file format will look like. He can partition it by geographical regions, such as county, state, or region, or by date – for example, year=YYYY, month=MM, and day=DD.
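
As a rough illustration of that structuring step, here is a minimal sketch using the AWS SDK for pandas (awswrangler) to convert a raw CSV extract into Parquet partitioned by date. The file, bucket, and column names are assumptions for illustration, and AWS credentials are expected to be configured:

import pandas as pd
import awswrangler as wr

# Read the raw CSV extract (the file and column names are hypothetical)
df = pd.read_csv("sales_extract.csv", parse_dates=["sale_date"])

# Derive the partition columns from the sale date
df["year"] = df["sale_date"].dt.year
df["month"] = df["sale_date"].dt.month
df["day"] = df["sale_date"].dt.day

# Write the data to the data lake as Parquet, partitioned by date
wr.s3.to_parquet(
    df=df,
    path="s3://example-data-lake/curated/sales/",
    dataset=True,
    partition_cols=["year", "month", "day"],
)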

Data cleaning

The next step of the data wrangling process is data cleaning. The previous two steps give us an idea of how the data looks and how it is stored. In the data cleaning step, we start working with the raw data to make it meaningful for future use cases.

In the data cleaning step, we try to make data meaningful by doing the following:

  • Removing unwanted columns and duplicate values, and filling in null values to improve the data’s readiness
  • Performing data validation, such as identifying missing values in mandatory columns (First Name, Last Name, SSN, Phone No., and so on)
  • Validating or fixing data types to optimize storage and performance
  • Identifying and fixing outliers
  • Removing garbage data or unwanted values, such as special characters

Both James and Jean can perform similar data cleaning tasks (a short sketch follows below); however, their scale might vary. For James, these tasks must be done on the entire dataset, while Jean may only have to perform them on the data for Southern California. The granularity might vary as well: for James, it may be limited to regions such as Southern California and Northern California, while for Jean, it might go down to the city level or even the ZIP code.
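
A minimal pandas sketch of those cleaning tasks might look like the following; the file, column names, and rules are hypothetical:

import pandas as pd

# Hypothetical raw extract; column names are for illustration only
df = pd.read_csv("sales_extract.csv")

# Drop columns that are not needed downstream and remove duplicate rows
df = df.drop(columns=["internal_notes"], errors="ignore").drop_duplicates()

# Identify records missing mandatory fields such as names or phone numbers
mandatory = ["first_name", "last_name", "phone_no"]
missing_mandatory = df[df[mandatory].isna().any(axis=1)]

# Fix data types for better storage and query performance
df["sale_amount"] = pd.to_numeric(df["sale_amount"], errors="coerce")

# Remove garbage characters from a free-text field
df["city"] = df["city"].str.replace(r"[^A-Za-z\s]", "", regex=True).str.strip()

# Flag values far outside the interquartile range as outliers
q1, q3 = df["sale_amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["sale_amount"] < q1 - 1.5 * iqr) | (df["sale_amount"] > q3 + 1.5 * iqr)]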

Data enrichment

Up until the data cleaning step, we were primarily working on single data sources and making them meaningful for future use. However, in the real world, data is most often fragmented and stored in multiple disparate data stores, and to support use cases such as personalization or recommendation solutions, Customer 360s, or log forensics, we need to join the data from these different data stores.

For example, to build a Customer 360 solution, you need data from Customer Relationship Management (CRM) systems, clickstream logs, relational databases, and so on.

So, in the data enrichment step, we build the process that will enhance the raw data with relevant data obtained from different sources.
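
As a minimal sketch of enrichment, the following joins hypothetical CRM records with clickstream events on a shared customer key using pandas; all names and values are made up for illustration:

import pandas as pd

# Hypothetical extracts from two disparate data stores
crm = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "name": ["Ana", "Ben", "Chen"],
    "segment": ["gold", "silver", "gold"],
})
clicks = pd.DataFrame({
    "customer_id": [101, 101, 103],
    "page": ["/suv", "/sedan", "/suv"],
    "event_time": pd.to_datetime(["2023-07-01", "2023-07-02", "2023-07-02"]),
})

# Enrich the clickstream events with customer attributes from the CRM
enriched = clicks.merge(crm, on="customer_id", how="left")
print(enriched)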

Data validation

There is a very interesting term in computer science called garbage in, garbage out (GIGO). GIGO is the concept that flawed or defective (garbage) input data produces defective output.

In other words, the quality of the output is determined by the quality of the input. So, if we provide bad data as input, we will get inaccurate results.

In the data validation step, we address this issue by performing various data quality checks:

  • Validating data accuracy against business rules
  • Validating data security
  • Validating result consistency across the entire dataset
  • Validating data quality by running checks such as the following (a short sketch follows at the end of this section):
    • Number of records
    • Duplicate values
    • Missing values
    • Outliers
    • Distinct values
    • Unique values
    • Correlation

There is a lot of overlap between data cleaning and data validation, and the two processes are indeed similar. However, data validation is performed on the resulting dataset, while data cleaning is primarily performed on the raw dataset.
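
A minimal pandas sketch of such quality checks on a wrangled dataset might look like this; the file name, column names, and thresholds are illustrative assumptions:

import pandas as pd

# Hypothetical wrangled output
df = pd.read_parquet("curated_sales.parquet")

checks = {
    "record_count": len(df),
    "duplicate_rows": int(df.duplicated().sum()),
    "missing_values": int(df.isna().sum().sum()),
    "distinct_customers": int(df["customer_id"].nunique()),
}
print(checks)

# Correlation between numeric columns can reveal suspicious results
print(df.select_dtypes("number").corr())

# Fail fast if a basic expectation is violated
assert checks["duplicate_rows"] == 0, "Duplicate records found in the wrangled dataset"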

Data publishing

After completing all the data wrangling steps, the data is ready to be used for analytics so that it can solve business problems.

So, the final step is to publish the data to the end user with the required access and permission.

In this step, we primarily concentrate on how the data is being exposed to the end user and where the final data gets stored – that is, in a relational database, a data warehouse, curated or user zones in a data lake, or through the Secure File Transfer Protocol (SFTP).

The choice of data storage depends on the tool through which the end user is looking to access the data. For example, if the end user is looking to access data through BI tools such as Amazon QuickSight, Power BI, Informatica, and so on, a relational data store will be an ideal choice. If it is accessed by a data scientist, ideally, it should be stored in an object store.

We will learn about the different kinds of data stores we can use to store raw and wrangled data later in this book.

In this section, we learned about the various steps of the data wrangling process through our friends James and Jean and how these steps may or may not vary based on personas. Now, let’s understand the best practices for data wrangling.

Best practices for data wrangling

There are many ways and tools available to perform data wrangling, and the choice depends on how the data wrangling is performed and by whom. For example, if you are working on real-time use cases such as providing product recommendations or fraud detection, your choice of tool and process for performing data wrangling will be very different compared to when you are looking to build a business intelligence (BI) dashboard to show sales numbers.

Regardless of the kind of use cases you are looking to solve, some standard best practices can be applied in each case that will help make your job easier as a data wrangler.

Identifying the business use case

It’s recommended that you decide which service or tool you are going to use for data wrangling before you write a single line of code, and identifying the business use case is the crucial first step, as it sets the stage for the data wrangling process and makes it easier to choose the services you need. For example, if your business use case is analyzing HR data for a small organization, where you just need to concatenate a few columns, remove a few columns, remove duplicates, remove NULL values, and so on from a small dataset of 10,000 records, and only a few users will access the wrangled data, then you don’t need to invest a lot of money in a fancy data wrangling tool – you can simply use Excel sheets for the work.

However, when you have a business use case such as processing claims data received from different partners, where you need to work with semi-structured files such as JSON or XML to extract only a few fields, such as the claim ID and customer information, and you need to perform complex data wrangling operations such as joins or finding patterns using regex, then you should look at writing scripts or subscribing to an enterprise-grade tool for the work.

Identifying the data source and bringing the right data

After identifying the business use case, it is important to identify which data sources are required to solve it. Identifying these sources will help you choose the services required to bring in the data, the ingestion frequency, and the target storage. For example, if you are looking to build a credit card fraud detection solution, you need to bring in credit card transaction data in real time; even the cleaning and processing of the data should be done in real time, and machine learning inference also needs to run on real-time data.

Similarly, if you are building a sales dashboard, you may need to bring in data from a CRM system such as Salesforce or a transactional datastore such as Oracle, Microsoft SQL Server, and so on.

After identifying the right data sources, it is important to bring in the right data from these data sources as it will help you solve the business use cases and make the data wrangling process easy.

Identifying your audience

When you perform data wrangling, one important aspect is to identify your audience. Knowing your audience will help you identify what kind of data they are looking to consume. For example, marketing teams may have different data wrangling requirements compared to data science teams or business executives.

This will also give you an idea of where you are looking to present the data – for example, a data scientist team may need data in an object store such as Amazon S3, business analysts may need data in flat files such as CSV, BI developers may need data in a transactional data store, and business users may need data in applications.

With that, we have covered the best practices of data wrangling. Next, we will explore the different options that are available within AWS to perform data wrangling.

Options available for data wrangling on AWS

Depending on customer needs, data sources, and team expertise, AWS provides multiple options for data wrangling. In this section, we will cover the most common options that are available with AWS.

AWS Glue DataBrew

Released in 2020, AWS Glue DataBrew is a visual data preparation tool that makes it easy for you to clean and normalize data so that you can prepare it for analytics and machine learning. The visual UI provided by this service allows data analysts with no coding or scripting experience to accomplish all aspects of data wrangling. It comes with a rich set of common pre-built data transformation actions that simplify these data wrangling activities. As with any Software as a Service (SaaS) offering (https://en.wikipedia.org/wiki/Software_as_a_service), customers can start using the web UI without provisioning any servers and pay only for the resources they use.

SageMaker Data Wrangler

Similar to AWS Glue DataBrew, AWS also provides SageMaker Data Wrangler, a web UI-based data wrangling service geared more toward data scientists. If the primary use case is building a machine learning pipeline, SageMaker Data Wrangler should be the preferred choice. It integrates directly with SageMaker Studio, where data prepared using SageMaker Data Wrangler can be fed into a data pipeline to build, train, and deploy machine learning models. It comes with pre-configured data transformations for imputing missing data with means or medians, one-hot encoding, and time series-specific transformations required to prepare data for machine learning.

AWS SDK for pandas

For customers with a strong data integration team with coding and scripting experience, AWS SDK for pandas (https://github.com/aws/aws-sdk-pandas) is a great option. Built on top of other open source projects, it offers abstracted functions for executing typical data wrangling tasks such as loading/unloading data from various databases, data warehouses, and object data stores such as Amazon S3. AWS SDK for pandas simplifies integration with common AWS services such as Athena, Glue, Redshift, Timestream, OpenSearch, Neptune, DynamoDB, and S3. It also supports common databases such as MySQL and SQL Server.
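
For a flavor of the library, here is a minimal sketch that reads a curated dataset from Amazon S3 and runs an ad hoc Athena query; the bucket, database, and table names are assumptions, and AWS credentials are expected to be configured:

import awswrangler as wr

# Read a partitioned Parquet dataset straight from the data lake
df = wr.s3.read_parquet("s3://example-data-lake/curated/sales/", dataset=True)

# Run an Athena query and get the result back as a pandas DataFrame
monthly = wr.athena.read_sql_query(
    "SELECT year, month, SUM(sale_amount) AS total_sales FROM sales GROUP BY year, month",
    database="example_analytics",
)
print(monthly.head())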

Summary

In this chapter, we learned about the basics of data wrangling, why it is important, the steps and best practices of data wrangling, and how the data wrangling steps vary based on persona. We also talked about the different data wrangling options available in AWS.

In the upcoming chapters, we will dive deep into each of these options and learn how to use these services to perform data wrangling.


Key benefits

  • Execute extract, transform, and load (ETL) tasks on data lakes, data warehouses, and databases
  • Implement effective Pandas data operations with AWS data wrangler
  • Integrate pipelines with AWS data services

Description

Data wrangling is the process of cleaning, transforming, and organizing raw, messy, or unstructured data into a structured format. It involves processes such as data cleaning, data integration, data transformation, and data enrichment to ensure that the data is accurate, consistent, and suitable for analysis. Data Wrangling on AWS equips you with the knowledge to realize the full potential of AWS data wrangling tools. First, you’ll be introduced to data wrangling on AWS and familiarized with the data wrangling services available in AWS. You’ll understand how to work with AWS Glue DataBrew, AWS data wrangler, and AWS SageMaker. Next, you’ll discover other AWS services such as Amazon S3, Redshift, Athena, and QuickSight. Additionally, you’ll explore advanced topics such as performing Pandas data operations with AWS data wrangler, optimizing ML data with AWS SageMaker, and building a data warehouse with Glue DataBrew, along with security and monitoring aspects. By the end of this book, you’ll be well-equipped to perform data wrangling using AWS services.

Who is this book for?

This book is for data engineers, data scientists, and business data analysts looking to explore the capabilities, tools, and services of data wrangling on AWS for their ETL tasks. Basic knowledge of Python and Pandas, along with familiarity with AWS tools such as AWS Glue and Amazon Athena, is required to get the most out of this book.

What you will learn

  • Explore how to write simple to complex transformations using AWS data wrangler
  • Use abstracted functions to extract and load data from and into AWS datastores
  • Configure AWS Glue DataBrew for data wrangling
  • Develop data pipelines using AWS data wrangler
  • Integrate AWS security features into Data Wrangler using Identity and Access Management (IAM)
  • Optimize your data with AWS SageMaker

Product Details

Publication date : Jul 31, 2023
Length: 420 pages
Edition : 1st
Language : English
ISBN-13 : 9781801810906





Table of Contents

18 Chapters
Part 1: Unleashing Data Wrangling with AWS
Chapter 1: Getting Started with Data Wrangling
Part 2: Data Wrangling with AWS Tools
Chapter 2: Introduction to AWS Glue DataBrew
Chapter 3: Introducing AWS SDK for pandas
Chapter 4: Introduction to SageMaker Data Wrangler
Part 3: AWS Data Management and Analysis
Chapter 5: Working with Amazon S3
Chapter 6: Working with AWS Glue
Chapter 7: Working with Athena
Chapter 8: Working with QuickSight
Part 4: Advanced Data Manipulation and ML Data Optimization
Chapter 9: Building an End-to-End Data-Wrangling Pipeline with AWS SDK for Pandas
Chapter 10: Data Processing for Machine Learning with SageMaker Data Wrangler
Part 5: Ensuring Data Lake Security and Monitoring
Chapter 11: Data Lake Security and Monitoring
Index
Other Books You May Enjoy

