Amazon SageMaker Best Practices: Proven tips and tricks to build successful machine learning solutions on Amazon SageMaker

Product type Paperback

Published in Sep 2021

Publisher Packt

ISBN-13 9781801070522

Length 348 pages

Edition 1st Edition

Languages

Python

Tools

Amazon SimpleDB

Concepts

Machine Learning

Authors (3):

Randy DeFauw

Shelbee Eigenbrode

Sireesha Muppala

Discussion of data preparation capabilities

In this section, we'll dive into SageMaker's data preparation and feature engineering capabilities. By the end of this section, you should understand when to use SageMaker Ground Truth, Data Wrangler, Processing, Feature Store, and Clarify.

SageMaker Ground Truth

Obtaining labeled data for classification, regression, and other tasks is often the biggest barrier to ML projects. Many companies have a lot of data but have not explicitly labeled it with business attributes such as "anomalous" or "high lifetime value". SageMaker Ground Truth helps you label data systematically by defining a labeling workflow and assigning labeling tasks to a human workforce.

Over time, Ground Truth can learn how to label data automatically, while still sending low-confidence results to humans for review. For advanced datasets such as 3D point clouds, which represent objects as sets of points in three-dimensional space, Ground Truth offers assistive labeling features, such as adding bounding boxes to the middle frames of a sequence once you label the start and end frames. The following diagram shows an example of labels applied to a dataset:

Figure 1.4 – SageMaker Ground Truth showing the labels applied to sentiment reviews

The data is sourced from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences). To counteract individual worker bias or error, a data object can be sent to multiple workers. In this example, we only have one worker, so the confidence score is not used.
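To make the idea of sending one data object to multiple workers concrete, here is a minimal sketch of label consolidation by simple majority vote. It is an illustration only: Ground Truth's own consolidation logic is not described here and may differ, and the function name and sample labels are hypothetical.

```python
from collections import Counter


def consolidate_labels(annotations):
    """Return (label, confidence) for one data object by majority vote.

    `annotations` holds one label per worker. The confidence is the fraction
    of workers who chose the winning label (an assumption of this sketch).
    """
    if not annotations:
        raise ValueError("need at least one annotation")
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(annotations)


# Three hypothetical workers label the same review.
label, confidence = consolidate_labels(["positive", "positive", "negative"])
print(label, round(confidence, 2))

# With a single worker the score is always 1.0, so it carries no information.
print(consolidate_labels(["negative"]))
```

The single-worker case shows why a confidence score is not useful when only one person labels each object.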

Note that you can also use Ground Truth in other phases of the ML life cycle; for example, you may use it to check the labels generated by a production model.

SageMaker Data Wrangler

Data Wrangler helps you understand your data and perform feature engineering. It works with data stored in S3 (optionally accessed via Athena) and in Redshift, and it performs typical visualizations and transformations, such as correlation plots and categorical encoding. You can combine a series of transformations into a data flow and export that flow into an MLOps pipeline. The following screenshot shows an example of Data Wrangler information for a dataset:

Figure 1.5 – Data Wrangler displaying summary table information regarding a dataset

You may also use Data Wrangler in the operations phase of the ML life cycle if you want to analyze the data coming into an ML model for production inference.
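Data Wrangler applies transformations such as categorical encoding and correlation analysis through its visual interface. As a rough sketch of what those two operations compute, the plain Python below one-hot encodes a categorical column and measures its Pearson correlation with a numeric column. The helper names and sample data are hypothetical, and this is not Data Wrangler's implementation.

```python
from math import sqrt


def one_hot(values):
    """Encode a list of category strings as one 0/1 column per category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical sample data: product category and star rating per review.
category = ["Gift Card", "Books", "Gift Card", "Toys"]
rating = [5, 3, 4, 2]

encoded = one_hot(category)
print(encoded["Gift Card"])  # [1, 0, 1, 0]
print(round(pearson(encoded["Gift Card"], rating), 3))
```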

SageMaker Processing

SageMaker Processing jobs help you run data processing and feature engineering tasks on your datasets. By providing your own Docker image containing your code, or using a pre-built Spark or scikit-learn container, you can normalize and transform data to prepare your features. The following diagram shows the logical flow of a SageMaker Processing job:

Figure 1.6 – Conceptual overview of a Spark processing job. Spark jobs are particularly handy for processing larger datasets

You may also use processing jobs to evaluate the performance of ML models during the Model Training phase and to check data and model quality in the Model Operations phase.
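As an illustration of the kind of normalization code a processing job might run, the sketch below standardizes one CSV column to zero mean and unit variance using only the standard library. The function names, column name, and sample data are hypothetical, and the container paths mentioned in the comments are an assumption about how a job would be configured.

```python
import csv
import io
from statistics import mean, pstdev


def standardize_column(rows, column):
    """Return copies of `rows` with `column` rescaled to zero mean, unit variance."""
    values = [float(r[column]) for r in rows]
    mu, sigma = mean(values), pstdev(values)
    scaled = []
    for r, v in zip(rows, values):
        out = dict(r)
        # Guard against a constant column, where sigma would be 0.
        out[column] = (v - mu) / sigma if sigma else 0.0
        scaled.append(out)
    return scaled


def process(src, dst, column):
    """Read CSV from file object `src`, standardize `column`, write CSV to `dst`."""
    reader = csv.DictReader(src)
    rows = standardize_column(list(reader), column)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# In a real job, src and dst would be files under the container directories
# that the job maps to S3 (assumed here to be /opt/ml/processing/input and
# /opt/ml/processing/output). In-memory buffers keep this sketch runnable.
src = io.StringIO("id,amount\n1,10\n2,20\n3,30\n")
dst = io.StringIO()
process(src, dst, "amount")
print(dst.getvalue())
```

Keeping the transformation in a plain function that takes file objects makes it easy to test locally before packaging it into a container.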

SageMaker Feature Store

SageMaker Feature Store helps you organize and share your prepared features. Using a feature store improves quality and saves time by letting you reuse features rather than duplicate complex feature engineering code and computations that have already been done. Feature Store supports both batch and stream storage and retrieval. The following screenshot shows an example of feature group information:

Figure 1.7 – Feature Store showing a feature group with a set of related features

Feature Store also helps during the Model Operations phase, as you can quickly look up complex feature vectors to help obtain real-time predictions.
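The following toy class sketches the idea behind a feature group: an append-only history suited to batch retrieval, plus a latest-value lookup suited to real-time inference. It is an in-memory illustration under our own assumptions, not the SageMaker Feature Store API; the class, method, and feature names are hypothetical.

```python
class ToyFeatureGroup:
    """Minimal in-memory stand-in for a feature group (illustration only)."""

    def __init__(self, record_id_name):
        self.record_id_name = record_id_name
        self.online = {}   # latest feature vector per record ID (fast lookup)
        self.offline = []  # append-only history (batch retrieval for training)

    def put_record(self, record, event_time):
        key = record[self.record_id_name]
        self.offline.append((event_time, dict(record)))
        current = self.online.get(key)
        # Keep only the most recent record for low-latency lookups.
        if current is None or event_time >= current[0]:
            self.online[key] = (event_time, dict(record))

    def get_record(self, key):
        entry = self.online.get(key)
        return entry[1] if entry else None


customers = ToyFeatureGroup("customer_id")
customers.put_record({"customer_id": "c1", "orders_30d": 2}, event_time=1)
customers.put_record({"customer_id": "c1", "orders_30d": 5}, event_time=2)

print(customers.get_record("c1"))  # {'customer_id': 'c1', 'orders_30d': 5}
print(len(customers.offline))      # 2
```

Once a feature such as `orders_30d` is computed and stored, any model can look it up by record ID instead of recomputing it.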

SageMaker Clarify

SageMaker Clarify helps you understand model behavior and calculate bias metrics for your data and model. It checks for imbalance in the dataset, models that give different results based on certain attributes, and bias that appears due to data drift. It can also use leading explainability algorithms such as SHAP to explain individual predictions, which gives you a sense of which features drive model behavior. The following figure shows an example of class imbalance scores for a dataset, where we have many more samples from the Gift Card category than the other categories:

Figure 1.8 – Clarify showing class imbalance scores in a dataset. Class imbalance can lead to biased results in an ML model

Clarify can be used throughout the entire ML life cycle, but consider using it early in the life cycle to detect imbalanced data (datasets that have many examples of one class but few of another).
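To give a feel for what a class imbalance score measures, the sketch below computes a normalized difference between the number of samples outside a category and the number inside it. We assume this normalized-difference definition for illustration; treat the formula and the sample counts as assumptions rather than Clarify's exact output.

```python
def class_imbalance(values, facet_value):
    """CI = (n_a - n_d) / (n_a + n_d).

    n_d counts samples equal to `facet_value`; n_a counts all the others.
    The result lies in [-1, 1]: near 0 is balanced, positive means
    `facet_value` is under-represented, negative means over-represented.
    """
    n_d = sum(1 for v in values if v == facet_value)
    n_a = len(values) - n_d
    return (n_a - n_d) / (n_a + n_d)


# Hypothetical category counts, skewed toward Gift Card.
categories = ["Gift Card"] * 8 + ["Books"] * 1 + ["Toys"] * 1

for name in sorted(set(categories)):
    print(name, class_imbalance(categories, name))
```

In this sample, Gift Card scores below zero because it dominates the dataset, while the two small categories score close to 1.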

Now that we've introduced several SageMaker capabilities for data preparation, let's move on to model-building capabilities.