You're reading from Getting Started with Amazon SageMaker Studio Learn to build end-to-end machine learning projects in the SageMaker machine learning IDE

Product type Paperback

Published in Mar 2022

Publisher Packt

ISBN-13 9781801070157

Length 326 pages

Edition 1st Edition

Languages

Python

Concepts

Machine Learning

Author (1):

Michael Hsieh

View More author details

Table of Contents (16) Chapters

Preface

1. Part 1 – Introduction to Machine Learning on Amazon SageMaker Studio

2. Chapter 1: Machine Learning and Its Life Cycle in the Cloud FREE CHAPTER

3. Chapter 2: Introducing Amazon SageMaker Studio

4. Part 2 – End-to-End Machine Learning Life Cycle with SageMaker Studio

5. Chapter 3: Data Preparation with SageMaker Data Wrangler

6. Chapter 4: Building a Feature Repository with SageMaker Feature Store

7. Chapter 5: Building and Training ML Models with SageMaker Studio IDE

8. Chapter 6: Detecting ML Bias and Explaining Models with SageMaker Clarify

9. Chapter 7: Hosting ML Models in the Cloud: Best Practices

10. Chapter 8: Jumpstarting ML with SageMaker JumpStart and Autopilot

11. Part 3 – The Production and Operation of Machine Learning with SageMaker Studio

12. Chapter 9: Training ML Models at Scale in SageMaker Studio

13. Chapter 10: Monitoring ML Models in Production with SageMaker Model Monitor

14. Chapter 11: Operationalize ML Projects with SageMaker Projects, Pipelines, and Model Registry

15. Other Books You May Enjoy

Exploring AWS essentials for ML

Amazon Web Services (AWS) offers cloud computing resources to developers of all kinds to create applications and solutions for their businesses. AWS manages the technology and infrastructure in a secure environment and a scalable fashion, taking away the undifferentiated heavy lifting of infrastructure management from developers. AWS provides a broad range of services, including ML, artificial intelligence, the internet of things, analytics, and application development tools. These are built on top of the following key areas – compute, storage, databases, and security. Before we start our journey with Amazon SageMaker Studio, which is one of the ML offerings from AWS, it is important to know the core services that are commonly used while developing your ML projects on Amazon SageMaker Studio.

Compute

For ML in the cloud, developers need computational resources in all aspects of the life cycle. Amazon Elastic Compute Cloud (Amazon EC2) is the most fundamental cloud computing environment for developers to process, train, and host ML models. Amazon EC2 provides a wide range of compute instance types for many purposes, such as compute-optimized instances for compute-intensive work, memory-optimized instances for applications that have a large memory footprint, and Graphics Processing Unit (GPU)-accelerated instances for deep learning training.

Amazon SageMaker also offers on-demand compute resources for ML developers to run processing, training, and model hosting. Amazon SageMaker's ML instances build on top of Amazon EC2 instances and equip the instances with a fully managed, optimized versions of popular ML frameworks such as TensorFlow, PyTorch, MXNet, and scikit-learn, which are optimized for Amazon EC2 compute instances. Developers do not need to manage the provisioning and patching of the ML instances, so they can focus on the ML life cycle.

Storage

While conducting an ML project, developers need to be able to access files, store codes, and store artifacts. Reliable storage is crucial to an ML project. AWS provides several types of storage options for ML development. Amazon Simple Storage Service (Amazon S3) and Amazon Elastic File System (Amazon EFS) are the two that are most relevant to the development of ML projects in Amazon SageMaker Studio.

Amazon S3 is an object storage service that allows developers to store any amount of data with high security, availability, and scalability. ML developers can store structured and unstructured data, and ML models with versioning on Amazon S3. Amazon S3 can also be used to build a data lake for analytics and to store backups and archives.

Amazon EFS provides a fully managed, serverless filesystem that allows developers to store and share files across users on the filesystem without any storage provisioning, as the filesystem increases and decreases its capacity automatically when you add or delete files. It is often used in a High-Performance Cluster (HPC) setting and applications where parallel or simultaneous data access across threads, processing tasks, compute instances, and users with high throughput are required. As Amazon SageMaker Studio embeds an Amazon EFS filesystem, each user on Amazon SageMaker Studio gets a home directory for storing and accessing data, codes, and notebooks.

Database and analytics

Besides storage options, where data is saved as a file or an object, AWS users can store and access data at a data point level using database services such as Amazon Relational Database Service (Amazon RDS) and Amazon DynamoDB. AWS Analytics services such as AWS Glue and Amazon Athena provide capabilities in storing, querying, and data processing that are critical in the early phase of the ML life cycle.

For an ML project, relational databases are a common source of data for modeling. Amazon RDS is a cost-efficient and scalable relational database service in the cloud. It offers six database engines, including open sourced PostgreSQL, MySQL, and MariaDB, and the Oracle and SQL Server commercial databases. Infrastructure provisioning and management are made easy with Amazon RDS.

Another popular database is NoSQL, which uses key-value pairs as the data structure. Unlike relational databases, stringent schema requirements for tables are not required in NoSQL databases. Users can input data with a flexible schema for each row without needing to change the schema. Amazon DynamoDB is a key-value and document database that is fully managed, serverless, and highly scalable.

AWS Glue is a data integration service that has several features to help developers discover and transform data from sources for analytics and ML. The AWS Glue Data Catalog offers a persistent metadata store as a central repository for all your data sources, such as tables in Amazon S3, Amazon RDS, and Amazon DynamoDB. Developers can view all their tables and metadata such as the schema and time of update in one place – AWS Glue Data Catalog. AWS Glue's ETL service helps streamline the extract, transform, and load steps right after data is discovered and cataloged in the AWS Glue Data Catalog.

Amazon Athena is an analytics service that gives developers an interactive and serverless query experience. As a serverless service, developers do not need to think about the infrastructure underneath but instead focus on their data queries. You can easily point Amazon Athena to your data in Amazon S3 with a schema definition to start querying. Amazon Athena integrates natively with the AWS Glue Data Catalog to allow you to quickly and easily query against your data from all sources and services. Amazon Athena is also heavily integrated into several aspects of Amazon SageMaker Studio, which we will talk about in more detail throughout this book.

Security

Security is job zero when you develop your applications, access data, and train ML models on AWS. The access and identity control aspect of the security is governed by the AWS Identity and Access Management (IAM) service. Any control over services, cloud resources, authentication, and authorization can be granularly managed by AWS IAM.

Key concepts in IAM are the IAM user, group, role, and policy. Each person who logs onto AWS would assume an IAM user. Each IAM user has a list of IAM policies attached that governs the resources and actions in AWS that this IAM user can command and access. An IAM user can also inherit IAM policies from that of an IAM group, a collection of users who have similar responsibilities. An IAM role is similar to an IAM user in that it has a set of permissions to access resources and to perform actions. An IAM role differs from an IAM user in that a role can be assumed by users, applications, or services. For example, you can create and assign an AWS service role to an application in the cloud to permit what services and resources this application can access. An IAM user who has permission to an application can securely execute the application without worrying that the application would reach out to unauthorized resources. More information can be found here: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html.