Setting up an Amazon SageMaker notebook instance
Experimentation is a key part of the ML process. Developers and data scientists use a collection of open source tools and libraries for data exploration and processing, and of course, to evaluate candidate algorithms. Installing and maintaining these tools takes a fair amount of time, which would probably be better spent on studying the ML problem itself!
In order to solve this problem, Amazon SageMaker makes it easy to fire up a notebook instance in minutes. A notebook instance is a fully managed Amazon EC2 instance that comes preinstalled with the most popular tools and libraries: Jupyter, Anaconda (and its conda package manager), numpy, pandas, deep learning frameworks, and even NVIDIA GPU drivers.
Note:
If you're not familiar with S3 at all, please read the following documentation: https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html
Let's create one such instance using the AWS Console (https://console.aws.amazon.com/sagemaker/):
- In the Notebook section of the left-hand vertical menu, click on Notebook instances, as shown in the next screenshot:
Note:
The AWS console is a living thing. By the time you're reading this, some screens may have been updated. Also, you may notice small differences from one region to the next, as some features or instance types are not available in every region.
- Then, click on Create notebook instance. In the Notebook instance settings box, we need to enter a name, and select an instance type: as you can see in the drop-down list, SageMaker lets us pick from a very wide range of instance types. As you would expect, pricing varies according to the instance size, so please make sure you familiarize yourself with instance features and costs (https://aws.amazon.com/sagemaker/pricing/).
- We'll stick to ml.t2.medium for now. As a matter of fact, it's an excellent default choice if your notebooks only invoke SageMaker APIs that create fully managed infrastructure for training and deployment: there's no need for anything larger. If your workflow requires local data processing and model training, then feel free to scale up as needed. We can ignore Elastic Inference for now, as it will be covered in Chapter 13, Optimizing Prediction Cost and Performance. Thus, your setup screen should look like the following screenshot:
- As you can see in the following screenshot, we could optionally apply a lifecycle configuration, a script that runs when a notebook instance is created or when it is restarted, in order to install additional libraries, clone repositories, and so on. We could also add more storage (the default is set to 5 GB):
- In the Permissions and encryption section, we need to create an AWS Identity and Access Management (IAM) role for the notebook instance: this role allows the instance to access storage in Amazon S3, to create Amazon SageMaker infrastructure, and so on.
Select Create a new role, which opens the following screen:
The only decision we have to make here is whether we want to allow our notebook instance to access specific Amazon S3 buckets. Let's select Any S3 bucket and click on Create role. This is the most flexible setting for development and testing, but we'd want to apply much stricter settings for production. Of course, we can edit this role later on in the IAM console, or create a new one.
Optionally, we can disable root access to the notebook instance, which helps lock down its configuration. We can also enable storage encryption using AWS Key Management Service (https://aws.amazon.com/kms). Both features are extremely important in high-security environments, but we won't enable them here.
Once you've completed this step, your screen should look like this, although the name of the role will be different:
- As shown in the following screenshot, the optional Network section lets you pick the Amazon Virtual Private Cloud (VPC) where the instance will be launched. This is useful when you need tight control over network flows from and to the instance, for example, to deny it access to the internet. Let's not use this feature here:
- The optional Git repositories section lets you add one or more Git repositories that will be automatically cloned on the notebook instance when it's first created. You can select any public Git repository, or select one from a list of repositories that you previously defined in Amazon SageMaker: the latter can be done under Git repositories in the Notebook section of the left-hand vertical menu.
Let's clone one of my repositories to illustrate, and enter its name as seen in the following screenshot. Feel free to use your own!
- Last but not least, the optional Tags section lets us tag notebook instances. It's always good practice to tag AWS resources, as this makes it much easier to manage them later on. Let's add a couple of tags.
- As shown in the following screenshot, let's click on Create notebook instance:
Under the hood, SageMaker fires up a fully managed Amazon EC2 instance, using an Amazon Machine Image (AMI) preinstalled with Jupyter, Anaconda, deep learning libraries, and so on. Don't look for it in the EC2 console: you won't see it.
- Five to ten minutes later, the instance is in service, as shown in the following screenshot. Let's click on Open JupyterLab:
We'll jump straight into JupyterLab. As shown in the following screenshot, we see in the left-hand panel that the repository has been cloned. In the Launcher panel, we see the many conda environments that are readily available for TensorFlow, PyTorch, Apache MXNet, and more:
The rest is vanilla Jupyter, and you can get to work right away!
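The console steps above can also be scripted. The following is only a minimal sketch using the AWS SDK for Python (boto3), assuming that credentials and a default region are already configured and that an IAM role for SageMaker already exists. The helper names build_notebook_request and create_notebook are hypothetical, as are the instance name, role ARN, repository URL, and tag values you would pass to them.

```python
def build_notebook_request(name, role_arn, repo_url=None, tags=None):
    """Build keyword arguments for the SageMaker CreateNotebookInstance API.

    Hypothetical helper: it only assembles a dictionary and mirrors the
    choices made in the console walkthrough.
    """
    request = {
        "NotebookInstanceName": name,
        "InstanceType": "ml.t2.medium",     # same instance type as in the walkthrough
        "RoleArn": role_arn,                # IAM role assumed to exist already
        "VolumeSizeInGB": 5,                # default storage size
        "RootAccess": "Enabled",            # root access left enabled, as in the walkthrough
        "DirectInternetAccess": "Enabled",  # no VPC selected
    }
    if repo_url:
        # Git repository cloned when the instance is first created
        request["DefaultCodeRepository"] = repo_url
    if tags:
        request["Tags"] = [{"Key": k, "Value": v} for k, v in tags.items()]
    return request


def create_notebook(request):
    """Create the instance, wait until it is in service, and return a URL to open it."""
    import boto3  # imported here so that building a request does not require boto3

    sm = boto3.client("sagemaker")
    name = request["NotebookInstanceName"]
    sm.create_notebook_instance(**request)
    # Block until the instance status becomes InService
    sm.get_waiter("notebook_instance_in_service").wait(NotebookInstanceName=name)
    response = sm.create_presigned_notebook_instance_url(NotebookInstanceName=name)
    return response["AuthorizedUrl"]


# Example usage (placeholder values, not run here):
# request = build_notebook_request(
#     "my-notebook",
#     "arn:aws:iam::123456789012:role/my-sagemaker-role",
#     repo_url="https://github.com/example/example-repo",
#     tags={"project": "demo"},
# )
# print(create_notebook(request))
```

Keeping the request in a plain dictionary makes it easy to review or version the settings before anything is actually created.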
Coming back to the AWS console, we see that we can stop, start, and delete a notebook instance, as shown in the next screenshot:
Stopping a notebook instance is similar to stopping an Amazon EC2 instance: storage is persisted until the instance is started again.
Once a notebook instance is stopped, you can delete it: the storage will be destroyed, and you will no longer be charged for it.
If you're going to use this instance to run the examples in this book, I'd recommend stopping it when you're done and restarting it when you need it again. This will save you the trouble of recreating it again and again, your work will be preserved, and the costs will really be minimal.
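The stop, start, and delete operations described above are also available through the SageMaker API. Here is a minimal sketch with boto3, assuming the caller passes in a SageMaker client (for example, boto3.client("sagemaker")). The function names are hypothetical, and the sketch only handles the simple case where the instance is either in service or already stopped.

```python
def stop_notebook(sm, name):
    """Stop the instance and wait until it is stopped; its storage is kept."""
    sm.stop_notebook_instance(NotebookInstanceName=name)
    sm.get_waiter("notebook_instance_stopped").wait(NotebookInstanceName=name)


def start_notebook(sm, name):
    """Start a stopped instance and wait until it is in service again."""
    sm.start_notebook_instance(NotebookInstanceName=name)
    sm.get_waiter("notebook_instance_in_service").wait(NotebookInstanceName=name)


def delete_notebook(sm, name):
    """Delete the instance, stopping it first if it is still in service."""
    status = sm.describe_notebook_instance(NotebookInstanceName=name)[
        "NotebookInstanceStatus"
    ]
    if status == "InService":
        # An instance has to be stopped before it can be deleted
        stop_notebook(sm, name)
    sm.delete_notebook_instance(NotebookInstanceName=name)
    sm.get_waiter("notebook_instance_deleted").wait(NotebookInstanceName=name)


# Example usage (placeholder name, not run here):
# import boto3
# sm = boto3.client("sagemaker")
# stop_notebook(sm, "my-notebook")    # at the end of a working session
# start_notebook(sm, "my-notebook")   # when coming back to the examples
```

Passing the client as an argument keeps the functions easy to exercise with a stub, without touching a real AWS account.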