Deploying a pre-trained model to an asynchronous inference endpoint
In addition to real-time and serverless inference endpoints, SageMaker offers a third option when deploying models: asynchronous inference endpoints. Why asynchronous? Instead of returning results immediately, the endpoint queues incoming requests and makes the results available asynchronously. This works well for ML requirements that involve one or more of the following:
- Large input payloads (up to 1 GB)
- A long prediction processing duration (up to 15 minutes)
A good use case for asynchronous inference endpoints is an ML model that detects objects in large video files, a task that may take more than 60 seconds to complete. In this case, an inference may take a few minutes instead of a few seconds.
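As a rough sketch of what deployment looks like with the SageMaker Python SDK, the following deploys a pre-trained model behind an asynchronous endpoint by passing an AsyncInferenceConfig to Model.deploy(). Note that the S3 locations, IAM role ARN, framework version, and instance type here are placeholders chosen for illustration, not values taken from this section:

```python
from sagemaker import Session, image_uris
from sagemaker.model import Model
from sagemaker.async_inference import AsyncInferenceConfig

session = Session()
role = "arn:aws:iam::<ACCOUNT_ID>:role/<SAGEMAKER_EXECUTION_ROLE>"  # placeholder role ARN

# Pre-trained model artifacts packaged as model.tar.gz in S3 (placeholder path)
model = Model(
    image_uri=image_uris.retrieve(
        framework="pytorch",                  # assumed framework/version for illustration
        region=session.boto_region_name,
        version="1.12",
        py_version="py38",
        instance_type="ml.m5.xlarge",
        image_scope="inference",
    ),
    model_data="s3://<BUCKET>/models/model.tar.gz",
    role=role,
    sagemaker_session=session,
)

# Route predictions through an asynchronous endpoint: requests are queued,
# and results are written to the S3 output path once processing completes.
async_config = AsyncInferenceConfig(
    output_path="s3://<BUCKET>/async-output/",
    max_concurrent_invocations_per_instance=4,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    async_inference_config=async_config,
)
```

When a model is deployed this way, the returned predictor is an AsyncPredictor, and its predict_async() method can be used to queue requests against the endpoint, as the steps below describe.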
How do we use asynchronous inference endpoints? To invoke one, we do the following:
- The request payload is...