We need to keep the product running through human error, natural disasters, and data corruption. In this section, we will look at ways Amazon helps us lower our time to remedy some common disruptions. Good practices for incident management and post-mortems help us stay within our error budgets while strengthening our services. In our metrics-driven engineering practice, we recommend exercising these processes regularly to fine-tune your existing SLOs and identify any missing ones.
Business continuity
Snapshots
Just as we use S3 to replicate our objects, CodeCommit to protect our source, and ECR for the durability of our container images (in the next chapter), we can use disk snapshots to...