Summary
This chapter covered SRE and the KPIs for running our service in production. We started by understanding software reliability and examined how to manage an application in production using SRE. We discussed the three crucial parameters that guide SREs: SLIs, SLOs, and SLAs. We also explored error budgets and their importance in deciding when to introduce changes to the system. Then, we looked at software disaster recovery and at how RPO and RTO determine the complexity and cost of our disaster recovery measures. Finally, we looked at how DevOps or SRE teams use these concepts to manage a distributed application in production.
In the next chapter, we will put what we’ve learned to practical use and explore how to manage all these aspects using a service mesh called Istio.