Failure strategies
A major requirement for the DevOps team is a way to trigger a rollback automatically whenever an issue occurs in the production environment. The company has a strict service-level agreement (SLA) that guarantees 99.99% uptime for its Software as a Service (SaaS) product. That target leaves a downtime budget of only about 4.3 minutes in a 30-day month, so any outage longer than that would breach the uptime commitment.
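To make the downtime budget concrete, the short Python sketch below computes how many minutes of downtime a given uptime percentage allows. It assumes a 30-day month by default; the function name `downtime_budget_minutes` is illustrative, and a real SLA may define its own measurement window.

```python
# Minimal sketch: downtime budget implied by an uptime SLA.
# Assumption: the SLA is measured over a fixed window of `days` days
# (30 by default); real contracts may define the window differently.

def downtime_budget_minutes(uptime_percent: float, days: int = 30) -> float:
    """Return the maximum allowed downtime, in minutes, over the window."""
    total_minutes = days * 24 * 60  # 30 days -> 43,200 minutes
    return total_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    print(f"{downtime_budget_minutes(99.99):.2f}")  # prints 4.32
```

With a 31-day month the budget grows only slightly, to about 4.46 minutes, which is why an automated rollback matters: a human responding to an alert can easily use up the entire monthly budget.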
The team has researched the most common issues that could break a production deployment, how to test for them, and what to monitor to catch any other issues that might arise. What remains unsolved is how to trigger a rollback automatically in a GitOps model.
They could automate a revert command that runs based on the deployment status reported by Kubernetes, but that would require significant scripting, as well as a way to pass along the specific repository that originally triggered the deployment. However...