Monitoring and alerting
Logging and tracing provide a wealth of information about how individual components of your cloud systems are behaving, but they generally give only a partial picture of the system as a whole. Many important aspects of system health fall outside the scope of logging and tracing. These aspects are often best measured as change over time, which lets developers identify trends and anomalies.
Building on our to-do example, a sudden spike in concurrent connections to our todos-db Cloud SQL instance may indicate that a recently pushed version of todos-backend is not correctly terminating stale connections. Likewise, patterns in user traffic to our todos-frontend may help us find optimal maintenance windows or scale eagerly ahead of demand.
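To make the idea of "change over time" concrete, here is a minimal sketch of how a spike in a connection-count series might be flagged against a trailing baseline. Everything in it is an illustrative assumption rather than part of the to-do example: the function name find_spikes, the window and threshold parameters, and the sample values are hypothetical, and a real setup would more likely rely on the monitoring service's own metrics and alerting rules than on hand-written code.

```python
from statistics import mean, stdev


def find_spikes(samples, window=5, threshold=3.0):
    """Return the indices of samples that sit far above their trailing baseline.

    A sample is flagged when it exceeds the mean of the previous `window`
    samples by more than `threshold` standard deviations.
    """
    spikes = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        # Use a floor of 1.0 so a perfectly flat baseline does not make
        # every tiny fluctuation look like an anomaly.
        sigma = max(stdev(baseline), 1.0)
        if samples[i] > mu + threshold * sigma:
            spikes.append(i)
    return spikes


# Hypothetical per-minute counts of concurrent connections to todos-db.
connections = [20, 22, 21, 23, 22, 21, 24, 95, 97, 23]
print(find_spikes(connections))
```

With these sample values only index 7 (the jump to 95) is flagged. The following reading of 97 is not, because the spike itself has already inflated the trailing baseline. This is one reason simple rolling statistics are a starting point rather than a complete anomaly detector.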
Additionally, while collecting the right data is important for monitoring cloud systems and triaging issues effectively, it does not by itself give developers the early awareness of issues required to minimize downtime...