Ex-Google SRE on Alerting
The content of this document is derived from a document titled My Philosophy on Alerting by Rob Ewaschuk.
The original document is publicly available at this link.
Some ideas from the document linked above:
- over-monitoring is a harder problem to solve than under-monitoring
- alert rules must be able to classify problems into one of the following classes:
  - availability and basic functionality
  - correctness (completeness, freshness and durability of data)
  - feature-specific problems
- focus on symptoms to catch the problem. Include cause-based information, but the alert itself should be based on symptoms. For example, alert on queries failing rather than on the database server being down.
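
The symptom-versus-cause idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and does not come from the original document: the names `ProblemClass`, `Alert` and `evaluate_query_errors`, the `db_server_up` flag, and the 5% error-ratio threshold are all hypothetical. The alert fires only on the user-visible symptom (queries failing), and the cause (database server down) is attached as extra context.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ProblemClass(Enum):
    # The three classes listed above.
    AVAILABILITY = "availability and basic functionality"
    CORRECTNESS = "correctness"
    FEATURE_SPECIFIC = "feature-specific problem"


@dataclass
class Alert:
    problem_class: ProblemClass
    symptom: str
    cause_hints: List[str] = field(default_factory=list)


def evaluate_query_errors(
    failed: int,
    total: int,
    db_server_up: bool,
    max_error_ratio: float = 0.05,  # hypothetical threshold, tune per service
) -> Optional[Alert]:
    """Return an Alert only when the symptom (failing queries) is present."""
    if total == 0:
        return None  # no traffic, so nothing to judge the symptom on
    ratio = failed / total
    if ratio <= max_error_ratio:
        # No symptom, so no alert, even if the database server is down.
        return None
    hints = []
    if not db_server_up:
        # Cause-based information is included, but it does not trigger the alert.
        hints.append("database server is down")
    return Alert(
        problem_class=ProblemClass.AVAILABILITY,
        symptom=f"{ratio:.0%} of queries are failing",
        cause_hints=hints,
    )
```

In this sketch, a database server that is down while queries still succeed (for example, because a replica is serving them) produces no alert, whereas failing queries produce an alert regardless of what caused them.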