Monitoring

Monitoring systems should answer two questions: what's broken and why? Monitoring allows us to:

identify long term trends;
compare metrics across time or experiments;
alert on possible faults;
visualise data in dashboards; and
conduct retrospective analysis: what else happened when this service faulted?

Key concepts

Monitoring describes the collection, processing, aggregation and real-time display of quantitative data, e.g. transaction counts and processing times.
White-box monitoring is based on metrics exposed by a system's internals, e.g. logs or an endpoint that emits internal statistics.
Black-box monitoring tests externally visible behaviour as a user would see it.
A dashboard surfaces metrics for operator attention.
An alert is a human-facing notification.
A root cause is the underlying defect which, if repaired, instills confidence that an incident will not recur.

Golden signals

Latency -- time taken to process transactions.
Traffic -- measure of demand (rate of transactions).
Errors -- percentage of failed transactions.
Saturation -- how "full" the service is.
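
As a rough illustration, the four signals can be derived from one window of request records. This is a minimal sketch, not a prescribed method: the `Request` type, the `golden_signals` function and its `in_flight` and `capacity` parameters are hypothetical names, and measuring saturation as in-flight requests over a fixed capacity is an assumption (a real service might use memory, I/O or queue depth instead).

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    duration_ms: float  # time taken to process the transaction
    ok: bool            # False if the transaction failed


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarise one observation window into the four golden signals."""
    total = len(requests)
    failed = sum(1 for r in requests if not r.ok)
    return {
        # Latency: median processing time over the window.
        "latency_ms_median": median(r.duration_ms for r in requests) if total else None,
        # Traffic: transactions per second.
        "traffic_rps": total / window_s,
        # Errors: percentage of failed transactions.
        "error_pct": 100.0 * failed / total if total else 0.0,
        # Saturation: how "full" the service is (assumed: in-flight / capacity).
        "saturation_pct": 100.0 * in_flight / capacity,
    }


window = [Request(10, True), Request(20, True), Request(30, False), Request(40, True)]
signals = golden_signals(window, window_s=2, in_flight=3, capacity=12)
```

The median is used here only to keep the sketch short; for latency a high quantile is often more informative than any single central value.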

Distribution

The mean is often not what we want to monitor: it hides the shape of the distribution, so a small fraction of very slow requests can sit unnoticed behind a healthy-looking average.
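
A small made-up example shows the problem. The two latency samples below are invented for illustration and `percentile` is a hypothetical helper using the nearest-rank method; both samples have the same mean, but only one has a slow tail.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


steady = [200] * 100               # every request takes 200 ms
tailed = [100] * 90 + [1100] * 10  # most are fast, one in ten is very slow

same_mean = mean(steady) == mean(tailed)  # both means are 200 ms
p95_steady = percentile(steady, 95)       # 200 ms
p95_tailed = percentile(tailed, 95)       # 1100 ms
```

An alert on the mean would treat these two services as identical, while the 95th percentile separates them clearly.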

Consider how often you measure and how you aggregate: for example, you might sample CPU utilisation every second, count each sample into a bucket of 5% granularity, and aggregate the bucket counts once per minute.
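
The bucketing scheme can be sketched as follows. The names `bucket_for` and `aggregate_minute` are hypothetical, and placing a reading of exactly 100% in the top bucket is an assumption made here so that there are exactly twenty buckets.

```python
from collections import Counter

BUCKET_WIDTH_PCT = 5  # 5% granularity, as in the text


def bucket_for(utilisation_pct):
    """Return the lower bound of the 5%-wide bucket containing the sample."""
    clamped = min(max(utilisation_pct, 0.0), 100.0)
    lower = int(clamped // BUCKET_WIDTH_PCT) * BUCKET_WIDTH_PCT
    # Assumption: exactly 100% is counted in the top bucket (95 to 100).
    return min(lower, 100 - BUCKET_WIDTH_PCT)


def aggregate_minute(per_second_samples):
    """Collapse one minute of per-second samples into counts per bucket."""
    return Counter(bucket_for(s) for s in per_second_samples)


# One hypothetical minute: 30 s near 12%, 20 s near 47.5%, 10 s pegged at 100%.
minute = aggregate_minute([12.0] * 30 + [47.5] * 20 + [100.0] * 10)
```

Storing twenty counters per minute rather than sixty raw samples keeps the shape of the distribution while bounding storage cost.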

Histograms

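Histograms represent distributions of numerical values as bars of different heights and widths. Widths represent class intervals. When all intervals are the same width, heights are proportional to frequency; when widths differ, it is the area of each bar that is proportional to frequency.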
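
A minimal sketch of building a histogram over class intervals of unequal width follows. The interval edges and latency values are invented for illustration, and `histogram` is a hypothetical helper. Dividing each count by its interval width gives a frequency density, which is a fair bar height when the widths differ.

```python
from bisect import bisect_right


def histogram(values, edges):
    """Count values per class interval [edges[i], edges[i+1]); edges must be sorted."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect_right(edges, v) - 1
        if 0 <= i < len(counts):  # values outside the edges are ignored
            counts[i] += 1
    return counts


edges = [0, 10, 50, 100, 500]  # hypothetical latency intervals in ms
latencies_ms = [3, 7, 12, 48, 49, 75, 120, 499]

counts = histogram(latencies_ms, edges)  # [2, 3, 1, 2]
# Frequency density: count divided by interval width.
densities = [c / (hi - lo) for c, lo, hi in zip(counts, edges, edges[1:])]
```

Narrow intervals where values are dense and wide intervals in the tail are a common choice for latency, since most of the detail of interest sits at the low end.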

Quantiles

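Quantiles are cut points that divide a distribution into intervals holding equal shares of the values; they are not the intervals themselves. For example, the 0.99 quantile (99th percentile) is the value at or below which 99% of observations fall.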
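
As an illustration, Python's standard library can compute quartile cut points directly. The sample below is invented, and the default interpolation method of `statistics.quantiles` is only one of several conventions for estimating quantiles.

```python
from bisect import bisect_right
from statistics import quantiles

latencies_ms = list(range(1, 101))  # hypothetical sample: 1 ms to 100 ms, already sorted

# n=4 asks for quartiles: three cut points that split the data into four intervals.
cuts = quantiles(latencies_ms, n=4)

# Number of samples at or below each cut point: a quarter, a half, three quarters.
at_or_below = [bisect_right(latencies_ms, c) for c in cuts]
```

The same call with `n=100` yields percentile cut points, from which a 95th or 99th percentile can be read off.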

Alerting

Asking these questions when introducing rules helps avoid false positives and pager burnout:

Does this rule detect an otherwise undetected condition that's urgent, actionable and actively or imminently user-visible?
Can I ever ignore the alert knowing it's benign? When and why? How can I avoid this case?
Are users definitely being affected? Can I differentiate between active and traffic-drained or test deployments?
Can I action this alert? Could it wait until the morning? Can it safely be automated? Will my action be a short-term workaround or long-term fix?
Is anyone else being paged? If so, which responder is most relevant?

Every time we page someone we're expecting them to respond urgently. Every page must be actionable, and must require human intelligence to make a call. Pages should be reserved for novel problems; frequent pages with rote responses are a red flag.
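
The questions above can be read as a routing decision. The sketch below is one possible encoding, not an established policy: the `Condition` fields, the `route` function and the four destinations ("automate", "page", "ticket", "log") are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Condition:
    urgent: bool        # cannot wait until the morning
    actionable: bool    # a responder can do something about it
    user_visible: bool  # actively or imminently affecting users
    automatable: bool   # a rote response could safely be scripted


def route(c):
    """Decide where a detected condition should go (hypothetical policy)."""
    if c.automatable:
        return "automate"  # rote responses should not need a human
    if c.urgent and c.actionable and c.user_visible:
        return "page"      # needs human judgement right now
    if c.actionable:
        return "ticket"    # real work, but it can wait
    return "log"           # nothing to do; keep for retrospective analysis


outage = Condition(urgent=True, actionable=True, user_visible=True, automatable=False)
drained = Condition(urgent=False, actionable=True, user_visible=False, automatable=False)
```

Here `route(outage)` pages a responder, while `route(drained)`, a problem on a traffic-drained deployment, becomes a ticket.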

When the volume of pages is too high, attrition rises and the volume of interruptions can prevent progress: it's sometimes necessary to reduce SLOs to allow SREs to focus on resolving the root causes.
