Error budgets

Implementing SRE brings to the surface tensions between development and SRE teams over how the product should be governed:

  • Software fault tolerance: how far do we harden the software against unexpected events?
  • Testing: too much is costly, but too little leads to increased toil.
  • Push frequency and size: deployments increase risk, but there's value in innovation.
  • Canary duration and size: how long should we wait, and what size should our increments be?

Error budgets aim to address these tensions between teams by formalising values in an agreed policy, granting both teams shared ownership.

Overview

Fundamentally, an error budget is an SLO based on meeting other SLOs. Error budgets exist for three reasons: to recognise that expecting to meet SLOs 100% of the time is unrealistic and costly (lost innovation carries an opportunity cost), to help prevent chronic reliability issues, and to help keep the product development and SRE teams aligned.
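The arithmetic behind a budget can be sketched as follows. This is a minimal illustration, assuming the common convention that the budget is the share of a window in which the SLO may be missed (1 minus the SLO target). The function names, the 99.9% target and the 30-day window are hypothetical and chosen only for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes in the window during which the SLO may be missed.

    Assumes the budget is (1 - slo_target) of the whole window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining_fraction(slo_target: float, window_days: int,
                              bad_minutes: float) -> float:
    """Fraction of the budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - bad_minutes / budget


if __name__ == "__main__":
    # Hypothetical service: 99.9% target over a 30-day window.
    print(round(error_budget_minutes(0.999, 30), 1))             # about 43.2 minutes
    print(round(budget_remaining_fraction(0.999, 30, 21.6), 2))  # about half left
```

A target of exactly 100% is rejected because it would leave no budget at all, which is the situation the overview describes as unrealistic.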

Management

Track the error budget both monthly and annually to account for catastrophic incidents that burn a large share of the budget in the short term.

Use the burn rate, rather than exhaustion of the error budget, as the indicator for action, to avoid getting stuck in a firefighting cycle.
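One way to express a burn rate is sketched below, assuming the common definition: the observed error ratio divided by the error ratio the SLO allows, so that a rate of 1 spends exactly one budget over the full window. The function names and example figures are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the budget is being spent relative to plan.

    1.0 means the budget lasts exactly the window; 2.0 means it is
    gone in half the window.
    """
    allowed_ratio = 1.0 - slo_target
    if allowed_ratio <= 0.0:
        raise ValueError("slo_target must be below 1")
    return error_ratio / allowed_ratio


def hours_to_exhaustion(rate: float, window_days: int,
                        remaining_fraction: float = 1.0) -> float:
    """Hours until the remaining budget is spent at a constant burn rate."""
    if rate <= 0.0:
        return float("inf")  # nothing is being burned
    return window_days * 24 * remaining_fraction / rate


if __name__ == "__main__":
    # Hypothetical: 0.2% of requests failing against a 99.9% target.
    rate = burn_rate(0.002, 0.999)
    print(round(rate, 2))                        # about twice the planned rate
    print(round(hours_to_exhaustion(rate, 30)))  # hours left in a 30-day window
```

Alerting on the rate gives warning while budget still remains, whereas an alert on exhaustion only fires once there is nothing left to protect.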

Error budget policy

An executive-backed error budget policy empowers SREs to take action to ensure that reliability issues are addressed. Since the policy will likely be consistent across multiple teams, it may make sense to apply a single organisation-wide policy, even in larger organisations. The policy should:

  • Result in engineering work to improve reliability.
  • Describe when this happens.
  • Describe how this happens (e.g. committing one developer to fixing all high priority items identified in the postmortem; giving pager back to the development team during times of constant error budget exhaustion).
  • Include consequences for this not happening (e.g. for a silver bullet release of a feature that may otherwise cause a breach of contract, identify why it was necessary).
  • Be consistently applied.
  • Document the escalation path.
  • Be agreed upon and signed by all parties.

Document worked examples of escalations at defined thresholds to help keep the application of the policy clear. Remember that the policy may have to be applied in high-stress situations, so keep its wording clear and concise:

  1. Automated alerts notify SRE of an at-risk SLO:
     a. 9 hour budget burn.
     b. 36 hour budget burn.
  2. SREs conclude they need help to defend the SLO and escalate to developers.
  3. The 30-day error budget is exhausted and the root cause remains unfound; feature releases are blocked and the dev team allocates more resources.
  4. The 90-day error budget is exhausted and the root cause remains unfound; SRE escalates to executive leadership to obtain more engineering time for reliability work.
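The escalation ladder above could be encoded so that the same rules are applied every time. The sketch below is illustrative only: the `BudgetState` fields, the `Action` names and the ordering of checks are assumptions made for the example, and the alert thresholds themselves are left to whatever alerting system raises them.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "no action"
    ALERT_SRE = "notify SRE of at-risk SLO"
    ESCALATE_TO_DEVS = "SRE escalates to developers"
    BLOCK_RELEASES = "block feature releases; dev team adds resources"
    ESCALATE_TO_EXECS = "SRE escalates to executive leadership"


@dataclass
class BudgetState:
    # All fields are hypothetical inputs supplied by monitoring or by people.
    burn_alert_firing: bool = False
    sre_needs_help: bool = False
    budget_30d_exhausted: bool = False
    budget_90d_exhausted: bool = False
    root_cause_found: bool = False


def next_action(state: BudgetState) -> Action:
    """Return the most severe step of the ladder that currently applies."""
    if state.budget_90d_exhausted and not state.root_cause_found:
        return Action.ESCALATE_TO_EXECS
    if state.budget_30d_exhausted and not state.root_cause_found:
        return Action.BLOCK_RELEASES
    if state.sre_needs_help:
        return Action.ESCALATE_TO_DEVS
    if state.burn_alert_firing:
        return Action.ALERT_SRE
    return Action.NONE


if __name__ == "__main__":
    state = BudgetState(burn_alert_firing=True, budget_30d_exhausted=True)
    print(next_action(state).value)
```

Checking the most severe condition first means a long-running incident is never reported at a lower step than the one it has already reached.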