Milestones

Implementing Site Reliability Engineering within an organisation is a marathon.

Getting started

  • Some SLOs are defined and they're met most months
  • Incident management process defined
  • Blameless postmortems
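
As a rough illustration of what "an SLO is met" can mean for a request-based availability SLO, the sketch below compares the observed success ratio with the target and reports how much of the error budget is left. The function names, the request counts and the 99.9% target are assumptions chosen for the example, not values from this document.

```python
def slo_met(slo_target: float, total_requests: int, failed_requests: int) -> bool:
    """True if the observed success ratio is at or above the SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - failed_requests) / total_requests >= slo_target


def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The error budget is the number of failures the SLO tolerates in the period.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 500 failures, 99.9% target.
    print(slo_met(0.999, 1_000_000, 500))
    print(round(error_budget_remaining(0.999, 1_000_000, 500), 3))
```

Tracking the remaining budget rather than only a pass/fail result makes it easier to see whether "met most months" is a comfortable margin or a near miss.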

Beginner

  • Staffing and hiring plan in place

  • With staff in place, the team is on call

  • Documentation for release, service setup and service teardown processes is in place

  • Canary release process has been evaluated as a function of the SLO

  • Rollback process is in place, where appropriate.

  • Operational playbook/runbook is in place, even if incomplete.

  • Disaster recovery simulations are taking place, at least annually.

  • Team plans and executes project work, possibly independently from the development team.

  • Enough operational load to exercise incident response process at least once per week.

  • SRE team charter agreed beyond SRE (e.g. CTO).

  • Periodic dev/SRE leadership meetings to discuss goals and share information.

  • Project planning and execution is joint between dev and SRE. SRE work and impact is visible to dev leadership.

Intermediate

  • Periodic reviews of SRE project work and impact with business leaders.
  • Periodic reviews of SLIs and SLOs with business leaders.
  • Low volume of toil (<= 50%).
  • Configuration changes are applied in a way that takes reliability into account.
  • SRE plans to scale impact beyond adding services to their on-call load.
  • Rollback for canary failures is in place, maybe automated.
  • Periodic incident management tests, combining role playing with automation.
  • Escalation policy tied to SLI violations (e.g. release freeze).
  • Periodic reviews of postmortems and action items shared between dev and SRE.
  • DR periodically tested against non-production environments.
  • Measure demand vs. capacity and use active forecasting to determine when demand may exceed capacity.
  • SRE team may produce long-term plans (annual roadmaps) jointly with devs.
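
One simple way to approach the demand vs. capacity forecast is to fit a straight line to recent demand samples and see where it crosses the capacity limit. The sketch below assumes evenly spaced samples (for example, weekly peak load) and a linear trend. Both are simplifying assumptions, and `periods_until_exhaustion` is a hypothetical helper name.

```python
from typing import Optional, Sequence


def periods_until_exhaustion(demand_history: Sequence[float], capacity: float) -> Optional[float]:
    """Estimate how many periods after the last sample demand reaches capacity.

    Fits an ordinary least-squares line to evenly spaced samples.
    Returns None when the trend is flat or falling (no crossing is forecast).
    """
    n = len(demand_history)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")

    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(demand_history) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, demand_history))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None

    # Index at which the fitted line reaches capacity, relative to the last sample.
    crossing = (capacity - intercept) / slope
    return max(0.0, crossing - (n - 1))


if __name__ == "__main__":
    # Hypothetical weekly peak demand and a capacity limit of 160 units.
    print(periods_until_exhaustion([100, 110, 120, 130], capacity=160))
```

Real demand is rarely linear, so a forecast like this is better treated as a trigger for a planning conversation than as a precise date.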

Advanced

More senior/broader teams:

  • At least some individuals have had a major positive impact beyond firefighting or operations.
  • Project work is "horizontally executed", positively affecting many services at once as opposed to linearly or worse per-service.
  • Most alerts are based on SLO burn rate.
  • Automated DR testing is in place, and its positive impact can be measured.
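
For the burn-rate item above, a minimal sketch: burn rate is the observed error ratio divided by the error ratio the SLO allows, so a value of 1.0 spends the budget exactly over the SLO period. The two-window check and the default threshold of 14.4 are illustrative assumptions (a commonly cited starting point for a one-hour window against a 30-day SLO), not values from this document.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent, relative to the allowed rate."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn faster than the threshold.

    The long window shows that the problem is significant. The short window
    shows that it is still happening, so the alert stops soon after recovery.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # Hypothetical 99.9% SLO with 2% of requests failing in both windows.
    print(should_page(0.02, 0.02, slo_target=0.999))
    # Same long window, but the short window has already recovered.
    print(should_page(0.02, 0.0005, slo_target=0.999))
```

Alerting on burn rate rather than on raw error counts ties pages to the SLO itself, which is why it appears this late in the maturity list.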

Unlikely to be implemented by most:

  • SREs not on call 24x7, geographically distributed between two or more locations.
  • SRE and developer orgs share common goals, allowing separate reporting lines and avoiding conflicts of interest.

Next steps

  • Are we meeting the business goals?
  • Regular reviews to ensure churn doesn't lead to regressions.