Milestones
Implementing Site Reliability Engineering within an organisation is a marathon.
Getting started
- Some SLOs are defined and they're met most months
- Incident management process defined
- Blameless postmortems
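As a rough illustration of what "met most months" could mean in practice, the sketch below counts the months in which a simple availability SLI met its target. The function names, the `(good, total)` event counts, and the 99.9% target are all hypothetical choices made for this example, not part of any particular tool.

```python
from typing import Sequence, Tuple


def availability(good: int, total: int) -> float:
    """Fraction of successful events; an empty month counts as fully available."""
    return 1.0 if total == 0 else good / total


def months_meeting_slo(
    monthly_counts: Sequence[Tuple[int, int]], slo_target: float
) -> int:
    """Count months whose availability is at or above the SLO target."""
    return sum(
        1 for good, total in monthly_counts
        if availability(good, total) >= slo_target
    )


def met_most_months(
    monthly_counts: Sequence[Tuple[int, int]], slo_target: float
) -> bool:
    """True if the SLO was met in more than half of the months supplied."""
    met = months_meeting_slo(monthly_counts, slo_target)
    return met * 2 > len(monthly_counts)


# Hypothetical (good, total) request counts for three months.
history = [(9995, 10000), (9980, 10000), (9999, 10000)]
print(months_meeting_slo(history, slo_target=0.999))
print(met_most_months(history, slo_target=0.999))
```

A real SLI would normally come from monitoring data rather than hand-entered counts; the point here is only that the milestone can be checked mechanically.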
Beginner
- Staffing and hiring plan is in place.
- With staff in place, the team is on call.
- Documentation for release, service setup and service teardown processes is in place.
- Canary release process has been evaluated as a function of the SLO.
- Rollback process is in place, where appropriate.
- Operational playbook/runbook is in place, even if incomplete.
- Disaster recovery simulations are taking place, at least annually.
- Team plans and executes project work, possibly independently from the development team.
- Enough operational load to exercise the incident response process at least once per week.
- SRE team charter is agreed beyond the SRE team (e.g. by the CTO).
- Periodic dev/SRE leadership meetings to discuss goals and share information.
- Project planning and execution is joint between dev and SRE. SRE work and impact is visible to dev leadership.
Intermediate
- Periodic reviews of SRE project work and impact with business leaders.
- Periodic reviews of SLIs and SLOs with business leaders.
- Low volume of toil (<= 50% of SRE time).
- Configuration changes are applied in a way that takes reliability into account.
- SRE plans to scale its impact beyond adding services to its on-call load.
- Rollback for canary failures is in place, possibly automated.
- Periodic incident management tests, combining role playing with automation.
- Escalation policy tied to SLI violations (e.g. release freeze).
- Periodic reviews of postmortems and action items shared between dev and SRE.
- DR periodically tested against non-production environments.
- Measure demand vs. capacity and use active forecasting to determine when demand may exceed capacity.
- SRE team may produce long-term plans (annual roadmaps) jointly with devs.
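The demand vs. capacity forecasting mentioned above can be sketched with a simple linear trend. This is a minimal example that assumes demand grows roughly linearly and is sampled at regular intervals; the function name and the sample figures are hypothetical, and real forecasting would usually account for seasonality and uncertainty.

```python
from typing import Optional, Sequence


def periods_until_exhaustion(
    demand: Sequence[float], capacity: float
) -> Optional[float]:
    """Fit a least-squares line to past demand and estimate how many
    periods after the latest sample the trend reaches capacity.

    Returns None when the trend is flat or decreasing, and 0.0 when the
    trend line is already at or above capacity.
    """
    n = len(demand)
    if n < 2:
        raise ValueError("need at least two demand samples")

    # Ordinary least squares with x = 0, 1, ..., n - 1.
    mean_x = (n - 1) / 2
    mean_y = sum(demand) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(demand))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    if slope <= 0:
        return None  # no growth, so the trend never reaches capacity

    crossing = (capacity - intercept) / slope  # x at which trend == capacity
    return max(0.0, crossing - (n - 1))


# Hypothetical monthly peak requests per second against a 160 rps capacity.
print(periods_until_exhaustion([100, 110, 120, 130], capacity=160))
```

The result is a number of sampling periods, so with monthly samples it reads as months of headroom left on the current trend.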
Advanced
More senior/broader teams:
- At least some individuals have had a major positive impact beyond firefighting or operations.
- Project work is "horizontally executed", positively affecting many services at once rather than scaling linearly (or worse) with the number of services.
- Most alerts are based on SLO burn rate.
- Automated DR testing is in place, and its positive impact can be measured.
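For readers unfamiliar with burn-rate alerting, the sketch below shows the basic arithmetic: burn rate is the observed error rate divided by the error rate the SLO allows, so a value of 1 means the error budget would last exactly the SLO period. The function names are hypothetical, and the 14.4 threshold and two-window check are illustrative assumptions rather than values taken from this document.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO permits."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # no traffic, so no budget is being consumed
    error_rate = bad_events / total_events
    error_budget = 1.0 - slo_target
    return error_rate / error_budget


def should_page(
    long_window_rate: float, short_window_rate: float, threshold: float = 14.4
) -> bool:
    """Page only if both windows burn fast, so that a short blip or an
    incident that has already recovered does not trigger an alert."""
    return long_window_rate >= threshold and short_window_rate >= threshold


# Hypothetical counts: 2% errors over the last hour and 3% over the last
# five minutes, against a 99% SLO.
long_rate = burn_rate(bad_events=200, total_events=10_000, slo_target=0.99)
short_rate = burn_rate(bad_events=30, total_events=1_000, slo_target=0.99)
print(should_page(long_rate, short_rate, threshold=1.5))
```

Alerting on burn rate rather than on raw error counts ties the page directly to how quickly the error budget is being spent.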
Unlikely to be implemented by most:
- SREs are not on call 24x7; the team is geographically distributed across two or more locations.
- SRE and developer orgs share common goals, allowing separate reporting lines and avoiding conflicts of interest.
Next steps
- Are we meeting the business goals?
- Regular reviews to ensure churn doesn't lead to regressions.