Implementing Site Reliability Engineering within an organisation is a marathon.
- Some SLOs are defined and they're met most months
- Incident management process defined
- Blameless postmortems
- Staffing and hiring plan in place
- With staff in place, the team is on call
- Documentation for release, service setup and service teardown processes is in place
- Canary release process has been evaluated as a function of the SLO
- Rollback process is in place, where appropriate.
- Operational playbook/runbook is in place, even if incomplete.
- Disaster recovery simulations take place at least annually.
- Team plans and executes project work, possibly independently from the development team.
- Enough operational load to exercise the incident response process at least once per week.
- SRE team charter agreed beyond SRE (e.g. by the CTO).
- Periodic dev/SRE leadership meetings to discuss goals and share information.
- Project planning and execution is joint between dev and SRE.
- SRE work and impact is visible to dev leadership.
- Periodic reviews of SRE project work and impact with business leaders.
- Periodic reviews of SLIs and SLOs with business leaders.
- Low volume of toil (<= 50%).
- Configuration changes are applied in a way that takes reliability into account.
- SRE plans to scale impact beyond adding services to their on-call load.
- Rollback for canary failures is in place, possibly automated.
- Periodic incident management tests, combining role playing with automation.
- Escalation policy tied to SLI violations (e.g. release freeze).
- Periodic reviews of postmortems and action items shared between dev and SRE.
- DR periodically tested against non-production environments.
- Teams measure demand vs. capacity and use active forecasting to determine when demand might exceed capacity.
- The SRE team may produce long-term plans (e.g., a yearly roadmap) jointly with devs.
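
As a rough illustration of the demand vs. capacity forecasting item above, the sketch below fits a straight line to past demand and estimates when it would reach capacity. It assumes evenly spaced observations and a linear trend, which real traffic often does not follow. The names `fit_linear_trend` and `periods_until_exhaustion` are hypothetical and exist only for this example.

```python
from typing import Optional, Sequence, Tuple


def fit_linear_trend(demand: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of demand[t] ~ slope * t + intercept for t = 0, 1, 2, ..."""
    n = len(demand)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_t = (n - 1) / 2
    mean_d = sum(demand) / n
    cov = sum((t - mean_t) * (d - mean_d) for t, d in enumerate(demand))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    return slope, mean_d - slope * mean_t


def periods_until_exhaustion(
    demand: Sequence[float], capacity: float
) -> Optional[float]:
    """Periods after the last observation until the trend line reaches capacity.

    Returns 0.0 if the fitted demand is already at or above capacity, and
    None if the trend is flat or falling (assumption: no exhaustion forecast).
    """
    slope, intercept = fit_linear_trend(demand)
    last_t = len(demand) - 1
    current = slope * last_t + intercept
    if current >= capacity:
        return 0.0
    if slope <= 0:
        return None
    return (capacity - intercept) / slope - last_t


if __name__ == "__main__":
    # Illustrative numbers only: weekly peak requests per second.
    weekly_peak_rps = [100, 110, 120, 130]
    print(periods_until_exhaustion(weekly_peak_rps, capacity=160))
```

A team would typically compare the forecast against the lead time for adding capacity, and act when the estimate drops below it.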
More senior/broader teams:
- At least some individuals have had a major positive impact beyond firefighting or operations.
- Project work is "horizontally executed", positively affecting many services at once rather than requiring effort that grows linearly (or worse) per service.
- Most alerts are based on SLO burn rate.
- Automated DR testing is in place, and its positive impact can be measured.
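
To make the burn-rate item above concrete, here is a minimal sketch of how an SLO burn rate can be computed and used in a two-window paging check. Burn rate is taken here as the observed error ratio divided by the error budget (1 minus the SLO target). The default threshold of 14.4 is an assumption borrowed from a common convention (2% of a 30-day budget spent in one hour), not something this checklist prescribes. The function names are hypothetical.

```python
from typing import Tuple


def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target).

    A value of 1.0 means the budget is being spent exactly at the rate
    that would exhaust it at the end of the SLO window.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic, so nothing is burning the budget
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(
    long_window: Tuple[int, int],
    short_window: Tuple[int, int],
    slo_target: float,
    threshold: float = 14.4,  # assumed default, see lead-in
) -> bool:
    """Page only if both windows, given as (errors, total), exceed the threshold.

    The short window confirms the problem is still happening, which
    avoids paging on an incident that has already recovered.
    """
    long_rate = burn_rate(long_window[0], long_window[1], slo_target)
    short_rate = burn_rate(short_window[0], short_window[1], slo_target)
    return long_rate >= threshold and short_rate >= threshold


if __name__ == "__main__":
    # Illustrative counts: (errors, total requests) for a 1h and a 5m window.
    print(should_page((20_000, 1_000_000), (200, 10_000), slo_target=0.999))
```

In practice these counts would come from the monitoring system's SLI queries, and the windows and thresholds would be tuned per service.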
Unlikely to be implemented by most:
- SREs not on call 24x7, geographically distributed between two or more locations.
- SRE and developer orgs share common goals, allowing separate reporting lines, avoiding conflicts of interest.
- Are we meeting the business goals?
- Regular reviews to ensure churn doesn't lead to regressions.