We can further enhance our approach, based on lessons learned:

  • Long lead times caused by the complex scheduling requirements of reviewers, leading to stricter prioritisation and serialisation of takeovers.
  • Practice differed across services, producing different implementation styles. Meeting SRE standards then required significant rework, for example in logging and instrumentation, which wasted engineering effort.
  • Common service issues and outages follow recognisable patterns, particularly around overload and data hotspotting, but fixes can't easily be applied across services.
  • SRE contributions are localised to a single service and aren't generic enough for reuse, limiting the spread of SRE knowledge to other services.

External factors

Business pressure also affects the SRE team:

  • As software increasingly moves to a microservices model, the number of services needing SRE support grows, demanding more staff.
  • Hiring experienced, qualified SREs is costly: it requires substantial recruiting effort, plus a large training budget and time commitment to bring new hires up to speed.
  • Together, these factors prevent the SRE team from engaging with many teams whose services fall below current priorities. Serving those teams calls for extending SRE support beyond the existing engagement model.


Frameworks offer a structural solution to the problem, incorporating:

  • Codified best practices based on lessons learned, providing a base upon which future services can be developed.
  • Reusable solutions to known scalability and reliability issues.
  • A common production platform and control surface, addressing instrumentation and operational controls from the beginning.
  • Easier automation and smarter systems, allowing, for example, distributed tracing without service-specific middleware.

These pre-approved frameworks can be modelled in multiple languages/runtimes and domains as necessary.
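
To make the idea concrete, here is a minimal sketch of what such a framework could look like in Python. Everything in it is an assumption made for illustration: the names `ServiceFramework`, `Overloaded`, `handler`, `handle`, and `max_in_flight` are hypothetical and do not come from any real library. The sketch shows how uniform logging, per-route counters, trace-ID propagation, and a crude overload guard can be supplied once by the framework rather than re-implemented by each service. It is single-threaded and keeps metrics in memory; a production framework would need thread safety and a real metrics and tracing backend.

```python
import logging
import time
import uuid
from collections import Counter
from typing import Callable, Dict, Optional


class Overloaded(Exception):
    """Raised when the framework sheds a request to protect the service."""


class ServiceFramework:
    """Hypothetical framework: every handler gets the same instrumentation."""

    def __init__(self, name: str, max_in_flight: int = 100) -> None:
        self.name = name
        self.max_in_flight = max_in_flight  # simple overload limit (assumed policy)
        self.in_flight = 0
        self.metrics: Counter = Counter()   # in-memory stand-in for a metrics backend
        self.log = logging.getLogger(name)
        self._handlers: Dict[str, Callable[[dict], dict]] = {}

    def handler(self, route: str):
        """Decorator that registers a service-specific handler for a route."""
        def register(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
            self._handlers[route] = fn
            return fn
        return register

    def handle(self, route: str, request: dict,
               trace_id: Optional[str] = None) -> dict:
        # Reuse the caller's trace ID if present, otherwise start a new trace.
        trace_id = trace_id or uuid.uuid4().hex
        # Shed load before doing any work when too many requests are in flight.
        if self.in_flight >= self.max_in_flight:
            self.metrics[f"{route}.shed"] += 1
            raise Overloaded(route)
        self.in_flight += 1
        start = time.monotonic()
        try:
            response = self._handlers[route](request)
            self.metrics[f"{route}.ok"] += 1
            return {**response, "trace_id": trace_id}
        except Exception:
            self.metrics[f"{route}.error"] += 1
            raise
        finally:
            self.in_flight -= 1
            # One log format for every service built on the framework.
            self.log.info("route=%s trace_id=%s latency_s=%.6f",
                          route, trace_id, time.monotonic() - start)


# Usage: the service author writes only business logic.
app = ServiceFramework("greeter", max_in_flight=2)


@app.handler("hello")
def hello(request: dict) -> dict:
    return {"message": f"hello {request['name']}"}


reply = app.handle("hello", {"name": "sre"}, trace_id="abc123")
```

The design choice worth noting is that the handler contains no logging, metrics, or tracing code at all. A change to the logging format or the overload policy is made once in the framework and reaches every service built on it.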