SRE at Google

Site Reliability Engineering (SRE) was founded and first implemented at Google in 2003 by Ben Treynor Sloss. Google's implementation therefore serves as the origin story for SRE.

Technology

Google's technology stack enables SRE practices and can serve as a model for implementers:

Networking

Compute

Storage

Databases

Release engineering

Workforce and culture

  • 50-60% of SREs are software engineers; the remaining 40-50% come close to that qualification and bring additional skills (Unix administration and networking at layers 1-3 of the OSI model).
  • Belief in and aptitude for developing software solutions.
  • 50% cap on operations work, aiming for much lower.
  • Single shared repository for all internal software.
  • Code yellow: with executive approval, the development team is pulled onto resolving priority backlog items to prevent future incidents.
  • SRE and product development teams are separate; product development pays for SRE growth by funding hires upon a successful launch.

Specifically when applied to Google Cloud Platform:

  • CREs, or Customer Reliability Engineers, are customer-facing SREs.
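
The 50% cap on operations work listed above lends itself to a simple check. The following is a minimal sketch, assuming hours are logged in two buckets; the names TimeLog, ops_fraction and exceeds_ops_cap are hypothetical and not taken from any Google tooling.

```python
from dataclasses import dataclass

# Cap on operations work from the list above (50%).
OPS_CAP = 0.50


@dataclass
class TimeLog:
    """Hypothetical record of how one SRE (or team) spent a period."""
    ops_hours: float      # tickets, on-call, manual tasks (assumed bucket)
    project_hours: float  # engineering project work (assumed bucket)


def ops_fraction(log: TimeLog) -> float:
    """Return the share of logged time spent on operations work."""
    total = log.ops_hours + log.project_hours
    if total <= 0:
        raise ValueError("no hours logged")
    return log.ops_hours / total


def exceeds_ops_cap(log: TimeLog, cap: float = OPS_CAP) -> bool:
    """True when operations work is strictly above the cap."""
    return ops_fraction(log) > cap
```

A log of 30 operations hours against 10 project hours gives a fraction of 0.75 and is flagged; an even 20/20 split sits exactly at the cap and is not. Since the stated aim is "much lower" than 50%, a team might pass a smaller cap value.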

DevOps

Google has an institutional blindness to DevOps: since there's no traditional IT function, DevOps is just "what people do".

Team size

To allow for emergency response without burnout:

  • 8 people if the team is bound to a single region
  • 6 people in each geography if the team is spread across several
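
The two team-size figures above can be expressed as a small staffing check. This is a sketch only: it assumes the figures are treated as minimums, and the function name understaffed_sites and the site names in the usage note are hypothetical.

```python
# Figures from the list above, treated here as minimums (an assumption).
MIN_SINGLE_SITE = 8     # team bound to a single region
MIN_PER_SITE_MULTI = 6  # each geography of a multi-site team


def understaffed_sites(headcount_by_site: dict[str, int]) -> list[str]:
    """Return the sites whose headcount is below the applicable figure."""
    if not headcount_by_site:
        raise ValueError("at least one site is required")
    if len(headcount_by_site) == 1:
        minimum = MIN_SINGLE_SITE
    else:
        minimum = MIN_PER_SITE_MULTI
    return sorted(
        site for site, count in headcount_by_site.items() if count < minimum
    )
```

For example, a single-site team of 7 in "zurich" is reported as understaffed, while the same 7 people paired with a second site of 5 in "sydney" leaves only "sydney" below the figure.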