Site Reliability Engineering

Site Reliability Engineering (SRE; also called Platform Engineering or Production Engineering) is an engineering discipline that enables organisations to sustainably achieve an appropriate level of reliability in their platforms.

It applies software engineering principles to IT operations and service management. It can be considered a narrow implementation of DevOps, and is aimed at giving operators agency over their work.

Founding principles:

Operations is a software problem.
Manage by Service Level Objectives.
Work to minimise toil.
Automate this year's job away.
Move fast by reducing the cost of failure.
Share ownership with developers.
Use the same tooling, regardless of function or job title (but APIs will outlive tools).

Primary responsibilities:

Monitor everything
Reduce toil through automation and problem reduction
Manage risk through SLIs, SLOs and an error budget
Document and share knowledge, encouraging best practice
Build resilient-enough services, starting early in the design phase
Remediate escalations
Carry a pager and be on-call
Learn from outages using meaningful postmortems
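
A minimal sketch of how an error budget falls out of an SLO, assuming a request-based SLI. The function names and the example figures are illustrative, not taken from any particular tool:

```python
def error_budget(slo: float, total_requests: int) -> float:
    """Requests allowed to fail in the window without breaching the SLO."""
    return (1.0 - slo) * total_requests


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent.

    Goes negative once more requests have failed than the SLO allows.
    """
    budget = error_budget(slo, total_requests)
    if budget == 0:
        # A 100% SLO (or a window with no traffic) leaves nothing to spend.
        raise ValueError("no error budget: SLO is 100% or there was no traffic")
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # Hypothetical window: 99.9% SLO, one million requests, 250 failures.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
```

With a 99.9% SLO over one million requests, about a thousand failures are affordable; 250 failures leaves roughly three quarters of the budget to spend on change.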

Early examples

Margaret Hamilton on the Apollo program

Focuses

Nuances differ, but key focuses are commonly:

Availability
Latency
Performance
Efficiency
Change management
Monitoring
Emergency response
Capacity planning

Core tenets

Ensuring a durable focus on engineering by capping toil at 50%, diverting excess work (on-call rota, bugs) to the product team, and producing postmortems for all incidents.
Maximising change velocity without violating SLOs through use of an error budget to address the reliability vs innovation conflict. SRE recognises that there are many obstacles to 100% availability and that aiming for such is rarely valuable.
Monitoring should alert only at the point action needs to be taken. Less critical notifications should be ticketed, and background noise should be relegated to logs.
Change management: acknowledging that roughly 70% of outages are caused by changes to a live system, and reducing their impact by using progressive rollouts (see Continuous delivery), detecting problems quickly and rolling back safely.
Demand forecasting and capacity planning: ensuring there's sufficient capacity for user traffic through regular load tests based on accurate organic demand forecasts and inorganic event sources.
Provisioning of instances based on capacity planning exercises.
Efficiency and performance of the provisioned system must be maintained through monitoring and assessment of cost and performance.
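
The monitoring tenet above (page only when action is needed, ticket the less critical, log the rest) can be sketched as a small routing rule. The two boolean inputs and the `Route` names are assumptions made for illustration; a real alerting system would derive them from alert rules:

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"      # interrupt a human now
    TICKET = "ticket"  # a human should act, but not immediately
    LOG = "log"        # background noise, kept for later analysis


def route_signal(needs_human_action: bool, urgent: bool) -> Route:
    """Decide where a monitoring signal should go."""
    if needs_human_action and urgent:
        return Route.PAGE
    if needs_human_action:
        return Route.TICKET
    # Nothing for a human to do, so it never becomes a notification.
    return Route.LOG


if __name__ == "__main__":
    print(route_signal(needs_human_action=True, urgent=True))
    print(route_signal(needs_human_action=True, urgent=False))
    print(route_signal(needs_human_action=False, urgent=False))
```

Note that an "urgent" signal nobody can act on still goes to the log: urgency without an available action is not a reason to page.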

Embrace risk

Extreme reliability is costly: each additional nine costs disproportionately more than the last. Often overlooked is the opportunity cost: effort spent chasing reliability is effort not spent on product innovation, which can mean lost sales.

Risk is measured against uptime (Nines of reliability). In a single region uptime can be measured in time:

availability = uptime / total time

Uptime for a multi-region service might be based on aggregate transactions:

availability = successful requests / total requests
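
A worked sketch of the two formulas above, plus the downtime a given target allows. The 90-day default window is an assumption standing in for "a quarter":

```python
def time_based_availability(uptime_seconds: float, total_seconds: float) -> float:
    """availability = uptime / total time"""
    return uptime_seconds / total_seconds


def request_based_availability(successful: int, total: int) -> float:
    """availability = successful requests / total requests"""
    return successful / total


def allowed_downtime_minutes(target: float, window_days: int = 90) -> float:
    """Minutes of downtime a time-based target permits over the window."""
    return (1.0 - target) * window_days * 24 * 60


if __name__ == "__main__":
    # Each additional nine cuts the permitted downtime by a factor of ten.
    for target in (0.99, 0.999, 0.9999):
        print(target, round(allowed_downtime_minutes(target), 2))
```

Over a 90-day window, three nines permits roughly 129.6 minutes of downtime and four nines roughly 13, which makes the cost of each additional nine concrete.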

Set quarterly targets and measure performance on a daily or weekly basis. Targets might consider:

The level of service users expect
Whether the service generates revenue
Whether the service is paid or free
The level of service competitors offer
The audience: consumers or enterprise

Risk tolerance should differ across failure modes: exposing users' data to the wrong audiences would be more harmful than a partial service outage.

The failure cases differ by workload too: throughput vs latency vs reliability.

Automate this year's job away

Toil is operational work of little lasting value that can be automated away or removed entirely by reworking software. It is one of four categories of SRE work:

Software engineering involves writing or modifying source code, either for automation or making robustness improvements.
Systems engineering is system configuration or documentation for the purpose of making lasting improvements.
Toil is work tied to operating a service that is manual, repetitive, automatable, has no lasting value, and scales proportionally to the service's growth.
Overhead might be ticketing system hygiene, process improvement or HR activities like training.
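
A minimal sketch of checking the split between these categories against the 50% toil cap from the core tenets. The category keys and the hours in the example are hypothetical; in practice they might come from a time-tracking export:

```python
TOIL_CAP = 0.5  # the 50% cap on toil


def toil_fraction(hours_by_category: dict[str, float]) -> float:
    """Share of total recorded hours that was spent on toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    return hours_by_category.get("toil", 0.0) / total


def over_toil_cap(hours_by_category: dict[str, float], cap: float = TOIL_CAP) -> bool:
    """True when toil exceeds the cap and excess work should be diverted."""
    return toil_fraction(hours_by_category) > cap


if __name__ == "__main__":
    # Hypothetical week for one engineer.
    week = {
        "software engineering": 10,
        "systems engineering": 6,
        "toil": 20,
        "overhead": 4,
    }
    print(toil_fraction(week), over_toil_cap(week))
```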

Toil can be cathartic in small doses but, as it scales, can drive low morale and cause career stagnation. Tolerance for toil varies from one SRE to another.

Release engineering

Philosophy:

Self-service, enabling teams to be self-sufficient and determine their own release pace.
High velocity teams want to reduce the lag time between features being completed and being available in production, e.g. push-on-green.
Hermetic builds provide consistency and repeatability: historical versions can be rebuilt when we need to troubleshoot a failure mode, and fixes from newer branches can be cherry-picked onto branches that are already deployed.
Enforcement of policies and procedures, gating operations that need review; e.g. source code and configuration changes that require review can't be merged to master until an approval has been received.
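
The gating and push-on-green ideas can be sketched as two predicates. The `Change` fields and the single-approval default are assumptions for illustration, not any specific code review system's model:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    tests_green: bool      # did the build and test run pass?
    approvals: int         # approvals received so far
    requires_review: bool  # does policy gate this change on review?


def may_merge(change: Change, required_approvals: int = 1) -> bool:
    """Policy gate: changes that require review need enough approvals."""
    if change.requires_review and change.approvals < required_approvals:
        return False
    return True


def may_push_on_green(change: Change) -> bool:
    """Push-on-green: release as soon as policy is satisfied and tests pass."""
    return may_merge(change) and change.tests_green


if __name__ == "__main__":
    reviewed = Change(tests_green=True, approvals=1, requires_review=True)
    unreviewed = Change(tests_green=True, approvals=0, requires_review=True)
    print(may_push_on_green(reviewed), may_push_on_green(unreviewed))
```

Keeping the policy check separate from the test status means the same gate applies whether a team releases on every green build or on a slower cadence.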

Configuration management approaches differ according to how often configuration changes and how closely those changes align with deployments. Where possible, prefer building static values into the binary or including them as part of packaging.

Simplicity

Software is inherently dynamic and unstable; total stability is possible only inside a vacuum. Our job is to maintain the balance between agility and stability.

Boring won't wake you up at 3am. Avoid over-engineering, and don't be afraid of purging old code; it'll still be in the source code management history anyway and removing code reduces risk, maintenance burden and complexity.

Focused releases are easier to troubleshoot: change will happen, so minimising the scope of a release will help with isolation of a problem later.

Minimal APIs are the hallmark of a well-understood problem. Modularity can introduce complexity, but can be used to demarcate responsibilities between teams.

Foundations

Practices

The Service Reliability Hierarchy applies Maslow's Hierarchy of Needs to service delivery: the lower levels must be met before the higher levels can be delivered. From the top of the hierarchy to its base:

Product
Development
Capacity planning
Testing and release procedures
Postmortem/Root cause analysis
Incident response
Monitoring
