Site Reliability Engineering (Platform Engineering, Production Engineering) is an engineering discipline enabling organisations to sustainably achieve the appropriate level of reliability in their platforms.
- Operations is a software problem.
- Manage by Service Level Objectives.
- Work to minimise toil.
- Automate this year's job away.
- Move fast by reducing the cost of failure.
- Share ownership with developers.
- Use the same tooling, regardless of function or job title (but APIs will outlive tools).
- Monitor everything
- Reduce toil through automation and problem reduction
- Manage risk through SLIs, SLOs and an error budget
- Documenting and sharing knowledge, encouraging best practice
- Building resilient-enough services, early in the design phase
- Remediating escalations
- Carrying a pager and being on-call
- Learn from outages using meaningful postmortems
- Margaret Hamilton on the Apollo program
Nuances differ, but key focuses are commonly:
- Change management
- Emergency response
- Capacity planning
- Ensuring a durable focus on engineering by capping toil at 50%, diverting excess work (on-call rota, bugs) to the product team, and producing postmortems for all incidents.
- Maximising change velocity without violating SLOs through use of an error budget to address the reliability vs innovation conflict. SRE recognises that there are many obstacles to 100% availability and that aiming for such is rarely valuable.
- Monitoring should alert only at the point action needs to be taken. Less critical notifications should be ticketed, and background noise should be relegated to logs.
- Change management, acknowledging that roughly 70% of outages are caused by changes to a live system, and reducing their impact through progressive rollouts (see Continuous delivery), quick and accurate detection of problems, and safe rollbacks.
- Demand forecasting and capacity planning: ensuring there's sufficient capacity for user traffic through regular load tests based on accurate organic demand forecasts and inorganic event sources.
- Provisioning of instances based on capacity planning exercises.
- Efficiency and performance of the provisioned system must be maintained through monitoring and assessment of cost and performance.
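The monitoring rule in the list above (page only when action is needed now, ticket less critical notifications, relegate the rest to logs) can be sketched as a small routing function. This is a minimal illustration only: `Route`, `route_signal` and its two boolean inputs are hypothetical names, not part of any monitoring tool.

```python
from enum import Enum


class Route(Enum):
    """Where a monitoring signal should end up."""
    PAGE = "page"      # a human must act now
    TICKET = "ticket"  # a human must act, but not immediately
    LOG = "log"        # background noise, kept for later diagnosis


def route_signal(needs_human_action: bool, urgent: bool) -> Route:
    """Route a signal according to whether, and how soon, action is needed."""
    if needs_human_action and urgent:
        return Route.PAGE
    if needs_human_action:
        return Route.TICKET
    # Nothing for a human to do: urgency is irrelevant, so just record it.
    return Route.LOG


if __name__ == "__main__":
    print(route_signal(needs_human_action=True, urgent=True).value)
    print(route_signal(needs_human_action=True, urgent=False).value)
    print(route_signal(needs_human_action=False, urgent=False).value)
```

The point of the sketch is that paging is the narrowest path: a signal only reaches a pager when someone has to act and has to act now.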
Extreme reliability is costly: each additional nine costs considerably more than the last. An often overlooked cost is the opportunity cost: engineering effort spent on reliability is effort not spent on product innovation, which can mean lost sales.
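To make the price of each nine concrete, the downtime a target permits can be worked out directly. The sketch below assumes a 90-day quarter as the measurement window; `allowed_downtime_minutes` is an illustrative helper, not a standard function.

```python
def allowed_downtime_minutes(availability: float, period_days: float = 90.0) -> float:
    """Minutes of downtime permitted in the period at the given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    total_minutes = period_days * 24 * 60
    return (1.0 - availability) * total_minutes


if __name__ == "__main__":
    # Each extra nine divides the permitted downtime by ten.
    for target in (0.99, 0.999, 0.9999, 0.99999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.3%} -> {minutes:8.2f} minutes per 90 days")
```

Under that assumption, three nines allows roughly 130 minutes of downtime per quarter, while five nines allows a little over one minute.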
Risk is measured against availability, usually expressed in nines (for example, 99.9%). For a single-region service, availability can be measured in time:
availability = uptime / total time
Availability for a multi-region service might instead be based on aggregate transactions:
availability = successful requests / total requests
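Both formulas, and the error budget mentioned earlier, can be sketched in a few lines. The function names and the example figures are assumptions for illustration; the error budget is taken here to be the share of requests the SLO allows to fail.

```python
def time_based_availability(uptime_seconds: float, total_seconds: float) -> float:
    """availability = uptime / total time"""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return uptime_seconds / total_seconds


def request_based_availability(successful: int, total: int) -> float:
    """availability = successful requests / total requests"""
    if total <= 0:
        raise ValueError("total must be positive")
    return successful / total


def error_budget_remaining(slo: float, successful: int, total: int) -> float:
    """Fraction of the error budget still unspent; negative once the SLO is violated."""
    if not 0.0 <= slo < 1.0:
        raise ValueError("slo must be in [0, 1)")
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_failures = (1.0 - slo) * total
    actual_failures = total - successful
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical quarter: one million requests, 600 of which failed.
    print(request_based_availability(999_400, 1_000_000))
    print(error_budget_remaining(0.999, 999_400, 1_000_000))
```

With a 99.9% SLO, one million requests allow about 1,000 failures, so 600 failures leave roughly 40% of the budget for further change.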
Set quarterly targets and measure performance on a daily or weekly basis. Targets might consider:
- Expected level of service
- Whether the service generates revenue
- Whether it is paid or free
- Competitors' level of service
- The audience: consumers or enterprise
Risk tolerance should differ across failure modes: exposing users' data to the wrong audiences would be more harmful than a partial service outage.
The failure cases differ by workload too: throughput vs latency vs reliability.
Automate this year's job away
Toil is operational work of little lasting value that can be automated away or removed entirely through reworking of software.
- Software engineering involves writing or modifying source code, either for automation or making robustness improvements.
- Systems engineering is system configuration or documentation for the purpose of making lasting improvements.
- Toil is work tied to operating a service that is manual, repetitive, automatable, has no lasting value, and scales proportionally to the service's growth.
- Overhead might be ticketing system hygiene, process improvement or HR activities like training.
Toil can be cathartic in small volumes but, as it scales, it can drive low morale and cause career stagnation. Note that individual SREs have different tolerances for toil.
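One way to keep an eye on the 50% cap is to categorise logged hours using the four kinds of work above and compute each category's share. This is a rough sketch: the category labels, the `time_share` helper and the sample week are all made up for illustration.

```python
from collections import defaultdict

TOIL_CAP = 0.5  # the 50% cap on toil


def time_share(entries):
    """Return each category's share of total hours from (category, hours) pairs."""
    totals = defaultdict(float)
    for category, hours in entries:
        totals[category] += hours
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {category: hours / grand_total for category, hours in totals.items()}


# Hypothetical 40-hour week for one engineer.
week = [("software", 12), ("toil", 22), ("overhead", 4), ("systems", 2)]
shares = time_share(week)
over_cap = shares.get("toil", 0.0) > TOIL_CAP

if __name__ == "__main__":
    print(f"toil share: {shares['toil']:.0%}, over cap: {over_cap}")
```

In this made-up week toil takes 22 of 40 hours, which is over the cap and would be a signal to divert work or invest in automation.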
- Self-service, enabling teams to be self-sufficient and determine their own release pace.
- High velocity teams want to reduce the lag time between features being completed and being available in production, e.g. push-on-green.
- Hermetic builds provide consistency and repeatability: historical versions can be rebuilt to troubleshoot a failure, and fixes from newer branches can be cherry-picked onto branches that are already deployed.
- Enforcement of policies and procedures, allowing gating of operations that need review; e.g. source code and configuration changes that require review can't be merged to `master` prior to receipt of an approval.
Configuration management approaches differ by how often the configuration changes and how those changes align with deployments. Prefer building static values into the binary, or including them as part of packaging, where possible.
Software is inherently dynamic and unstable; total stability is possible only inside a vacuum. Our job is to maintain the balance between agility and stability.
Boring won't wake you up at 3am. Avoid over-engineering, and don't be afraid of purging old code; it'll still be in the source code management history anyway and removing code reduces risk, maintenance burden and complexity.
Focused releases are easier to troubleshoot: change will happen, so minimising the scope of a release will help with isolation of a problem later.
Minimal APIs are the hallmark of a well understood problem. Modularity can introduce complexity, but can be used to demarcate different responsibilities between teams.
The Service Reliability Hierarchy applies Maslow's Hierarchy of Needs to service delivery. To deliver the higher levels of the hierarchy, the lower levels must first be met.
- Capacity planning
- Testing and release procedures
- Postmortem/Root cause analysis
- Incident response
*[IT]: Information Technology