Implementation of SRE is a journey:
- Avoid narrow/rigid incentives as they'll be gamed. Allow SREs the freedom to close the feedback loop between design and production. Systems with SRE input at design time will more than likely be better operationally regardless of who holds the pager.
- Fix it yourself, don't blame others: blaming others creates distinct groups and won't foster shared ownership. Larger organisations may allow withdrawing support for "irredeemably operationally difficult" products to avoid burnout, incentivising product teams to improve; smaller ones may use "when" rather than "whether".
- Consider reliability work a specialised role; practitioners benefit from peers and a career ladder.
- Parity of esteem: SREs are as much engineers as those in product.
Engineering is a constrained resource, and SLOs can be used to make data-driven prioritisation calls based on innovation, reliability and scalability. SLOs should be defended in the short term and maintained in the medium to long term.
A common pattern is the number of good events divided by the total number of events:
- Number of successful HTTP requests / total HTTP requests (success rate)
- Number of gRPC calls that completed successfully in < 100ms / total gRPC requests
- Number of search results that used the entire corpus / total number of results, including graceful degradations
- Number of "stock check count" requests that used data fresher than 10 minutes / total stock check count requests
- "Good user minutes" defined by some criteria / total user minutes
The range of these ratios, from 0% (nothing works) to 100% (nothing is wrong), makes them easier to comprehend and to monitor over time with tooling.
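A minimal sketch of the good events / total events pattern follows. The function name, the sample counts, and the choice to report 100% when there is no traffic are assumptions made for illustration.

```python
def sli_ratio(good_events: int, total_events: int) -> float:
    """Return the SLI as a percentage between 0 (nothing works) and 100 (nothing is wrong)."""
    if total_events == 0:
        # Assumption: no traffic means no evidence of failure, so report 100%.
        return 100.0
    return 100.0 * good_events / total_events


# Hypothetical counts: 9,990 successful HTTP requests out of 10,000.
success_rate = sli_ratio(9_990, 10_000)
print(f"success rate: {success_rate:.2f}%")
```

Any of the indicators listed above fits the same shape; only the definition of a "good" event changes.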
- Specification describes the user-facing service outcome.
- Implementation includes the specification and a means of measuring it.
The initial attempt doesn't need to be correct: focus on the feedback cycle.
- Choose an application.
- Define your users.
- Consider common interactions.
- Draw a high-level architecture diagram including key components, request and data flow, and critical dependencies.
Start small, using indicators that require minimal engineering effort: if you have logs but no probes, use them.
SLOs should be evaluated at consistent time intervals, which can be either rolling or fixed. Tie evaluation to user experience. A 4 week window is a good starting point.
It's recommended to select an integral number of weeks to ensure that the same number of weekends are included, avoiding uninteresting differences in metrics.
Shorter time windows reduce lead-in time and let us move more quickly, but longer windows are better for strategic evaluation: it's difficult to prioritise larger work items if the opportunity cost and reliability impact are unclear.
- Product management must agree that performance below these thresholds is unacceptably low and worth dedicating engineering resource to fix.
- Product development must agree that they will take steps to reduce risk to users upon exhaustion of the error budget, until the service is back within budget.
- The operations team defending the SLO needs to agree that it's defensible without Herculean effort, excessive toil, and burnout.
Document SLOs, ideally through GitOps, for future engineers that won't have the context you did at the time of setting them. Include:
- Authors, reviewers and approvers.
- Approval and next review dates.
- Brief service description.
- SLO details, objectives, specification and implementations.
- Details of error budget calculation and consumption.
- The rationale backing the numbers, and whether they were derived from experimental or observational data.
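As one way of documenting the error budget calculation, the sketch below derives an event-based budget from the SLO and reports how much of it has been consumed. The function names and sample figures are hypothetical, and a time-based budget would be calculated differently.

```python
def error_budget(slo: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates for the given event volume."""
    return (1.0 - slo) * total_events


def budget_consumed(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget used so far (1.0 means exhausted)."""
    budget = error_budget(slo, total_events)
    bad_events = total_events - good_events
    if budget <= 0:
        # A 100% SLO leaves no budget: any bad event exhausts it immediately.
        return float("inf") if bad_events else 0.0
    return bad_events / budget


# Hypothetical window: 1,000,000 requests, 500 of them bad, against a 99.9% SLO.
used = budget_consumed(0.999, 999_500, 1_000_000)
print(f"error budget consumed: {used:.0%}")
```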
Implementing an error budget
Error budget policies are written, approved documents that answer questions like:
- What actions take place when the service exhausts the error budget?
- Who is responsible for them?
The development team focuses exclusively on reliability issues until the system is within SLO. This responsibility comes with high-level approval to push back external feature requests and mandates.
Like SLOs, the error budget policy must be agreed with all stakeholders:
- SREs must feel that SLOs are defensible without excessive toil, else they should make a case for relaxing them.
- The development team and product manager must feel that release velocity won't fall below acceptable levels due to the additional effort required in fixing reliability issues. The number of situations in which SREs will respond will be lowered proportionally with the SLO.
- If the product manager fears the SLO will result in a bad experience for a significant number of users before the policy prompts action, the SLO may need tightening.
Getting to universal agreement may require iteration, each time determining whether you need more data, resources or changes in order to gain acceptance. Once accepted, document:
- Policy authors, reviewers and approvers
- Approval and next review dates
- Brief service description
- Actions to be taken in case of error budget exhaustion
- Escalation path if there's disagreement over the calculation or over whether the agreed-upon actions are appropriate
- An overview of error budgets for those not familiar with SRE
Without these prerequisites, the error budget becomes just another KPI:
- There are SLOs that all stakeholders in the organisation have approved as fit for the product.
- People responsible for ensuring the service meets the SLO have agreed it's possible to meet it under normal circumstances.
- The organisation commits to using the error budget for decision making and prioritisation.
- A process is in place for refining SLOs.
Dashboards showing point-in-time snapshots of SLO compliance and their trends against the previous quarter/year help chart performance. Showing burn of SLOs within the current window helps contextualise risk.
Using sources of information on user happiness, refine SLOs to better measure customer happiness. Where an SLO needs tightening to match customer expectations but isn't yet defensible, use an aspirational SLO: begin tracking the SLI alongside others, but don't yet enforce it.
In order to be an effective tool the error budget policy must be consistently applied. Remember that the scale of incident is proportional to the amount of the error budget consumed:
- Exhausted error budget or seen unsustainable burn rate? Stop all feature launches.
- Extreme situation? Declare an emergency, with higher level approval, and deprioritise all external demands.
- Modelling important user journeys, allowing tighter SLOs.
- Bucketing interactions by audience or responsiveness to give different SLOs.
- Modelling dependencies -- upstream systems should have at least the same SLO as downstream.
- Monitoring whether individual customers are within SLO is unlikely to be useful, but in aggregate it can be a measure of effectiveness.
First, identify baseline requirements for the monitoring system:
- Freshness of data and speed of retrieval; outdated or slow data isn't good for reporting.
- Consider calculations (aggregations, window functions) and necessary retention (periods, percentiles).
- API access, dashboards (heatmaps, histograms, logarithmic scale), supporting drill-down by metadata.
- Alert classification for proportional response, suppression for duplicate alerts and routing to the correct team by metadata.
- Data sources: logs, metrics.
Managing monitoring systems
Treat the monitoring system as a production system. Where possible, codify its configuration and place it under source code management allowing effective documentation, review, and linting. Maintaining loose coupling allows for configuration changes and swapping out components for better-suited ones with minimal churn.
When building dashboards, consider the different audiences (executives, leadership, product management and engineering). Try to use consistent representations of metrics across dashboards to ease identification. For engineers, place SLIs front and centre, but include additional information to ease locating probable causes to further investigate during troubleshooting:
- Intended changes to the released version or configuration.
- Dependencies in storage or other RPC services.
- Resource utilisation and saturation (RAM, disk, CPU allocation, file descriptors, active threads, queue wait times, write volume, language-specific properties), which can indicate bugs (and aid capacity management).
- Served traffic status such as HTTP status codes, aggregate of denied requests due to exceeded quotas.
Testing is a multi-stage process:
- Binary reporting: check that metrics are exported and change correctly.
- Monitoring configurations: ensure rule evaluation produces expected results and that fault conditions trigger alerts.
- Alerting configurations: test generated alerts are correctly routed based on metadata.
When deciding on alerting strategy:
- Precision -- proportion of detected events that were significantly user-impacting; 100% if every alert generated by an SLI captures an incident.
- Recall -- proportion of significant events detected; 100% if every significant event results in an alert.
- Detection time -- time taken to notify; shorter is better as it reduces the toll on the error budget.
- Reset time -- how long alerts fire after resolution; shorter is better.
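Precision and recall can be estimated from a review of past alerts and incidents. The sketch below assumes you have already counted them; the function names, the sample counts, and the handling of empty inputs are illustrative assumptions.

```python
def precision(alerts_for_significant_events: int, total_alerts: int) -> float:
    """Proportion of alerts that corresponded to a significant, user-impacting event."""
    if total_alerts == 0:
        return 1.0  # Assumption: no alerts means no false positives.
    return alerts_for_significant_events / total_alerts


def recall(significant_events_alerted: int, total_significant_events: int) -> float:
    """Proportion of significant events that resulted in an alert."""
    if total_significant_events == 0:
        return 1.0  # Assumption: nothing to detect means nothing was missed.
    return significant_events_alerted / total_significant_events


# Hypothetical quarter: 10 alerts, 8 of them for real incidents; 16 incidents in total.
print(f"precision: {precision(8, 10):.0%}")
print(f"recall: {recall(8, 16):.0%}")
```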
Based on error rate
Using alerts of the form target error rate >= SLO threshold offers low detection time, better the smaller the window, but poor precision for smaller windows and poor reset times for larger windows. Using sustained durations to delay the alert increases the detection time for larger incidents. These are ill-advised.
Error budget consumed when the alert fires: alerting window size / reporting period
Detection time: ((1 - SLO) / error ratio) * alerting window size
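As a worked example, the sketch below evaluates both quantities for an error-rate alert. The function names and the sample figures (a 99.9% SLO, a 10 minute alerting window, a 30 day reporting period) are assumptions chosen for illustration.

```python
def error_rate_detection_time(slo: float, error_ratio: float, window_minutes: float) -> float:
    """Minutes until the alert fires: ((1 - SLO) / error ratio) * alerting window size."""
    return ((1.0 - slo) / error_ratio) * window_minutes


def error_rate_budget_consumed(window_minutes: float, period_minutes: float) -> float:
    """Fraction of the error budget consumed when the alert fires: window / period."""
    return window_minutes / period_minutes


PERIOD_MINUTES = 30 * 24 * 60  # Assumption: 30 day reporting period.

# A total outage (error ratio of 1.0) against a 99.9% SLO with a 10 minute window.
seconds = error_rate_detection_time(0.999, 1.0, 10) * 60
share = error_rate_budget_consumed(10, PERIOD_MINUTES)
print(f"total outage detected after {seconds:.1f}s")
print(f"budget consumed when the alert fires: {share:.3%}")
```

With these figures the alert fires in well under a second, yet it can also fire after spending only about 0.02% of the budget, which illustrates the poor precision of small windows.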
Based on burn rate
Alerting on burn rate yields good precision, alerting only on major burn over a short time period. It has good detection time and a reasonable reset time.
Detection time: ((1 - SLO) / error ratio) * alerting window size * burn rate
Error budget consumed when the alert fires: (burn rate * alerting window size) / period
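The two burn-rate expressions can be evaluated as follows. This is a sketch with hypothetical function names and assumed figures: a 99.9% SLO, a one hour alerting window, a burn rate of 14.4 and a 30 day period.

```python
def burn_alert_detection_time(slo: float, error_ratio: float,
                              window_hours: float, burn_rate: float) -> float:
    """Hours until the alert fires: ((1 - SLO) / error ratio) * window * burn rate."""
    return ((1.0 - slo) / error_ratio) * window_hours * burn_rate


def burn_alert_budget_consumed(burn_rate: float, window_hours: float,
                               period_hours: float) -> float:
    """Fraction of the error budget consumed when the alert fires."""
    return (burn_rate * window_hours) / period_hours


PERIOD_HOURS = 30 * 24  # Assumption: 30 day SLO period.

hours = burn_alert_detection_time(0.999, 1.0, 1, 14.4)
consumed = burn_alert_budget_consumed(14.4, 1, PERIOD_HOURS)
print(f"total outage detected after {hours * 3600:.0f}s")
print(f"budget consumed when the alert fires: {consumed:.1%}")
```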
Using multiple burn rate alerts allows recording burns that don't require rapid action to a ticket system for investigation. Start with 2% in an hour and 5% in six hours for paging, and raise tickets on 10% consumption in three days. This will yield easier manageability and better precision and recall at the expense of additional complexity and a longer reset time.
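The suggested starting points translate into burn rates once an SLO period is fixed. The sketch below assumes a 30 day period; the function name and the `ALERTS` table are hypothetical.

```python
PERIOD_HOURS = 30 * 24  # Assumption: 30 day SLO period.


def burn_rate_for(budget_fraction: float, window_hours: float,
                  period_hours: float = PERIOD_HOURS) -> float:
    """Burn rate that consumes `budget_fraction` of the budget within `window_hours`."""
    return budget_fraction * period_hours / window_hours


# (action, budget fraction consumed, window in hours)
ALERTS = [
    ("page", 0.02, 1),
    ("page", 0.05, 6),
    ("ticket", 0.10, 72),
]

for action, fraction, hours in ALERTS:
    rate = burn_rate_for(fraction, hours)
    print(f"{action}: {fraction:.0%} in {hours}h -> burn rate {rate:.1f}")
```

Under that assumption the three thresholds correspond to burn rates of roughly 14.4, 6 and 1; a different period length changes these values proportionally.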
Using multiple windows allows us to alert only when actively burning the budget by sampling a shorter time window in addition to a longer one. Start with the shorter window around 1/12th of the larger one. This method offers the most flexibility and good recall and precision at the expense of complexity.
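A multiwindow condition can be sketched as a check that both the long and the short window exceed the same burn-rate threshold. The function name and the error ratios below are illustrative, and the code assumes the ratios for each window have already been computed by the monitoring system.

```python
def should_alert(long_window_error_ratio: float, short_window_error_ratio: float,
                 slo: float, burn_rate: float) -> bool:
    """Alert only if the budget burned quickly over the long window AND is still burning."""
    threshold = burn_rate * (1.0 - slo)
    return (long_window_error_ratio > threshold
            and short_window_error_ratio > threshold)


# Hypothetical 99.9% SLO, burn rate 14.4, 1 hour long window, 5 minute short window (1/12th).
print(should_alert(0.02, 0.03, slo=0.999, burn_rate=14.4))  # ongoing incident
print(should_alert(0.02, 0.0, slo=0.999, burn_rate=14.4))   # incident already over
```

The short window is what shortens the reset time: once errors stop, the second condition fails within minutes even though the long window still shows a high error ratio.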
Lower traffic services
Sometimes traffic follows a defined pattern over weekdays and weekends, but other systems just receive less traffic. These lower traffic services present challenges to monitoring:
- If alerting windows are too large the impact to users for any given outage can be high and detection times can be poor.
- If alerting windows are too small the service may constantly alert.
- Synthetic/artificial traffic offers a partial solution, but only where it's reasonable to effectively imitate user behaviour. Failing to effectively cover all user journeys may prevent legitimate alerts from being sent.
- Combining smaller, related services into a larger one is most likely to address the challenge.
- Making product changes so that failed requests are retried may reduce alert volume.
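To see why small alerting windows are noisy at low traffic, consider the burn rate produced by a single failure. The sketch below uses assumed figures (10 requests in an hour, a 99.9% SLO, a 30 day period) and a hypothetical function name.

```python
def observed_burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    error_ratio = bad_events / total_events
    return error_ratio / (1.0 - slo)


PERIOD_HOURS = 30 * 24  # Assumption: 30 day SLO period.

# One failed request out of 10 in an hour.
rate = observed_burn_rate(1, 10, slo=0.999)
budget_share = rate * 1 / PERIOD_HOURS  # burn rate * window / period
print(f"burn rate: {rate:.0f}x")
print(f"share of the 30 day budget consumed: {budget_share:.1%}")
```

A single failure in that hour exceeds any of the paging thresholds suggested earlier, so the alert fires on what may be an isolated error.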
Reliability and efficiency are founded in simplicity, thus leadership must value these product "features" and celebrate simplification projects as successes, like product launches. Amongst engineers, consider celebrating the deletion of source code over addition.
An unreliable platform might be a sign of a burgeoning complexity problem. Allow engineers to regularly identify known system complexities and brainstorm possible simplifications. Dedicating a small rotating subset of the SRE team to maintain knowledge across the entire stack to push for conformity and simplification may help.
Diagramming system interactions can identify cyclic dependencies (which prevent cold starts because services end up depending on themselves) and amplification (due to retries).
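If the diagram is also kept as data, cyclic dependencies can be found mechanically. The sketch below runs a depth-first search over a hypothetical dependency map; the service names and the `find_cycle` helper are invented for illustration.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of services, or None if there is none."""
    path: List[str] = []  # services on the current depth-first path
    done = set()          # services whose dependencies are fully explored

    def visit(service: str) -> Optional[List[str]]:
        if service in done:
            return None
        if service in path:
            # Reached a service already on the current path: that closes a cycle.
            return path[path.index(service):] + [service]
        path.append(service)
        for dependency in deps.get(service, []):
            cycle = visit(dependency)
            if cycle:
                return cycle
        path.pop()
        done.add(service)
        return None

    for service in deps:
        cycle = visit(service)
        if cycle:
            return cycle
    return None


# Hypothetical services: auth reads from config, and config authenticates via auth.
deps = {
    "frontend": ["auth", "search"],
    "auth": ["config"],
    "config": ["auth"],
    "search": [],
}
print(find_cycle(deps))
```

In this example neither `auth` nor `config` can start first, which is the cold-start problem described above.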
Engagement models and onboarding
See Engagement models for a summary of engagement models and how businesses may evolve their approach over time.