Service Availability Calculus
The Calculus of Service Availability states that:
You're only as available as the sum of your dependencies.
Expanding on this slightly: a service can only be as reliable as the sum of its unique dependencies. Each dependency is counted once, because its failure affects the service regardless of where it sits in the dependency hierarchy, and most likely only once.
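Availability itself is expressed in terms of mean time to failure (MTTF) and mean time to repair (MTTR):
availability = MTTF / (MTTF + MTTR)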
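As a worked instance of the formula above, here is a minimal sketch. The function names and the example figures (999 hours between failures, 1 hour to repair) are assumptions chosen for illustration, not values from any real service.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - avail) * MINUTES_PER_YEAR


# Hypothetical service: fails on average every 999 hours, takes 1 hour to repair.
a = availability(mttf_hours=999.0, mttr_hours=1.0)
print(f"availability: {a:.4f}")
print(f"downtime: {downtime_minutes_per_year(a):.1f} minutes/year")
```

Note that the formula rewards shortening MTTR as much as lengthening MTTF: halving repair time roughly halves the unavailability when MTTR is small relative to MTTF.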
Rule of the extra 9
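The "golden rule of component reliability" is that all upstream dependencies of a service must offer one additional nine of availability relative to the service itself. For example, a service targeting 99.99% availability needs each of its dependencies to offer 99.999%.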
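The arithmetic behind the rule can be sketched as follows. This is a simplified model that assumes every dependency is a hard dependency and that dependencies fail independently, so their availabilities multiply. The function names and the choice of five dependencies are illustrative assumptions.

```python
import math
from typing import Iterable


def with_extra_nine(target: float) -> float:
    """Availability with one tenth of the target's unavailability."""
    return 1.0 - (1.0 - target) / 10.0


def serial_availability(dependency_availabilities: Iterable[float]) -> float:
    """Combined availability when every dependency must be up.

    Assumes failures are independent, so availabilities multiply.
    """
    return math.prod(dependency_availabilities)


service_target = 0.9999
dependency_target = with_extra_nine(service_target)  # about 0.99999

# Five hypothetical dependencies, each with the extra nine.
with_rule = serial_availability([dependency_target] * 5)

# The same five dependencies, each only as available as the service target.
without_rule = serial_availability([service_target] * 5)

print(f"with extra nine:    {with_rule:.6f}")
print(f"without extra nine: {without_rule:.6f}")
```

Under these assumptions, five dependencies with the extra nine still leave the combined figure above the 99.99% target, while five dependencies at 99.99% each pull it below the target before the service's own failures are even counted.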
Mitigations
Mitigations generally fall into one or more of these categories:
- Using a capacity cache
- Gracefully degrading behaviour
- Failing open
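A minimal sketch of how graceful degradation and failing open can combine, using a hypothetical quota check as the dependency. `DependencyError`, `is_allowed`, and the `check_quota` callable are invented for this example. Failing open is only appropriate where wrongly allowing a request is an acceptable risk.

```python
from typing import Callable, Dict


class DependencyError(Exception):
    """Raised by the (hypothetical) quota client when the dependency is down."""


def is_allowed(
    user: str,
    check_quota: Callable[[str], bool],
    last_known: Dict[str, bool],
) -> bool:
    """Decide whether to serve a request, tolerating a failed quota service."""
    try:
        allowed = check_quota(user)
    except DependencyError:
        if user in last_known:
            # Degrade gracefully: reuse the last answer we saw for this user.
            return last_known[user]
        # Fail open: no information at all, so allow rather than block.
        return True
    last_known[user] = allowed  # remember the fresh answer for later outages
    return allowed


def quota_service_down(user: str) -> bool:
    raise DependencyError("quota service unavailable")


cache: Dict[str, bool] = {}
print(is_allowed("alice", lambda user: False, cache))  # fresh answer from the dependency
print(is_allowed("alice", quota_service_down, cache))  # served from the last known answer
print(is_allowed("bob", quota_service_down, cache))    # nothing cached, fails open
```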
They can approach the problem by reducing:
- Frequency of failure through reliability improvement work.
- Scope of impact via sharding, geographic isolation, graceful degradation or customer isolation.
- MTTR via monitoring/automation enhancements or changes to incident management policy.
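One way to see why these three levers matter is a simplified model in which yearly impact is the product of failure frequency, the fraction of users affected, and the time to repair. The model and all figures below are illustrative assumptions, not measurements.

```python
def impact_minutes_per_year(
    failures_per_year: float,
    scope_fraction: float,
    mttr_minutes: float,
) -> float:
    """User-weighted downtime per year: frequency x scope x MTTR."""
    if failures_per_year < 0 or mttr_minutes < 0:
        raise ValueError("frequency and MTTR must be non-negative")
    if not 0.0 <= scope_fraction <= 1.0:
        raise ValueError("scope_fraction must be between 0 and 1")
    return failures_per_year * scope_fraction * mttr_minutes


# Hypothetical baseline: monthly failures, all users affected, one hour to repair.
baseline = impact_minutes_per_year(12, 1.0, 60)

# Pulling each lever on its own:
fewer_failures = impact_minutes_per_year(6, 1.0, 60)   # reliability work halves frequency
smaller_scope = impact_minutes_per_year(12, 0.25, 60)  # sharding limits each failure to 25% of users
faster_repair = impact_minutes_per_year(12, 1.0, 15)   # automation cuts MTTR to 15 minutes

print(baseline, fewer_failures, smaller_scope, faster_repair)
```

Because the terms multiply, improvements compound: a service that both shards and repairs faster sees the product of the two reductions.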
Practical examples of mitigations include:
- Working with dependency service owners during launches to ensure sufficient capacity exists in upstream systems.
- Adding redundancy and fault isolation to the system.
- Working failover and fallback into interactions with dependencies.
- Performing interactions with dependencies asynchronously to reduce the latency impact of a failed request.
- Planning launches in the context of capacity management.
- Reducing configuration differences between deployments to reduce complexity: the fewer paths through the code, the better tested the used ones are likely to be.
- Improving monitoring systems to better detect issues, and improving troubleshooting guidance to speed up identification of the problem.
- Implementing fast and reliable fallback to another instance, e.g. in a different region.
- Thoroughly testing the software, making effective use of integration testing.
- Planning for the future, paying close attention to the impact of introducing new dependencies into services.
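Several of the items above (asynchronous interaction, failover and fallback, fast fallback to another instance) can be sketched together. This is a minimal example using Python's `asyncio`; `call_with_fallback` and the two stand-in backends are hypothetical, and a real client would also need retries, backoff and health checking.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence

Backend = Callable[[], Awaitable[str]]


async def call_with_fallback(backends: Sequence[Backend], timeout_s: float) -> str:
    """Try each backend in order, bounding the wait on each with a timeout."""
    last_error: Optional[BaseException] = None
    for backend in backends:
        try:
            # The timeout caps the latency cost of a slow or hung instance.
            return await asyncio.wait_for(backend(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc  # remember the failure and move to the next instance
    raise RuntimeError("all backends failed") from last_error


async def slow_primary() -> str:
    await asyncio.sleep(1.0)  # stands in for an unresponsive instance
    return "primary"


async def healthy_secondary() -> str:
    return "secondary"  # stands in for an instance in another region


result = asyncio.run(
    call_with_fallback([slow_primary, healthy_secondary], timeout_s=0.05)
)
print(result)
```

The order of `backends` encodes preference, so the fallback instance is only contacted once the preferred one has failed or timed out.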