Service Availability Calculus

The Calculus of Service Availability states that:

You're only as available as the sum of your dependencies.

Expanding on this slightly: a service can only be as reliable as the sum of its unique dependencies. Each dependency counts once, because its failure affects the service regardless of where it sits in the dependency hierarchy.

availability = MTTF / (MTTF + MTTR)

Here MTTF is the mean time to failure and MTTR is the mean time to repair.
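As a worked illustration of the formula above, the following minimal sketch computes availability from MTTF and MTTR and converts it into expected downtime per year. The function names and the sample figures (999 hours between failures, 1 hour to repair) are assumptions chosen for illustration, not values from the source.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years for simplicity


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Return MTTF / (MTTF + MTTR) as a fraction between 0 and 1."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected minutes of downtime per year for a given availability."""
    return (1.0 - avail) * MINUTES_PER_YEAR


# Hypothetical service: fails on average every 999 hours, takes 1 hour to repair.
three_nines = availability(mttf_hours=999.0, mttr_hours=1.0)
yearly_downtime = downtime_minutes_per_year(three_nines)
```

With these sample figures the availability works out to 99.9%, or roughly 526 minutes of downtime per year. Halving MTTR improves availability almost as much as doubling MTTF would.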

Rule of the extra 9

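The "golden rule of component reliability" is that all upstream dependencies of a service must offer one additional nine of availability relative to the service itself. For example, a service targeting 99.99% needs dependencies that each offer 99.999%.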
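The arithmetic behind the extra nine can be sketched as follows. This is a simplified model, assuming that every dependency is a hard (critical) dependency and that dependencies fail independently, so availabilities multiply. The function name and the choice of three dependencies are hypothetical.

```python
import math


def serial_availability(own: float, dependencies: list[float]) -> float:
    """Availability when the service and all of its critical dependencies
    must be up at the same time (independent failures assumed)."""
    return own * math.prod(dependencies)


TARGET = 0.9999  # the service's own target: four nines

# Three dependencies with one extra nine each (99.999%). The service itself
# is treated as perfect (1.0) to isolate the dependencies' contribution.
with_extra_nine = serial_availability(1.0, [0.99999] * 3)

# The same three dependencies offering only the service's own target (99.99%).
without_extra_nine = serial_availability(1.0, [0.9999] * 3)

meets_target_with_extra_nine = with_extra_nine >= TARGET
meets_target_without_extra_nine = without_extra_nine >= TARGET
```

Under these assumptions, three dependencies at 99.999% use about 30% of the service's unavailability budget, leaving room for the service's own failures. At 99.99% each, the dependencies alone already push the service below its target.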

Mitigations

Mitigations generally fall into one or more categories, approaching the problem by reducing:

  • Frequency of failure through reliability improvement work.
  • Scope via sharding, isolation, graceful degradation or customer isolation.
  • MTTR via monitoring/automation enhancements or changes to incident management policy.
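One way to see why these three levers matter is a deliberately simplified model in which expected unavailability is the product of failure frequency, the fraction of users affected, and time to repair. The model, the function name, and the baseline numbers are assumptions for illustration and are not taken from the source.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def expected_unavailability(
    incidents_per_year: float,
    fraction_affected: float,
    mttr_minutes: float,
) -> float:
    """Fraction of user-minutes lost per year under the simplified
    frequency x scope x duration model."""
    lost_minutes = incidents_per_year * fraction_affected * mttr_minutes
    return lost_minutes / MINUTES_PER_YEAR


# Hypothetical baseline: one full outage a month, an hour to repair each.
baseline = expected_unavailability(12, 1.0, 60)

# Each mitigation category halves one factor.
fewer_failures = expected_unavailability(6, 1.0, 60)   # reliability work
smaller_scope = expected_unavailability(12, 0.5, 60)   # sharding, isolation
faster_repair = expected_unavailability(12, 1.0, 30)   # monitoring, automation
```

In this model, halving any one factor halves the unavailability, so the cheapest lever to pull is usually the right one to start with.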

Practical examples of mitigations include:

  • Working with dependency service owners during launches to ensure sufficient capacity exists in upstream systems.
  • Adding redundancy and fault isolation to the system.
  • Working failover and fallback into interactions with dependencies.
  • Performing interactions with dependencies asynchronously, to reduce the latency impact of a failed request.
  • Planning launches in the context of capacity management.
  • Reducing configuration differences between deployments to limit complexity: the fewer paths through the code, the better tested the used ones are likely to be.
  • Improving monitoring systems to detect issues sooner, and improving troubleshooting guidance to speed up identification of the problem.
  • Implementing fast and reliable fallback to another instance, e.g. in a different region.
  • Testing the software thoroughly, making effective use of integration testing.
  • Planning for the future, paying close attention to the impact of introducing new dependencies into services.
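The failover and fallback items above could look something like the sketch below: try each instance of a dependency in order and degrade to a default value if none responds. All names here (call_with_fallback, primary_region, secondary_region) are hypothetical, and a real client would likely also need timeouts, retry limits, and narrower exception handling.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_fallback(backends: Sequence[Callable[[], T]], default: T) -> T:
    """Try each backend in order and return the first successful result.

    If every backend raises, return the default so the caller can degrade
    gracefully instead of failing outright.
    """
    for backend in backends:
        try:
            return backend()
        except Exception:
            # Broad catch for brevity; a real client should catch only the
            # errors that indicate the backend is unavailable.
            continue
    return default


def primary_region() -> str:
    # Stand-in for a dependency call that is currently failing.
    raise ConnectionError("primary unavailable")


def secondary_region() -> str:
    # Stand-in for the same dependency served from another region.
    return "served from secondary"


result = call_with_fallback([primary_region, secondary_region], default="cached")
degraded = call_with_fallback([primary_region], default="cached")
```

Here the first call fails over to the secondary instance, and the second call, with no healthy instance left, falls back to the default value.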