The cost of failure is education.
SREs must be comfortable with failure. That requires open discussion of failures, so that engineers can assess the impact of an incident and identify root causes to direct future engineering and process improvement work. Postmortems are learning opportunities, not punishments.
Postmortems require extensive collaboration between multiple teams, and are therefore costly to carry out. The choice to write up a postmortem should be deliberate. Common triggers include:
- User-visible downtime beyond a permitted threshold.
- Any loss of data.
- On-call engineer intervention.
- Resolution time beyond a permitted threshold.
- A monitoring failure, causing manual discovery.
- A stakeholder request.
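As a rough illustration, the triggers listed above could be encoded as a simple check run against each incident record. This is only a sketch: the `Incident` fields, the two threshold values and the `postmortem_triggers` function are hypothetical names chosen for this example, and each organisation would set its own thresholds.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Incident:
    """Hypothetical incident record; the field names are illustrative only."""
    user_visible_downtime: timedelta
    resolution_time: timedelta
    data_lost: bool = False
    on_call_intervened: bool = False
    detected_manually: bool = False  # monitoring did not catch the incident
    stakeholder_requested: bool = False


# Assumed thresholds; the notes only say "a permitted threshold".
MAX_DOWNTIME = timedelta(minutes=5)
MAX_RESOLUTION_TIME = timedelta(hours=1)


def postmortem_triggers(incident: Incident) -> list[str]:
    """Return the triggers this incident meets; an empty list means none."""
    checks = [
        (incident.user_visible_downtime > MAX_DOWNTIME,
         "user-visible downtime beyond threshold"),
        (incident.data_lost, "data loss"),
        (incident.on_call_intervened, "on-call engineer intervention"),
        (incident.resolution_time > MAX_RESOLUTION_TIME,
         "resolution time beyond threshold"),
        (incident.detected_manually, "monitoring failure"),
        (incident.stakeholder_requested, "stakeholder request"),
    ]
    return [reason for met, reason in checks if met]
```

Returning the list of matched triggers, rather than a bare yes or no, keeps the decision deliberate: whoever opens the postmortem can record why it was written.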
Everyone involved in the response to the incident should be involved in writing up the postmortem. The goal is to explore what happened and identify and prioritise changes that can be made to prevent recurrences.
Postmortems should include:
- Details of the incident, including a timeline of events.
- Actions taken to mitigate and resolve.
- Impact to customers, and to other services.
- Trigger and root cause, or causes.
- Follow-up actions to prevent a recurrence.
Remember: our goal is to ensure the failure mode is understood and to prevent a future recurrence of a stressful outage. Improvements aren't limited to the platforms in question: if an additional metric or signal would have been useful, there's value in raising it here.
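One way to keep postmortems consistent is to capture the sections listed above in a template that can report which parts are still empty. The following is a minimal sketch, not a prescribed format: the `Postmortem` and `TimelineEntry` classes and all of their field names are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TimelineEntry:
    time: str   # free-form timestamp, e.g. "14:02 UTC"
    event: str  # what happened or what action was taken


@dataclass
class Postmortem:
    """Hypothetical template mirroring the sections a postmortem should include."""
    summary: str = ""                                             # details of the incident
    timeline: list[TimelineEntry] = field(default_factory=list)   # timeline of events
    mitigation_actions: list[str] = field(default_factory=list)   # actions taken to mitigate and resolve
    customer_impact: str = ""                                     # impact to customers
    service_impact: str = ""                                      # impact to other services
    trigger: str = ""                                             # what set the incident off
    root_causes: list[str] = field(default_factory=list)          # one or more root causes
    follow_up_actions: list[str] = field(default_factory=list)    # work to prevent a recurrence

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty, in declaration order."""
        return [name for name, value in vars(self).items() if not value]
```

A draft could then be checked before circulation, for example by calling `missing_sections()` and asking the authors to fill in whatever it returns.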
Psychological safety is the belief that people won't be punished or humiliated for raising concerns, questions or mistakes.
Low-safety environments rob teams of learning opportunities and stifle innovation, as engineers keep concerns to themselves. Rather than focusing on improvements that better everyone's working experience, engineers worry about looking incompetent or ignorant, or about being ridiculed for asking a stupid question.
Breed psychological safety by:
- Treating work as a learning problem, not an execution problem.
- Acknowledging your own fallibility, and knowing when to ask for help.
- Modelling curiosity.
Psychologically safe workplaces tend to:
- Encourage bridging, fostering high levels of cooperation between silos.
- Avoid punishing messengers for delivering bad news.
- Treat failure as an opportunity for future improvement.
- Welcome new ideas and perspectives.
Such organisations benefit from a reduction in lead time, an increase in deployment frequency, and improved time to restore.
Blame creeps in with:
- Hindsight bias, the tendency to overestimate one's ability to have predicted an outcome that was in fact unpredictable, which leads to blaming the person in charge for missing the obvious.
- Discomfort discharge, blaming others to discharge discomfort and pain at a neurobiological level, which manifests as finger-pointing.
Both behaviours can be avoided by switching the focus of postmortems from people to systems and processes, and assuming good faith in engineers' actions. In general, avoid attributing pages to a specific team.
Tools used to support the postmortem process should facilitate real-time collaboration between the teams, allow open commenting and annotation, and make it easy to loop in others in the organisation who may have valuable insight.
Review ensures that the postmortem is complete and technically correct. Circulate the first draft internally and have a group of senior engineers assess it for completeness and correctness on criteria such as:
- Whether key incident data is included for posterity.
- Completeness of impact assessments.
- The depth of the root cause analysis.
- The appropriateness of the identified bug fixes and their priorities.
- Whether the outcome was shared with the relevant stakeholders.
Postmortems should be shared with the wider organisation so that the lessons learned are visible to other teams who may benefit.
Instilling the culture
The value of postmortems is in their subsequent review and actioning. Setting up some of the following might help the process gain traction:
- Postmortem of the month
- Postmortem reading club
- Wheel of misfortune
- Internal chat or messaging board