A Production Readiness Review (PRR) is a consultation between a product development team and an SRE team, covering an initial product launch or a subsequent major change. It has two objectives:
- Verify that the service meets accepted standards of operational readiness, and that service owners are prepared to work with the SRE team to take advantage of its expertise.
- Improve the reliability of the service in production, minimising incident volume and impact.
Consultation may be insufficient if the service:
- has grown by orders of magnitude since launch; or
- has many dependents, serving significantly more traffic than originally intended.
A PRR might take place at any one of these stages of the Software Development Lifecycle:
- Build and implement
SRE leadership first decides which SRE team is a good fit for production take-over, selecting 1-3 members to conduct the process with the development team. The initial discussion between the two teams should cover:
- Establishing SLO/SLA
- Design changes
- Planning/training schedules
Seek agreement over the process, end goals and outcomes required for SRE engagement.
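As an illustration of what establishing an SLO can involve, the sketch below converts an availability target into an error budget for a review window. The 99.9% target, the 30-day window and the function name are assumptions chosen for the example, not values prescribed by the PRR process.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Return the downtime (in minutes) allowed by an availability SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    # Whatever the SLO does not promise is the budget for failure.
    return (1.0 - slo_target) * total_minutes


# Hypothetical target: 99.9% availability over 30 days.
budget = error_budget_minutes(0.999, 30)
print(f"{budget:.1f} minutes of allowed downtime")
```

A number like this gives both teams a concrete figure to negotiate over when agreeing the end goals of the engagement.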
The analysis stage lets SRE reviewers learn about the service's design and implementation, so they can start identifying possible shortcomings along the axes of concern. Where necessary, SRE reviewers should consult other teams that have more experience or domain knowledge of particular components or dependencies.
SRE teams should maintain a checklist for such reviews, covering items like:
- Are updates to the service too disruptive?
- Are upstream services being used as intended, and are their SLOs adequate?
- Does the service request a high enough network QoS for critical services?
- Are errors appropriately reported to logging systems?
- Are all failure conditions adequately reported?
- Is the service well instrumented and monitored, with alerting in place?
- Does the service meet operational standards? A review of recent incidents, postmortems and follow-ups helps gauge hygiene.
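One way to keep such a checklist reviewable is to hold it as data, so that open items can be listed per service. The following is a minimal sketch; the `ChecklistItem` type, the axis names and the sample answers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChecklistItem:
    axis: str                          # hypothetical axis of concern
    question: str
    satisfied: Optional[bool] = None   # None means not yet reviewed


def open_items(items: List[ChecklistItem]) -> List[ChecklistItem]:
    """Return items that failed review or have not been reviewed yet."""
    return [item for item in items if item.satisfied is not True]


checklist = [
    ChecklistItem("logging", "Are errors appropriately reported to logging systems?", True),
    ChecklistItem("dependencies", "Are upstream services being used as intended?", False),
    ChecklistItem("monitoring", "Is the service monitored, with alerting?"),
]

for item in open_items(checklist):
    state = "failed" if item.satisfied is False else "not reviewed"
    print(f"[{item.axis}] {item.question} ({state})")
```

Treating unreviewed items as open keeps gaps in the review visible instead of letting them pass silently.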
Improvements and refactoring
Once shortcomings have been identified, recommending and making improvements to the service should proceed as follows:
- Prioritise improvements based upon reliability importance.
- Discuss and negotiate with the development team to agree a plan of execution.
- SRE and product development teams collaborate on refactoring the service and implementing improvements.
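The prioritisation step can be sketched as a simple sort. The scoring scheme here (a 1 to 5 reliability impact score, with estimated effort as a tie-breaker) is an assumption for illustration, and the backlog entries are invented; teams will weigh these factors in their own way.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Improvement:
    title: str
    reliability_impact: int  # 1 (low) to 5 (high); hypothetical scale
    effort_days: int         # rough estimate, used only to break ties


def prioritise(improvements: List[Improvement]) -> List[Improvement]:
    """Order by reliability impact (highest first), then by least effort."""
    return sorted(improvements, key=lambda i: (-i.reliability_impact, i.effort_days))


backlog = [
    Improvement("Add alerting on error rate", 5, 3),
    Improvement("Tidy dashboard layout", 1, 1),
    Improvement("Stage rollouts gradually", 5, 8),
    Improvement("Report all failures to logging", 3, 2),
]

for item in prioritise(backlog):
    print(item.reliability_impact, item.title)
```

The ordered list is only a starting point for the negotiation with the development team, not a replacement for it.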
The SRE reviewers involved in the PRR are responsible for training the wider SRE team that will assume responsibility for the service. They should arrange training sessions and exercises based on:
- A design overview.
- A description of the production environment.
- Deep dives on request flows.
- Hands-on exercises for operations.
Onboarding is the progressive transfer of responsibility for the production environment (operations, change management process and access management) from the product development team to the SRE team. It requires the development team to remain available for backup and advice as the SRE team assumes ownership, and this relationship becomes the basis of the ongoing engagement.
As the SRE team learns more about the service through incident response, change review and root cause analysis, it should continue to share expertise with the development team in the form of suggestions and proposals. Lessons learned should be contributed back to any production guides maintained by the SRE team.
This model has several limitations:
- Additional communication adds process overhead and increases the cognitive load on reviewers.
- The SRE reviewers must be available, and able to manage their time and priorities around existing commitments.
- Work done by the SRE team must be highly visible and receive sufficient development team review to ensure knowledge is shared; SRE must behave as a part of the development team.
- Engagement starts very late in the process, after production launch. Issues could be identified much earlier in the process.