Culture

SRE teams should aim to build an environment that:

  • Builds trust and facilitates psychological safety between team members.
  • Encourages grounded risk taking without fear of retribution.
  • Leaves engineers open to growth through experimentation and learning.
  • Fosters a sense of belonging: inclusion and commitment.

Vision statements

SRE teams need a unified vision statement that clarifies their contribution to the company's vision. It should cover:

  • Values define how the team responds to other teams, its commitment to personal and organisational goals, how it spends its time and how it operates.
  • Purpose grants satisfaction, encourages connections between team members and reduces conflict.
  • Mission aligns the team's contribution with the organisation's values. It should consider externally identified threats and opportunities as well as internal capacity and capabilities.
  • Strategy defines the required change to implement the mission in line with purpose, making effective use of the organisation's available capabilities to address threats and opportunities.
  • Goals define the desired end state and should be highly ambitious. Using objectives and key results (OKRs) encourages engineers to experiment, prioritise research and development activities and keep learning.

Communication in Production Meetings

SREs must constantly engage with product development teams in order to progress their products. This requires that their work be highly visible and subject to review.

One way to enable this is a weekly service-oriented production meeting. It should be chaired by a designated leader from the SRE team, last 30-60 minutes and be compulsory for all members of the team as well as key product stakeholders and partner development teams. The agenda should be circulated in advance and should cover:

  1. Upcoming production changes, ensuring awareness such that adequate cover can be arranged.
  2. Reviewing metrics (SLOs), even if no major events occurred, to give a general feeling for the system's performance.
  3. Outages, reviewing prior incidents and their postmortems for lessons learned.
  4. Paging events, reviewing the list of pages: who received each page, whether it paged appropriately and whether it required immediate action. A page that did not require immediate action is a candidate for removal.
  5. Non-paging events, covering:
     a. Issues that should have paged but didn't, which means the monitoring needs fixing.
     b. Issues that can't page but need attention, to direct prioritisation of instrumentation, mitigation and fixes.
  6. A review of previous action items, to be tracked as with other project work.
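
The paging review in item 4 can be prepared before the meeting with a small script. The following is a minimal sketch, not a prescribed tool: the `Page` record, its field names and the sample data are all hypothetical, and it assumes that an alert is only proposed for removal when none of its pages in the period required immediate action.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """One paging event (hypothetical record layout)."""
    alert: str           # name of the alert that fired
    recipient: str       # who received the page
    needed_action: bool  # did the page require immediate action?


def removal_candidates(pages):
    """Return the alerts for which no page required immediate action."""
    by_alert = defaultdict(list)
    for page in pages:
        by_alert[page.alert].append(page)
    # Assumption: a single actionable page is enough to keep the alert.
    return sorted(
        alert
        for alert, group in by_alert.items()
        if not any(p.needed_action for p in group)
    )


# Illustrative data only, not taken from a real system.
week = [
    Page("disk-usage-high", "alice", False),
    Page("disk-usage-high", "bob", False),
    Page("frontend-5xx-rate", "alice", True),
    Page("frontend-5xx-rate", "bob", False),
]

print(removal_candidates(week))  # ['disk-usage-high']
```

The output is only a starting point for discussion; whether an alert is actually removed, demoted to a ticket or retuned remains a decision for the meeting.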

Collaboration

When discussing collaboration we must be mindful of the distributed nature of operations teams and the challenges of cross-team, cross-site or virtual team collaboration.

SREs bring value to an organisation through technical mastery over domains, and naturally seek means of gaining it in the context of a service. This usually leads to specialisation, which increases the odds of attaining mastery but at the cost of siloisation and ignorance of the broader picture.

Studies show that diversity in team makeup helps a team compensate for its members' individual cognitive biases and improves communication.

Team composition

SRE teams must fill the following roles, and may do so either through dynamic negotiation/adoption of the roles as required or through static assignment of roles to individuals:

  • The Tech Lead provides the team's technical direction using practices such as code review, quarterly direction presentations and building consensus.
  • Managers have two responsibilities beyond those of the Tech Lead: performance management and serving as a general catchall for anything not already handled by someone else.
  • Project Managers manage project delivery.

Whilst static assignment may result in quicker decision making, dynamic adoption generally produces more rounded individuals.

Internally

Singleton engineering projects usually fail unless the engineer is exceptionally gifted or the problem is straightforward. Assume you need multiple people.

Externally

In SRE, good work requires excellent communication skills across team boundaries. Great written communication should be supplemented by face-to-face meetings, including travel to remote teams.

Where it is appropriate for a given service, the SRE-developer engagement model should evolve to involve SREs earlier in the design. This saves later rework to retrofit the service for stable production operation.

Using OKRs can align roadmaps to ensure both teams sufficiently value reliability.

Knowledge sharing

SREs are skilled in more than one job function and switch between them as required, so cultivating knowledge sharing between team members is important to keep everyone sufficiently flexible. This has numerous benefits for both the individual and the organisation: reducing costs due to scheduling inefficiencies, improving morale, reducing turnover and boosting productivity.

Cost-effective means of fostering a culture of knowledge sharing include:

  • Employee-to-employee networks, where training takes place through self-organising teams that prepare materials and engage in one-on-one discussions where necessary.
  • Job shadowing and pairing, which expose engineers to different knowledge and experience, possibly across functions, and offer a psychologically safe environment in which to raise questions. Shadowing is ideal for creating gradual change, allows engineers to identify the nuances of particular roles and scales well.