A1 Engineering
How to build a healthy on-call culture

Strategies for Sustainable 24/7 Support Without Burnout

ARJUN RAO
August 23, 2024

💯 Importance of On-Call

As a company grows, the reliability of its platform and services comes under increasing scrutiny. As an Engineering team that places a premium on foundational principles like 12-factor app practices, production readiness reviews, robust testing, monitoring, and consistent peer reviews to ensure code health, we have many measures in place to keep bad deployments from being shipped to production. However, as any Engineer knows, while one can reduce the likelihood of erroneous releases, getting it down to zero is really hard, at least without trading off other dimensions like speed or cost.

Another source of risk is that we cannot control the behavior of our customers’ systems or the third-party services we use, like AWS or Datadog. Failures in those systems are very likely to affect the performance or behavior of our own. All of this means that as our company scales, we can do our best to prevent issues, but prevention will not always be possible.

Since we cannot prevent every issue, what we as a company should seek, in order to reduce the impact on our partners, is a lower Mean Time to Detect (MTTD) and Mean Time to Resolution (MTTR) for incidents. Getting both numbers as low as possible is the fundamental purpose of an on-call practice. The sooner you learn about issues, the sooner you can fix them. Fixing problems as quickly as possible not only stops them from causing more damage; it’s also easier and cheaper.
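To make the two metrics concrete, here is a minimal Python sketch of how they could be computed from incident timestamps. The `Incident` record, its field names, and the sample data are hypothetical. The sketch also assumes MTTR is measured from the start of the incident to its resolution; some teams measure it from detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; field names are illustrative only.
    started: datetime    # when the fault began
    detected: datetime   # when an alert or a person first noticed it
    resolved: datetime   # when service returned to normal


def mean_delta(deltas: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean Time to Detect: average gap between start and detection."""
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean Time to Resolution: average gap between start and resolution (assumed definition)."""
    return mean_delta([i.resolved - i.started for i in incidents])


# Made-up sample data: two incidents.
incidents = [
    Incident(datetime(2024, 8, 1, 2, 0), datetime(2024, 8, 1, 2, 10), datetime(2024, 8, 1, 3, 0)),
    Incident(datetime(2024, 8, 9, 14, 0), datetime(2024, 8, 9, 14, 20), datetime(2024, 8, 9, 14, 40)),
]
print(mttd(incidents))  # 0:15:00
print(mttr(incidents))  # 0:50:00
```

Tracking these two averages over time is one way to see whether alerting (MTTD) or response (MTTR) is the part that needs attention.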

📈 Expanding the On-Call rotation

As an Engineering team expands, it becomes increasingly important to ensure the broader team is available to be on call. There are several benefits to having on-call experience. Some of these include (but are not restricted to):

📝 It equips and motivates engineers to build reliable, well-documented systems. Seeing firsthand how things go wrong in production powers insights into how systems can be improved and made more robust.
🔄 It builds a healthy feedback loop between the development process and the production execution of systems.
📚 Supporting both their own and others’ systems is a great learning opportunity for engineers. It provides valuable hands-on experience with infrastructure such as Kubernetes, caches, databases, as well as experience diagnosing faults and making operational decisions.
🏋️‍♀️ Having a deep on-call roster reduces the burden of support on any one engineer and spreads the love across the entire team.

☎️ What it means to be on-call

Being on-call means that you can be contacted at any time to investigate and fix issues that arise in the systems you cover. For example, if you are on-call for a set of services and an alarm triggers for one of them, you will receive a “page” (an alert sent to your mobile device, an email, a phone call, an SMS, etc.) giving you details on what is broken and how to fix it. You will be expected to take whatever actions are necessary to resolve the issue and return your service to a normal state.

On-call responsibilities extend beyond normal office hours: if you are on-call, you are expected to be able to respond to issues, even at 2am on a long weekend. This sounds horrible (and it can be), but our customers live through these incidents too, and rapid incident resolution is critical to the stability of the platform!

Every team needs a healthy work-life balance, and that includes the ability to take PTO to rest and recharge. Folks who go on-call should work with their managers and peers to ensure someone else can swap in for them during their time off. You don’t want employees to burn out for lack of R&R, but you also don’t want gaps in your on-call rotation.
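As a rough illustration of catching PTO conflicts before they become gaps, the sketch below builds a weekly round-robin rotation, flags shifts that overlap the assigned engineer's time off, and swaps two shifts. Everything here is hypothetical: the function names, the engineer names, and the one-week shift length are assumptions for the example, not a description of any particular scheduling tool.

```python
from datetime import date, timedelta


def build_rotation(engineers: list[str], start: date, weeks: int) -> dict[date, str]:
    """Assign one engineer per week, round-robin, keyed by the shift's start date."""
    return {
        start + timedelta(weeks=n): engineers[n % len(engineers)]
        for n in range(weeks)
    }


def find_conflicts(rotation: dict[date, str], pto: dict[str, set[date]]) -> list[date]:
    """Return shift start dates whose week overlaps a PTO day of the assigned engineer."""
    conflicts = []
    for shift_start, engineer in rotation.items():
        shift_days = {shift_start + timedelta(days=d) for d in range(7)}
        if shift_days & pto.get(engineer, set()):
            conflicts.append(shift_start)
    return conflicts


def swap(rotation: dict[date, str], a: date, b: date) -> None:
    """Swap the engineers assigned to two shifts, in place."""
    rotation[a], rotation[b] = rotation[b], rotation[a]


# Made-up roster: three engineers, three one-week shifts starting on a Monday.
rotation = build_rotation(["asha", "ben", "chen"], date(2024, 9, 2), weeks=3)
pto = {"ben": {date(2024, 9, 11)}}  # ben is away during his own shift week

conflicts = find_conflicts(rotation, pto)            # ben's shift starting 2024-09-09
swap(rotation, date(2024, 9, 9), date(2024, 9, 16))  # ben trades weeks with chen
remaining = find_conflicts(rotation, pto)            # no conflicts left
```

Running a check like `find_conflicts` whenever PTO is approved would surface the swap conversation early, while there is still time to find cover.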

💪 Responsibilities of on-call staff

Responding to alerts that fire for services your team supports, both during and outside business hours.
Keeping on-call and troubleshooting documentation up to date so that all on-call staff benefit from your findings going forward.
Answering inbound questions from stakeholders about system behavior when they are directed to an entire channel (Slack or otherwise) and to no one in particular.

🙅 NOT responsibilities of on-call staff

There is no expectation that you fix every issue by yourself. We understand that nobody here is an expert in everything. Handle what you can and do the necessary investigation; if you can’t resolve the issue on your own, ask for help. Do not hesitate to ask for support or escalate if needed.
Communicating with all stakeholders to make them aware of an ongoing issue. Generally this should be its own dedicated role — companies like PagerDuty call this Incident Commander.

🤝 Building a supportive on-call culture

When responding to incidents, engineers can always ask for help by paging other engineers. It’s a cornerstone of a healthy, high-trust on-call culture. Never feel ashamed to rope in someone else if you’re not sure how to resolve an issue. Likewise, never look down on someone else if they ask you for help. Nobody enjoys getting a support page in the middle of the night, but responding (if possible) is an important act of empathy for the engineer who needs help. It’s also an investment in training them to handle the situation autonomously in the future.
If you are paged for an issue during your on-call shift, you are responsible for resolving it, even if it takes 3 hours and there’s only 1 hour left of your shift. You can hand over to the next on-call if they agree, but you should never assume that’s possible.
Always consider covering an hour or so of someone else’s on-call time if they request it and you are able. We all have lives which might get in the way of on-call time, and one day it might be you who needs to swap their on-call time in order to have a night out with your friend from out of town.
The most important cultural practice, which influences all the others, is fostering a learning culture rather than a blame culture. Learning from mistakes builds a stronger, more technically proficient engineering organization. Punishing people for mistakes makes engineers afraid to act in new situations, afraid to ask for help when they need it, and afraid to be transparent.

💥 On-Call <> Incident Management

On-call rotations and Incident Management go hand in hand, although they are not one and the same. To learn more about my thoughts on Incident Management, you can read this article.

On-call staff could don the mantle of Incident Commander if they have been trained for the role, but they are not expected to do so by default. The main responsibility of on-call staff is (as mentioned above) to be the first line of defense and to rally the troops as required, based on their initial investigation.

👋 Parting thoughts

As the number of services within a company grows, we need to figure out the best way to set up our on-call staff for success. Whether that means better documentation, mapping services to specific people based on Contributor status, or something else that makes on-call an easier experience, finding it will be an ongoing process.

Building a strong engineering culture and a strong on-call culture through healthy feedback loops, transparency, and mutual respect is critical to the success of a company and the growth of our teams.
