For George Michael.
Posted by Rick Cusick on Sunday, December 25, 2016
(A guest-post I wrote for the Scalyr blog)
One of the inevitable joys of working in DevOps is “the page”, that dreaded notification from your alerting system that something has gone terribly wrong…and you’re the lucky person who gets to fix it.
At Scalyr, we’ve got a few decades of collective DevOps experience and we’ve all been on the receiving end of a page. Even though we do our best to avoid being woken up, it happens.
In this post, we’re going to put some of that experience to use and show you how to handle an incident the right way. You’ll learn not only how to fix the immediate problem, but how to grow from the experience and set your team up for smooth sailing in the future.
A Bit of Context
It’s worth first acknowledging that all advice – this included – is contextual. The processes that would slow you down and frustrate you as an ops team of 5 people may be just right when you have 50, and not enough when you have 500 (let alone 5000).
The advice here is aimed at organizations with an ops team of at least several engineers, a couple of engineering teams, and a client base that relies on you but doesn’t necessarily expect you to deliver five nines.
Time Now Pays Off Later
Whether from instinct, training, or history, your first temptation is likely to charge in and fix the problem as quickly as humanly possible. Drop everything, lock the doors, hide the cat, get the site back up NOW.
Resolving the issue is indeed a top priority, but it’s not the only priority. Learning, communicating, and preventing a recurrence are equally worthwhile goals. And getting there may mean taking some counterintuitive actions. Preventing future downtime can sometimes mean taking a few extra (precious) minutes during the event to gather information, record steps taken, and communicate with your team.
This might feel frustrating in the moment. But “that which does not kill me makes me stronger” works a lot better if you come out having learned something.
Communicate, Learn, Act
The pattern we use is “Communicate -> Learn -> Act”. We organize our work into each of these areas and iterate until the incident is resolved.
Communicate means what you think: share information. Logs, user reports, instincts, whatever. But don’t act in a vacuum, especially in the heat of the moment. And this doesn’t just apply to teams: a note to Future You can be just as valuable as a full-on incident report to your teammates.
Learn means gather information. New logs. New errors. System updates. Resolutions. Why things changed. Why things didn’t change. What worked and what didn’t.
Act means take corrective action, both short- and long-term.
I’m On It
The first step in an incident response is to acknowledge the page and communicate to your team that you’re on the issue.
The most common scenario is a notification from your alerting system. But there are other avenues: A customer can reach out on social media, or a teammate can simply send a text because something just, well, seems funky.
A good alerting system will have a method for acknowledging the issue, thus letting everyone know you’re on the case. But absent that, take responsibility and notify your teammates and the folks who reported the issue. And do this before you take corrective action.
(Side note: If you’re not happy with the state of your alerting rules, or aren’t sure you have good coverage, we published a comprehensive guide on how to set alerts.)
This is also a good gut check moment. There are some problems you just know are going to be serious almost instantly. Don’t be afraid to wake someone up. Sometimes a “Boss Level Event” requires a team effort.
To summarize the first steps:
- acknowledge the page when it comes in,
- confirm you are taking the lead in responding, and
- gut check if you need to bring in more resources.
Get Your Learn On
Now you’re in your first learning phase. Your first goal is to determine the scope of the issue. There are two relevant dimensions here: quantitative, meaning the size of the issue (the number of affected hosts, services, or users), and qualitative, meaning the severity of the issue (e.g. is the site down, or just a bit slow?).
Your second learning goal gets to the heart of system administration — finding the root cause. This is where you use every tool at your disposal – logs, system metrics, monitors, and alerts – to find out what happened, what’s still happening, and why. (Obligatory plug: if you’re not happy with your tools here, we can probably help.)
A good technique is to open an email and paste the details you uncover at each step straight into the body. This helps you keep track of your investigation and makes it easy to write your after-action report. Alternatively, if you use a group chat system you can open up an incident-specific channel for a real-time discussion of everything related to the investigation. Then you can refer to the chat log when writing the post-mortem.
To summarize the learning phase:
- gather data on the scope of the problem,
- use your tools to perform root cause analysis, and
- document as you go.
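If you would rather keep the running notes in a script than in an email draft, the “document as you go” habit can be sketched in a few lines of Python. This is only an illustration: the `IncidentLog` class, its `note` and `timeline` methods, and the phase names are hypothetical, chosen to mirror the Communicate, Learn, Act loop, and the timestamps are assumed to be timezone-aware UTC.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentLog:
    """Hypothetical in-memory timeline for a single incident."""
    title: str
    entries: list = field(default_factory=list)

    def note(self, phase, text, when=None):
        # phase must be one of the three loop steps described in this post.
        if phase not in ("communicate", "learn", "act"):
            raise ValueError(f"unknown phase: {phase}")
        # Default to the current time in UTC; callers may pass their own
        # timezone-aware UTC timestamp instead.
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, phase, text))

    def timeline(self):
        # Render entries oldest first, ready to paste into a post-mortem.
        lines = [f"Incident: {self.title}"]
        for when, phase, text in sorted(self.entries, key=lambda e: e[0]):
            lines.append(f"{when:%Y-%m-%d %H:%M:%S}Z [{phase}] {text}")
        return "\n".join(lines)
```

Calling `log.note("learn", "Error rate at 12% on 3 hosts")` at each step and printing `log.timeline()` at the end gives you the same chronological record the email or chat channel would.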
Alright Mr. DeMille, I’m Ready for my Close-up
Now it’s time to act. You may be so lucky as to have a playbook for the issue you uncovered. If not, look on the bright side: here’s a perfect opportunity to start or add to your playbook collection! Over time, your team should be able to build up a good set of instructions for responding to most incidents.
Playbooks should aim to document procedures and participants, not step-by-step incident resolution. If your organization is stable enough to keep hitting the same problems over and over but, instead of fixing them for good, relies on a human to do step-by-step remediation via playbook, it can be extremely frustrating for the team over time.
Of course, there will be incidents where no procedure has been established and you’re treading ground that’s unfamiliar to you. In these cases, know your limits and be willing to ask for help.
The largest risk you face is that the issue worsens during (or because of) your response. So it’s critical that you understand the scope of your actions. Take one action at a time and confirm that it produced the expected response before proceeding. Always be ready to retrace your steps, and pay particular attention to actions that might be irreversible.
To summarize response actions:
- review existing playbooks, or, if none, start building your collection,
- confirm the effect of all actions that you take, and
- limit risk by having a path to undo changes.
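For changes that are scripted, the “confirm, then proceed, and keep a path back” idea can be sketched as below. This is a minimal illustration under stated assumptions, not a prescribed tool: `Step` and `run_steps` are hypothetical names, and each step’s `apply`, `verify`, and `undo` callables are whatever your own environment provides.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One remediation step; all three callables are supplied by the responder."""
    name: str
    apply: Callable[[], None]
    verify: Callable[[], bool]  # returns True if the step had the expected effect
    undo: Callable[[], None]


def run_steps(steps):
    """Apply steps in order; on the first failed check, undo in reverse order.

    Returns (True, None) if every step verified, otherwise
    (False, name_of_the_step_that_failed_verification).
    """
    done = []
    for step in steps:
        step.apply()
        done.append(step)
        if not step.verify():
            # Roll back everything applied so far, most recent first.
            for prev in reversed(done):
                prev.undo()
            return False, step.name
    return True, None
```

The sketch does not handle an `apply` that raises an exception, and it cannot help with steps that have no meaningful `undo`. Those irreversible actions are the ones worth pausing on before you run them.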
Closing a Loop / Opening a Loop
Now you’re back at the top of the “Communicate -> Learn -> Act” loop.
Tell your team why you took the actions you did. Tell them what you expected each action to do and whether it worked. Communicate in as close to real-time as possible to create a natural recorded timeline — invaluable when it comes to writing up the post-mortem.
Keep your leadership team informed. Good management is about supporting the mission, and your personal mission during an incident is to get everything running again. Let your management team know what’s happening, how and in what ways users will be impacted, and what steps are being taken to fix things.
Keep your users informed. If the problem hasn’t been resolved, make sure you let your users know that you know there’s a problem. Frustrated customers often turn to social media if something’s up with your application, so the more you can get ahead of the issue, the better. Share the status of your progress and share when users can expect the next update. Then make sure you hit your deadlines (and, if you can’t, communicate that too!).
Providing information at regular intervals gives customers the confidence they need to manage expectations on their side. If they know the next update on the situation is coming in 2 hours, they can spend the time elsewhere in their lives instead of wondering (or vocally complaining about) when more information is coming.
In summary – things to remember when closing the loop on communication:
- document why each action is taken,
- keep your team informed,
- keep leadership informed,
- keep users informed,
- establish a communication timeline, and
- provide updates at regular intervals.
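One small way to hold yourself to a communication timeline is to put the next update time into every status message. The helper names below (`status_update`, `is_overdue`) are hypothetical, and the two-hour default simply mirrors the example interval above. Treat this as a sketch rather than a recommendation for any particular status-page tool.

```python
from datetime import datetime, timedelta, timezone


def status_update(message, interval=timedelta(hours=2), now=None):
    """Return (text, next_update): a status line that promises the next update time.

    `now` is assumed to be a timezone-aware UTC datetime; it defaults to the
    current time.
    """
    now = now or datetime.now(timezone.utc)
    next_update = now + interval
    text = f"[{now:%H:%M} UTC] {message} Next update by {next_update:%H:%M} UTC."
    return text, next_update


def is_overdue(promised, now):
    # True once the promised update time has passed without a new post.
    return now > promised
```

Keeping the returned `next_update` around lets whoever is leading the response check `is_overdue` and post again, even if the only news is that there is no news yet.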
This Too Shall Pass / There Will Be a Next Time
The good news is this: All issues eventually get resolved (unless you just give up and quit. Please don’t do that.) You and the team will pull through. It’s the rare post-mortem that ends with “and so then we all went to find other jobs”.
The important next step is to future-proof your system. Most incidents are repeat incidents; if you can prevent the repeats, you’ll avoid most of your pages.
A relatively mature DevOps team should have the goal of never getting paged about the exact same thing more than once. If you’re at an early-stage company, it could be too heavyweight to expect this — things will be changing too quickly. But as your organization matures, keep this in mind.
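A quick way to see how far you are from that goal is to count repeats in your page history. The sketch below assumes you can export past pages as a list of alert identifiers. The `repeat_pages` function and the `"alert:host"` naming are hypothetical, not the format of any particular alerting system.

```python
from collections import Counter


def repeat_pages(page_history):
    """Return (alert, count) pairs for alerts that paged more than once,
    most frequent first."""
    counts = Counter(page_history)
    return [(name, n) for name, n in counts.most_common() if n > 1]


# Hypothetical export of past pages, oldest first.
history = [
    "disk_full:db1",
    "high_latency:api",
    "disk_full:db1",
    "cert_expiry:www",
    "disk_full:db1",
    "high_latency:api",
]
print(repeat_pages(history))
```

Anything that shows up in the result is a candidate for a permanent fix rather than another playbook entry.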
Take some time to reflect on what happened, your response, the team’s response, and how things were resolved. This is critical to improving the application. Even if the incident is fresh in your mind and a formal post-mortem seems excessive, do it. Memories fade quickly.
When writing the post-mortem, make note of things that worked (so they can be repeated next time). Note things that didn’t work (so they can be avoided next time). And make note of what you learned – how you can improve, and how your team can improve. And then – most importantly – put that knowledge to use. Build new playbooks, improve existing playbooks, and add new data sources so the next time a page happens, you and your team will have everything needed to succeed.
Great teams turn every incident into an opportunity to build trust in the company, build trust in the team, and build a stronger product.
Finally – go back and review that initial alert. How and when did it arrive? Was the information it provided useful? Did it catch the core issue? In our guide on how to set alerts, one of the most critical phases we cover is iterating & tuning. A real-life incident is a perfect opportunity to make improvements and optimize your alerts.
To sum up, when things settle:
- schedule a retrospective with the team to review the event,
- identify what went well and what needs to change,
- write a post-mortem highlighting what happened and what you’ve learned,
- create and update documentation for future events, and
- use the incident to tune your alerts.
In upcoming articles, we’ll dig deeper into both “the before” — how to structure your DevOps organization to best respond to incidents, and “the after” — how to structure your post-mortems to get the most out of what’s happened.
Have any questions or comments about any of this? We’d love to hear from you! Drop a comment below, and let’s make incident response better and more effective for everyone.
Special thanks to Charity Majors, who provided excellent feedback on drafts of this post.