From Operations Anti-Patterns, DevOps Solutions by Jeffery D. Smith
When something unexpected or unplanned occurs that creates an adverse effect on the system, I define that event as an incident. Some companies reserve the term incident for large catastrophic events, but with this broader definition you get to increase the learning opportunities on your team when an incident occurs. As mentioned previously, at the center of DevOps is this idea of continuous improvement. Incremental change is a win in a DevOps organization, but the fuel that powers that continuous improvement is continual learning. That means learning about new technologies, existing technologies, how teams operate, how teams communicate, and how all these things interrelate to form the human-technical systems that are engineering departments.
One of the best sources of learning isn’t when things go right, but when they go wrong. When things are working, what you think you know about the system and what’s true in the system aren’t necessarily in conflict. Imagine you have a car with a 15-gallon gas tank. For some reason, you think the gas tank holds thirty gallons, but you have a habit of filling up after you burn through about ten gallons. If you do this religiously, your understanding of the size of the gas tank never comes into conflict with the reality of the gas tank being only fifteen gallons. You might make hundreds of trips in your car without ever learning a thing, but the minute you decide to take that long drive, you run out of gas at fifteen gallons. Before long you realize the folly of your ways and you start taking the appropriate precautions now that you have this newfound information. You can do a few things with this information. You can dig deep to understand why your car ran out of gas at gallon fifteen or you can say “Welp, I better start filling up every five gallons now, to be safe.” You’d be amazed how many organizations opt to do the latter.
Many organizations don’t go through the mental exercise of understanding why the system performed the way it did and how it can improve. Incidents are a definitive way to prove if your understanding of the system matches reality. By not doing this exercise, you’re wasting the best parts of the incident. The failure to learn from such an event can be a disservice to future efforts.
The learnings from system failures don’t always come naturally. They often need to be coaxed out of the system and team members in an organized, structured fashion. This process goes by many names: after-action reports, incident reports, and retrospectives are a few, but I personally use the term post-mortem.
In this article I discuss the process and structure of the post-mortem, as well as how to get a deeper understanding of your systems by asking deeper, more probing questions about why engineers decided to take the action that they did.
The components of a good post-mortem
Whenever there’s an incident of enough size, people begin to play the blame game. People try to distance themselves from the problem, erect barriers to information, and generally become helpful only to the point of absolving themselves of fault. If you see this happening in your organization, then it’s likely that you live in a culture of blame and retribution. By that I mean the response to an incident is to find those who are responsible for “the mistake” and to make sure that they’re punished, shamed, or sidelined appropriately. Afterwards you’ll heap on a little extra process to make sure that someone must approve the type of work that created the incident. With a feeling of satisfaction, everyone walks away from the incident knowing that this particular problem won’t happen again, but it always does.
The blame game doesn’t work because it treats the people as the problem. If people had been better trained. If more people were aware of the change. If someone had followed the protocol. If someone hadn’t mistyped that command. And to be clear, these are all valid reasons why things might go wrong, but they don’t get to the heart of why that activity (or lack of it) created such a catastrophic failure. Let’s take the training failure as an example. If the engineer wasn’t trained appropriately and made a mistake, you should ask yourself, “Why weren’t they trained?” Where would they have gotten trained? Was the lack of training due to the engineer not having enough time? If they weren’t trained, why were they given access to the system to perform something they weren’t ready to perform? The pattern with this line of thinking is that you’re discussing problems in the system versus problems in the individual. If your training program is poorly constructed, then blaming this engineer doesn’t solve the problem, because the next wave of hires might experience the same problem. Allowing someone who might not be qualified to perform a dangerous action might highlight a lack of systems and security controls in your organization. Left unchecked, your system continues to produce employees who are in a position to make this same mistake.
In order to move away from the blame game, you must begin thinking about how your systems, processes, documentation and understanding of the system all contribute to the incident state. If your post-mortems turn into exercises of retribution, then not only will no one participate, but you’ll also lose an opportunity for continued learning and growth.
Another side effect of a blameful culture is a lack of transparency. Nobody wants to volunteer to get punished for a mistake they made. Chances are they’re already beating themselves up about it, but combine that with the public shaming that often accompanies blameful post-mortems and you have built-in incentives for people to hide information about incidents or specific details about an incident. Imagine an incident that was created by an operator making a mistake entering a command. The operator knows that if they admit to this error, there will be some punishment waiting for them. If they have the ability to sit silently on this information, they’re much more likely to do so while the group spends a large amount of time attempting to troubleshoot what happened.
A culture of retribution and blamefulness creates incentives for employees to be less truthful. The lack of candidness hinders your ability to learn from the incident and also obfuscates the facts of the incident. A blameless culture, where employees are free from retribution, creates an environment much more conducive to collaboration and learning. With a blameless culture, the attention shifts from everyone attempting to deflect blame, to solving the problems and gaps in knowledge that led to the incident.
Blameless cultures don’t happen overnight. It takes quite a bit of energy from co-workers and leaders in the organization to create an environment where people feel safe from reprisal and can begin having open and honest discussions about mistakes that were made and the environment in which they were made. You, the reader, can facilitate this transformation by being vulnerable and being the first to share your own mistakes with the team and the organization. There must always be someone who goes first, and because you’re reading this article, that person is probably going to be you.
Creating mental models
Understanding how people look at systems and processes is key to understanding how failure happens. When you have an idea of how a system works, you’ve created a mental model of it, and you create one for just about every system that you interact with.
DEFINITION A mental model is an explanation of someone’s thought process about how a thing works. The mental model might detail someone’s perception of the relationship and interaction between components, as well as how the behavior of one component might influence other components. A person’s mental models can often be incorrect or incomplete.
Unless you’re a total expert on that system, it’s reasonable to assume that your model has gaps in it. An example is that of a software engineer and their assumptions of what the production environment might look like. The engineer is aware that there’s a farm of web servers, a database server, and a caching server. They’re aware of these things because those are the components that they touch and interact with on a regular basis, both in code and in their local development environments. What they’re probably unaware of is all the infrastructure components that go into making this application capable of handling production-grade traffic. Database servers might have read replicas, and web servers probably have a load balancer in front of them and a firewall in front of that. Figure 1 shows an engineer’s model versus the reality of the system.
Figure 1. The engineer’s mental model versus reality
It’s important to acknowledge this discrepancy not only in computer systems, but in processes as well. The gap between expectations and reality is a breeding ground for incidents and failures. Use the post-mortem as an opportunity to update everyone’s mental model of the systems involved in the failure.
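To make the gap between a mental model and reality concrete, here is a minimal sketch that compares the two as sets. The component names come from the example above; the function name `model_gaps` and the dictionary keys are hypothetical, chosen only for illustration. A real inventory would come from your own infrastructure tooling, not a hand-written list.

```python
from __future__ import annotations

# Components the engineer believes exist (from the example in the text).
engineer_model = {"web servers", "database server", "caching server"}

# Components that actually exist in production (from the example in the text).
production_reality = {
    "web servers",
    "database server",
    "caching server",
    "database read replicas",
    "load balancer",
    "firewall",
}


def model_gaps(model: set[str], reality: set[str]) -> dict[str, list[str]]:
    """Return what the mental model is missing and what it wrongly assumes."""
    return {
        # Things that exist but the person does not know about.
        "missing_from_model": sorted(reality - model),
        # Things the person believes exist but do not.
        "not_in_reality": sorted(model - reality),
    }


gaps = model_gaps(engineer_model, production_reality)
for kind, components in gaps.items():
    print(kind, components)
```

In a post-mortem, each item in either list is a candidate talking point: it marks a place where someone’s understanding and the running system disagreed.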
The 24-hour rule
The 24-hour rule is simple. If you have an incident in your environment, you should have a post-mortem about that incident within 24 hours. There are a few reasons for this. For starters, the details of the situation begin to evaporate as more time passes between when the incident occurs and when it is documented. Memories fade and nuance gets lost. When it comes to incidents, nuance makes all the difference. Did you restart that service before this error occurred or after? Did Sandra implement her fix first or was it Brian’s fix first? Did you forget that the service crashed the first time Frank started it back up? What could that mean? These little details may not mean much when you’re attempting to figure out what solved the issue, but they definitely matter in understanding how the incident unfolded and what you can learn from it.
Another reason to do the post-mortem within 24 hours is to be sure that you’re using the emotion and the energy behind the failure. If you’ve ever had a minor car accident, you become super alert following it. That level of alertness and intensity sticks around for a certain amount of time, but sooner or later, you begin to fall back into your old habits. Before long, the sense of urgency has faded and you’re back to driving without a hands-free unit and responding to text messages at stoplights. Now imagine if you could instead use that short period of heightened awareness to put real controls in your car that prevent you from doing those poor or destructive actions in the first place. This is what you’re trying to do with the 24-hour rule. Seize the momentum of the incident and use it for something good.
When you have an incident, there’s usually a lot of pent-up energy around it because it’s an event that is out of the ordinary and typically someone is facing pressure or repercussions from the failure. The more time that passes, the more the sense of urgency begins to fade. Get the ball rolling on follow-up items to the incident within the first 24 hours while it’s still something of note to your team members.
Lastly, having the post-mortem within 24 hours helps to ensure that a post-mortem document gets created. Once the document is created, not only can it be widely circulated for others to learn about the failure, but it also serves as a teaching tool for engineers of the future. Again, incidents have mounds of information in them, and being able to document a failure in enough detail could go a long way toward training engineers in the future. (It could also make an interesting interview question for future engineers.)
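If you want a small nudge to keep to the 24-hour rule, a sketch like the following could compute the deadline and flag an overdue post-mortem. This is only an illustration under stated assumptions: the names `post_mortem_deadline` and `is_overdue` are hypothetical, and the clock is assumed to start when the incident occurs (your team might prefer to start it when the incident is resolved).

```python
from datetime import datetime, timedelta, timezone

# The 24-hour window described in the text.
POST_MORTEM_WINDOW = timedelta(hours=24)


def post_mortem_deadline(incident_at: datetime) -> datetime:
    """Latest time the post-mortem should be held (assumed: 24h after the incident)."""
    return incident_at + POST_MORTEM_WINDOW


def is_overdue(incident_at: datetime, now: datetime, held: bool = False) -> bool:
    """True if no post-mortem has been held and the 24-hour window has passed."""
    return not held and now > post_mortem_deadline(incident_at)


# Example with made-up timestamps (timezone-aware, UTC).
incident_at = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
deadline = post_mortem_deadline(incident_at)
print("Hold the post-mortem by:", deadline.isoformat())
print("Overdue?", is_overdue(incident_at, datetime(2024, 1, 2, 10, 0, tzinfo=timezone.utc)))
```

Using timezone-aware timestamps avoids ambiguity when responders are spread across regions, which matters when the window is only a day long.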
The rules of the post-mortem
Like any meeting, a post-mortem needs certain guidelines in order to be successful. It’s important that you walk through these rules, in detail, prior to any post-mortem meeting. The rules are designed to create an atmosphere of collaboration and openness. Participants need to feel at ease and comfortable with admitting to gaps in their knowledge or their understanding of the system. There are plenty of reasons team members might feel uncomfortable sharing this lack of expertise. You might have a company where the culture is to shun those who display even the slightest hint of lacking complete expertise. Your company culture might demand a level of perfection that is unrealistic, leaving team members who make mistakes or aren’t complete experts on a topic feeling inadequate. It’s also not uncommon for team members, for reasons of their own, to have this feeling of inadequacy. These negative emotions and experiences are blockers to total learning and understanding. You need to do your best to put those emotions to rest. This is the goal of these rules and guidelines.
- Never criticize a person directly. Focus on actions and behaviors.
- Assume that everyone did the best job they could with the information that was available to them at the time.
- Be aware that things may look obvious now but could have been obfuscated in the moment.
- Blame systems, not people.
- Remember that the ultimate goal is understanding all the pieces that went into the incident.
These rules help to focus the conversation where it belongs, on improving the system, and hopefully keep you out of the finger-pointing blame game. It’ll be up to the meeting facilitator to ensure that these rules are always followed. If someone breaks a rule, even once, it can serve as a signal to the other participants that this is like any other meeting where management is looking for a “throat to choke.”
That’s all for this article.