You’ve been there. Someone on your team just screwed it up. Your production website went down in the middle of the night, and it took hours to bring it back up. It’s 10am the next day, you’re at your daily standup, and the culprit is looking down, ashamed and quiet; the team is noticeably uncomfortable and is expecting you, their leader, to scream and shout about business impact and accountability and how bad this all is.
You’re upset. The outage already cost your group some reputation — you’re seeing tweets and a message from the investor, and you have no idea how something this dumb could have been overlooked.
You can allow your emotions to take over. You can do the screaming, you can shame the perpetrator, who will undoubtedly remember this occasion and probably won’t make a mistake of this kind again. You will scare others at the standup enough for them to be afraid of their own shadow for the next week.
Or you can take a breath.
And ask yourself:
What impact do I want to make on this group right now? How do I want this group to be different 15 minutes from now?
There’s only one thing that matters when the sky has already fallen: how you prevent the issue from happening again. How can you make sure that your group, your system, your infrastructure is stronger as a result of this incident, and the same screwup cannot happen for *structural reasons*? That a repeat is almost physically impossible?
So utter the following words to the team: “How can we make sure an issue like this can never hit us in the future?” Here are some reasonable answers for typical situations:
- Website went down and no one noticed? Add Pingdom monitoring and have the text messaging alerts go to the entire team.
- Someone checked in a ridiculous bug that broke half the service? Add automation that tests a big chunk of your service end-to-end.
- Your email newsletter was sent five times to each recipient? Monitor the number of emails sent to every recipient in the database. If anyone received more than one email in the last 24 hours, automatically turn off the master switch.
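The newsletter guard in the last bullet could look something like the sketch below. It assumes a send log of (recipient, timestamp) pairs and a hypothetical `MasterSwitch` object; neither name comes from a real library. In practice the log would come from your database and the switch would be whatever flag your mailer checks before sending.

```python
from collections import Counter
from datetime import datetime, timedelta


class MasterSwitch:
    """Hypothetical kill switch that the mailer checks before every send."""

    def __init__(self):
        self.enabled = True
        self.reason = None

    def turn_off(self, reason):
        self.enabled = False
        self.reason = reason


def recipients_over_limit(send_log, now, window=timedelta(hours=24), limit=1):
    """Return recipients who got more than `limit` emails within `window` of `now`.

    `send_log` is an iterable of (recipient, sent_at) pairs (an assumed shape).
    """
    cutoff = now - window
    counts = Counter(
        recipient for recipient, sent_at in send_log if cutoff < sent_at <= now
    )
    return sorted(r for r, n in counts.items() if n > limit)


def check_and_trip(send_log, switch, now):
    """Turn the master switch off if anyone was emailed more than once in 24 hours."""
    offenders = recipients_over_limit(send_log, now)
    if offenders:
        switch.turn_off(
            f"{len(offenders)} recipient(s) got more than one email in 24 hours"
        )
    return offenders
```

Run `check_and_trip` on a schedule (say, every few minutes) and have the mailer refuse to send whenever `switch.enabled` is false. It is crude on purpose: a person turns the switch back on after looking at what happened.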
You’ll notice that each of these remedies is heavy-handed, a crude overkill, and intentionally so. This serves the second most important rule of “the day after the sky has fallen”: your structural, automated precaution against a repeat must be implemented the same day. The culprit drops what they are doing (heck, they were kicked out of their context by the outage already) and works tirelessly to implement it. Over time, the crude solution can evolve into something a little more elegant.
Final rule for the day: no screaming, no shaming, no humiliating public speeches about accountability. Provided that you don’t have a talent issue, the culprit feels terrible already. By focusing on the solution and making the team stronger, you not only inoculate the team against a repeat; you also motivate them by staying in the trenches with them instead of talking down to them. This cements a culture of continuous improvement and collaboration instead of fear.
(This article was originally published on VentureBeat.)