(Hint: it’s about your team.)
A couple of weeks ago I accidentally replaced our live production database with a 17-hour-old snapshot.
This is an always-on application with users around the globe, so the mistake had likely blown away some newly entered user data.
I didn’t realize what I had done for an hour or so (I thought I had targeted a test database server, not production). By the time it hit me, I had already left work. Here’s how we handled it, with an emphasis on the “good engineering team culture” aspects:
1. I immediately shared my realization of the crisis with the team. I didn’t try to fix it myself, or pretend I didn’t know what had happened. I could do this because I knew the team had my back.
2. Available team members immediately dove into confirming, assessing, and mitigating the problem. (I was in transit and couldn’t yet get to a computer.) The focus was on minimizing pain for our users and the business, not on blame, resentment, or face-saving.
3. Our UX person’s user-monitoring tools gave us critical information about which users had potentially lost data. We shared knowledge.
4. We didn’t think we had a backup more recent than the snapshot I had used — but one of the engineers had been making more frequent snapshots as part of a new project. He had been done with work for hours (he’s in a different time zone), but when he saw the chatter on Slack he jumped in to help. He didn’t say, “not my job.”
5. After we reached a stable state, people signed off, but I stayed on to double-check things and write up a summary to broadcast to the team. Communication is key.
The next day, we scheduled a postmortem meeting to discuss the incident. This is a standard practice that’s very important for building teams that can learn and grow from mistakes. It’s “blameless” — the focus is on **what happened, how we responded, what the business impact was, and what we can do to reduce the chance of recurrence.** An important part of prevention is making measures more concrete and realistic than “try not to make that mistake.” In the end we lost only about 90 minutes of database history, and we accounted for all user data added in that period.
I made a bad mistake, the team rose to the occasion, we were lucky to have good mitigation options, and we are making changes to reduce the chance of the mistake happening again. Win.