How it should go when you screw up

A couple of weeks ago I accidentally replaced our live production database with a 17-hour-old snapshot.

I had intended to target a test server, not production. I didn’t realize what I had done for an hour or so, and by that time I had already left work. (I was actually out walking my dog. Not a good setting for managing a production crisis.)

Here’s how my team and I handled it.

  1. I immediately shared my realization with the team via Slack on my phone.

  2. Available team members dove into confirming, assessing, and mitigating. I followed the threads on Slack on my phone while I hustled home. Our focus was on minimizing pain for our users and the business.

  3. At first we thought that the 17-hour-old snapshot I used was our newest backup. But fortunately, one of the engineers had been making more frequent snapshots as part of a new project. He had been done with work for hours (he’s in a different time zone), but when he saw the chatter on Slack he jumped in to help. We used his backup.

  4. That left us with a 90-minute gap of lost data. Luckily, our user monitoring tool (FullStory) gave us critical info on which users had potentially lost data, i.e., the ones who entered data after that last backup and before my DB overwrite.

  5. After we had reached a stable state, I stayed on to double-check things and write up a summary to broadcast to the team (including those who hadn’t been online).

  6. We individually contacted the handful of users affected.

  7. The next day, we had a postmortem meeting to discuss the incident. This practice is important for building teams that can learn from mistakes. It’s “blameless” — the focus is on what happened, how we responded, what the business impact was, and what we can do to reduce the chance of recurrence.

An important part of prevention, in my book, is taking steps more concrete and realistic than “try not to make that mistake again.” In this case we focused on making it harder to accidentally change the production DB.
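To make "harder to accidentally change the production DB" a little more concrete, here is a minimal sketch of one possible guard, not the one we actually built. It assumes a hypothetical naming convention in which production hosts contain "prod" or "production" as a separate name segment, and the names `is_production` and `check_restore_target` are invented for illustration. The idea is that a restore script refuses to touch a production host unless the operator retypes the exact host name.

```python
import re
from typing import Optional

# Assumption: production hosts contain "prod" or "production" as its own
# name segment (e.g. "db.prod.example.com"). Adjust to your own convention.
PRODUCTION_PATTERN = re.compile(
    r"(^|[.\-_])prod(uction)?([.\-_]|$)", re.IGNORECASE
)


class ProductionTargetError(RuntimeError):
    """Raised when a destructive action targets production unconfirmed."""


def is_production(host: str) -> bool:
    """Return True if the host name looks like a production host."""
    return PRODUCTION_PATTERN.search(host) is not None


def check_restore_target(host: str, confirmation: Optional[str] = None) -> None:
    """Refuse a restore against production unless the caller retyped the host.

    Non-production hosts pass without confirmation. For production hosts,
    `confirmation` must equal `host` exactly.
    """
    if is_production(host) and confirmation != host:
        raise ProductionTargetError(
            f"Refusing to restore onto {host!r}: retype the full host name "
            "to confirm you really mean production."
        )
```

A restore script would call `check_restore_target(host, confirmation)` before doing anything destructive, with `confirmation` read from an interactive prompt. A name check like this is only one layer. Separate credentials for production are a sturdier complement.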

I made a bad mistake and the team rose to the occasion. Whew.


