E-Scribe : a programmer’s blog

About Me

I'm Paul Bissex. I build web applications using open source software, especially Django. Started my career doing graphic design for newspapers and magazines in the '90s. Then wrote tech commentary and reviews for Wired, Salon, Chicago Tribune, and others you never heard of. Then I built operations software at a photography school. Then I helped big media serve 40 million pages a day. Then I worked on a translation services API doing millions of dollars of business. Now I'm building the core platform of a global startup accelerator. Feel free to email me.

Book

I co-wrote "Python Web Development with Django". It was the first book to cover the long-awaited Django 1.0. Published by Addison-Wesley and still in print!

How things get better after you screw up at work

(Hint: it's about your team.)

A couple weeks ago I accidentally replaced our live, production database with a 17-hour-old snapshot.

This is an always-on application with users around the globe, so the mistake was likely to have blown away some new user-entered data.

I didn't realize what I had done for an hour or so (I thought I had targeted a test database server, not production). When it hit me, I had already left work. Here is how we handled it, step by step, with an emphasis on the “good engineering team culture” aspect:

  1. I immediately shared my realization of the crisis with the team. I did not try to fix it myself, or pretend I didn't know what had happened. I was able to do this because I knew the team had my back.

  2. Available team members immediately dove into confirming, assessing, and mitigating the problem. (Since I was in transit I was not yet able to get on a computer.) Focus was on minimizing pain for our users and the business, not on blame, resentment, or face-saving.

  3. User monitoring tools used by our UX person gave us critical info on which users had potentially lost data. We shared knowledge.

  4. We didn't think we had a more recent backup than the snapshot I had used — but one of the engineers had been making more frequent snapshots as part of a new project. He had been done with work for hours (he's in a different time zone), but when he saw the chatter on Slack he jumped in to help. He didn't say, “not my job.”

  5. After we had reached a stable state, people signed off, but I stayed on to double-check things and write up a summary to broadcast to the team. Communication is key.

  6. The next day, we scheduled a postmortem meeting to discuss the incident. This is a standard practice, and an important one for building teams that can learn and grow from mistakes. It's “blameless”: the focus is on what happened, how we responded, what the business impact was, and what we can do to reduce the chance of recurrence. An important part of prevention is making measures more concrete and realistic than “try not to make that mistake.” In the end we lost only about 90 minutes of database history, and accounted for all user data added in that period.
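
For illustration, one concrete measure of this kind is a guard that refuses a destructive restore against a production host unless the operator retypes the hostname. The sketch below is hypothetical: the hostnames, the function name, and the exception are all assumptions made up for the example, not the tooling or the changes described in this post.

```python
# Hypothetical guard for destructive database operations.
# Every name and hostname here is illustrative, not taken from a real system.
PRODUCTION_HOSTS = {"db.prod.example.com"}  # assumed production hostname


class RefusedRestore(Exception):
    """Raised when a restore targets production without explicit confirmation."""


def check_restore_target(host, confirm=None):
    """Return the normalized host if a restore may proceed.

    Non-production hosts pass straight through. A production host is
    allowed only when `confirm` repeats the hostname exactly, which forces
    the operator to look at the target before overwriting it.
    """
    normalized = host.strip().lower()
    if normalized not in PRODUCTION_HOSTS:
        return normalized
    if confirm is None or confirm.strip().lower() != normalized:
        raise RefusedRestore(
            f"{normalized} is a production host; retype its name to confirm"
        )
    return normalized
```

A restore script could call `check_restore_target(host, confirm)` before touching anything and abort on `RefusedRestore`. The point is not this particular check but that the safeguard lives in the tooling rather than in someone's memory.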

I made a bad mistake, the team rose to the occasion, we were lucky to have good mitigation options, and we are making changes to reduce the chance of the mistake happening again. Win.

Sunday, October 7th, 2018
+ + +
