E-Scribe : a programmer’s blog

About Me

I'm Paul Bissex. I build web applications using open source software, especially Django. Started my career doing graphic design for newspapers and magazines in the '90s. Then wrote tech commentary and reviews for Wired, Salon, Chicago Tribune, and others you never heard of. Then I built operations software at a photography school. Then I helped big media serve 40 million pages a day. Then I worked on a translation services API doing millions of dollars of business. Now I'm building the core platform of a global startup accelerator. Feel free to email me.

Book

I co-wrote "Python Web Development with Django". It was the first book to cover the long-awaited Django 1.0. Published by Addison-Wesley and still in print!

Colophon

Built using Django, served with gunicorn and nginx. The database is SQLite. Hosted on a FreeBSD VPS at Johncompanies.com. Comment-spam protection by Akismet.

You really should learn regular expressions

Here's another advice post. Luckily, many of you can test out of it, like a college Gen Ed requirement. Here's the test:

  1. What does the following regular expression do? ^http[s]?://([a-z]+\.)?example\.com/$ (Answer below.)

The target audience for this post is people who have heard of regular expressions, but don't use them. Or who have used them a little, but have the feeling they really should know them better.

You're right. You should.

Regular expressions ("regexes") can be found almost anywhere text and code meet. In my own work I've used them for Apache configuration (mod_rewrite rules); Postfix configuration (anti-spam rules); input validation for web forms; Procmail mail processing rules; extracting data from crufty text files; Django URL configurations; search and replace operations in BBEdit, TextMate, and Emacs; and general utility programming in Python, PHP, Perl, and Ruby. Not to mention good old Unix grep.

Gaining regex literacy can be tough. They look so damned ugly. When you first encounter them you think, "There's got to be a more elegant solution!" Perhaps there is, but I've never seen it, and every "simpler" alternative to regexes suffers from one or more of: unsuited to complex problems; no simpler than regexes when used for complex problems; unworkably verbose.

I was driven to finally get a handle on them about four years ago because of spam. I started using mailfilter, a command-line POP client that deleted mail based on rules like DENY=^From:.*savebig. After writing a few hundred rules for that system, I moved on to other spam-rejecting techniques, but I was sold on regex power.
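To make that rule concrete, here is a minimal Python sketch of what a pattern like ^From:.*savebig catches. This is not how mailfilter itself is implemented; the function name, the sample headers, and the assumption that the rule is case-sensitive and tested against each header line are all mine, for illustration only.

```python
import re

# Pattern taken from the rule DENY=^From:.*savebig.
# re.MULTILINE lets ^ match at the start of every header line
# (an assumption about how such a rule would be applied).
DENY_RULE = re.compile(r"^From:.*savebig", re.MULTILINE)


def should_deny(headers):
    """Return True if any header line matches the deny rule."""
    return DENY_RULE.search(headers) is not None


# Hypothetical message headers, invented for this example.
spam = (
    "Received: from mail.example.net\n"
    "From: deals@savebig.example\n"
    "Subject: Hello\n"
)
ham = (
    "From: friend@example.org\n"
    "Subject: savebig on lunch\n"
)

print(should_deny(spam))
print(should_deny(ham))
```

The second message is not denied even though "savebig" appears in it, because "." does not cross line breaks and the word is on the Subject line rather than the From line.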

One relatively obscure regex tip I'll offer is to look into the "verbose" option that most regex engines provide. It's explained well in this article from onlamp.com, which gives the following example for parsing phone numbers:

\(?     # optional parentheses
\d{3}   # area code required
\)?     # optional parentheses
[-\s.]? # separator is either a dash, a space, or a period.
\d{3}   # 3-digit prefix
[-\s.]  # another separator
\d{4}   # 4-digit line number

The part of each line after the "#" is a comment. Tricky, involved regexes are worth documenting just like other tricky, involved algorithms.
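In Python, the verbose option is the re.VERBOSE flag. The sketch below wraps the phone-number pattern above in a compiled regex; the names PHONE and is_phone and the sample numbers are assumptions made for the example, and other regex engines spell the flag differently.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and "#" starts
# a comment (except inside a character class or when escaped).
PHONE = re.compile(
    r"""
    \(?      # optional opening parenthesis
    \d{3}    # area code required
    \)?      # optional closing parenthesis
    [-\s.]?  # optional separator: dash, whitespace, or period
    \d{3}    # 3-digit prefix
    [-\s.]   # another separator (required)
    \d{4}    # 4-digit line number
    """,
    re.VERBOSE,
)


def is_phone(text):
    """Return True if the whole string looks like a phone number."""
    return PHONE.fullmatch(text) is not None


for sample in ["(413) 555-1212", "413.555.1212", "4135551212"]:
    print(sample, is_phone(sample))
```

Note that the last sample fails: the pattern as written requires a separator before the final four digits.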

A funny side note about regexes is that they're kind of like driving directions: there are many ways to get from point A (your source text) to point B (a match), and in any suitably large group of coders you'll hear different opinions on what the fastest or clearest or safest one is.

For playing with regexes and learning how they work, an interactive tool can be very helpful. For example, Regex Coach for Windows and Linux or RegexPlor for the Mac (though I'm not clear on whether that's being maintained). The Komodo IDE also has an interactive regex tool called "RxToolkit."

I'll close with a quote attributed to Jamie Zawinski:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Funny, but too cautious. Besides, if you see every text processing problem as an opportunity to use regexes, at some point you're bound to get good at them!

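Answer to the "test": I was tempted to leave this as an "exercise for the reader." The given regex matches URLs such as http://example.com/, https://example.com/, and http://www.example.com/ (among others!). The "s" is optional, the single subdomain (lowercase letters only) is optional, and the trailing slash is required with nothing after it.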
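A quick way to check an answer like this is to run the pattern against a few candidates. The sketch below uses Python's re module; the candidate URLs and the helper name matches are illustrative assumptions, not part of the original test.

```python
import re

# The regex from the test at the top of the post.
URL_PATTERN = re.compile(r"^http[s]?://([a-z]+\.)?example\.com/$")


def matches(url):
    """Return True if the URL matches the test pattern."""
    return URL_PATTERN.search(url) is not None


# Hypothetical candidates chosen to probe each part of the pattern.
candidates = [
    "http://example.com/",
    "https://example.com/",
    "http://www.example.com/",
    "http://example.com",         # missing trailing slash
    "http://example.com/about/",  # extra path after the slash
    "http://a.b.example.com/",    # two subdomain labels
    "http://WWW.example.com/",    # uppercase subdomain
    "ftp://example.com/",         # wrong scheme
]

for url in candidates:
    print("match   " if matches(url) else "no match", url)
```

Only the first three candidates match; each of the others trips over exactly one piece of the pattern.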

Sunday, December 4th, 2005
6 comments

Comment from Randal L. Schwartz, later that day

The O'Reilly book "Mastering Regular Expressions" is highly recommended. We also spend three chapters on regex in "Learning Perl", and a few more in the sequel, "Learning Perl Objects References and Modules".

Comment from Paul, later that day

Thanks for the comment, Randal -- I meant to mention the O'Reilly books, particularly the "Regular Expression Pocket Reference" that I keep on my desk.

Comment from Kiran Kumar, 4 weeks later

The title "You really should learn regular expressions" is quite ironic. I began my undergraduate studies with regexps and experimenting with them. I always wondered how anyone can conceive of programming without knowing what regexes are - and asking 'programmers' to learn regexps makes one feel like they are retrofitted appendices to a programmer's knowledge. I believe they're not!

Comment from Paul, 4 weeks later

I agree it sounds odd, but remember that not everyone comes to programming by way of formal CS education. In the web world especially, there are people who start with HTML, then learn some Javascript, then a little PHP, then some Java or Perl or Python or Ruby. Such a person can get pretty far in ignorance of seemingly obscure things like regular expressions. I wrote this post with the hopefulness of an evangelist. If I can bring even one lost soul into the light...

Comment from KoRnouille, 22 weeks later

Someone from #django pointed me to this article. I'm a Python fan and taught myself the Linux world and Python programming (OO). And I was always "scared" of regexes (especially coming from the clean Python syntax environment) and until now, I've just been getting around the problem with some ugly hacks. But now I'm kinda stuck with Django's URL mapping scheme. So I am one of these lost souls in the light you've been talking about. :)

Comment from Paul, 22 weeks later

Well, welcome, lost soul!

Perl-style regexes do have a very unpythonic flavor. But for the reasons I outline above they're unlikely to go away soon.

(I was that someone on #django, by the way!)

Comments are closed for this post. But I welcome questions/comments via email or Twitter.