You really should learn regular expressions
Here’s another advice post. Luckily, many of you can test out of it, like a college Gen Ed requirement. Here’s the test:
- What does the following regular expression do?
^http[s]?://([a-z]+\.)?example\.com/$
(Answer below.)
The target audience for this post is people who have heard of regular expressions, but don’t use them. Or who have used them a little, but have the feeling they really should know them better.
You’re right. You should.
Regular expressions (“regexes”) can be found almost anywhere text and code meet. In my own work I’ve used them for Apache configuration (mod_rewrite rules); Postfix configuration (anti-spam rules); input validation for web forms; Procmail mail processing rules; extracting data from crufty text files; Django URL configurations; search and replace operations in BBEdit, TextMate, and Emacs; and general utility programming in Python, PHP, Perl, and Ruby. Not to mention good old Unix grep.
Gaining regex literacy can be tough. They look so damned ugly. When you first encounter them you think, “There’s got to be a more elegant solution!” Perhaps there is, but I’ve never seen it, and every “simpler” alternative to regexes suffers from one or more of: unsuited to complex problems; no simpler than regexes when used for complex problems; unworkably verbose.
I was driven to finally get a handle on them about four years ago because of spam. I started using mailfilter, a command-line POP client that deleted mail based on rules like DENY=^From:.*savebig. After writing a few hundred rules for that system, I moved on to other spam-rejecting techniques, but I was sold on regex power.
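To make that rule concrete, here is a minimal Python sketch of what the regex half of it (the part after DENY=) would match. It only illustrates the pattern; it is not how mailfilter itself is implemented, and the header lines are invented for the example.

```python
import re

# The regex from the rule quoted above (everything after "DENY=").
DENY_RULE = re.compile(r"^From:.*savebig")

# Made-up header lines, for illustration only.
headers = [
    "From: deals@savebig-today.example",
    "From: Alice <alice@example.org>",
    "Subject: savebig on everything",
    "X-Note: From: savebig",
]

# Keep only the lines the rule would flag.
denied = [h for h in headers if DENY_RULE.search(h)]

for h in denied:
    print("would deny:", h)
```

Only the first header is flagged. The ^ anchor ties "From:" to the start of the line, so the Subject line and the X-Note line are left alone even though both contain "savebig".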
One relatively obscure regex tip I’ll offer is to look into the “verbose” option that most regex engines provide. It’s explained well in this article from onlamp.com, which gives the following example for parsing phone numbers:
\(? # optional parentheses
\d{3} # area code required
\)? # optional parentheses
[-\s.]? # separator is either a dash, a space, or a period.
\d{3} # 3-digit prefix
[-\s.] # another separator
\d{4} # 4-digit line number
The part of each line after the “#” is a comment. Tricky, involved regexes are worth documenting just like other tricky, involved algorithms.
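As a rough sketch of how this looks in practice, here is the same phone-number pattern compiled with Python's re.VERBOSE flag, next to the compact one-line version. The sample numbers and the variable names are mine, not the article's, and other regex engines spell the verbose option differently.

```python
import re

# The phone-number pattern from above, compiled with re.VERBOSE so that
# whitespace and "#" comments inside the pattern are ignored.
PHONE = re.compile(r"""
    \(?        # optional parentheses
    \d{3}      # area code required
    \)?        # optional parentheses
    [-\s.]?    # separator is either a dash, a space, or a period
    \d{3}      # 3-digit prefix
    [-\s.]     # another separator
    \d{4}      # 4-digit line number
""", re.VERBOSE)

# The same regex written the compact way, for comparison.
PHONE_COMPACT = re.compile(r"\(?\d{3}\)?[-\s.]?\d{3}[-\s.]\d{4}")

# Invented sample inputs.
samples = ["(555) 867-5309", "555.867.5309", "555-867-5309", "5558675309"]

results = [bool(PHONE.search(s)) for s in samples]
compact_results = [bool(PHONE_COMPACT.search(s)) for s in samples]

for s, ok in zip(samples, results):
    print(s, "->", "match" if ok else "no match")
```

Both versions behave identically. The last sample fails because the second separator is required, which is easy to miss in the compact form and obvious in the commented one.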
A funny side note about regexes is that they’re kind of like driving directions: there are many ways to get from point A (your source text) to point B (a match), and in any suitably large group of coders you’ll hear different opinions on what the fastest or clearest or safest one is.
For playing with regexes and learning how they work, an interactive tool can be very helpful. For example, Regex Coach for Windows and Linux or RegexPlor for the Mac (though I’m not clear on whether that’s being maintained). The Komodo IDE also has an interactive regex tool called “RxToolkit.”
I’ll close with a quote attributed to Jamie Zawinski:
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
Funny, but too cautious. Besides, if you see every text processing problem as an opportunity to use regexes, at some point you’re bound to get good at them!
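Answer to the “test”: I was tempted to leave this as an “exercise for the reader.” The given regex matches the root URL of example.com, over either http or https, with or without a single subdomain made up of lowercase letters. That includes http://example.com/, https://example.com/, and http://www.example.com/ (among others!).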
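If you'd rather check the answer than take my word for it, a short Python sketch will do. The pattern is the one from the top of the post; the list of candidate URLs is just a set of examples picked for illustration.

```python
import re

# The pattern from the top of the post, unchanged.
PATTERN = re.compile(r"^http[s]?://([a-z]+\.)?example\.com/$")

# Sample URLs chosen for illustration.
candidates = [
    "http://example.com/",
    "https://example.com/",
    "http://www.example.com/",
    "https://blog.example.com/",
    "http://example.com",             # no trailing slash
    "http://www2.example.com/",       # digit in the subdomain
    "http://a.b.example.com/",        # two subdomain levels
    "http://example.com/index.html",  # path after the slash
    "ftp://example.com/",             # wrong scheme
]

matches = [url for url in candidates if PATTERN.match(url)]
rejects = [url for url in candidates if not PATTERN.match(url)]

print("matches:", matches)
print("rejects:", rejects)
```

The first four URLs match and the rest are rejected. Reading the pattern left to right explains why: [s]? makes the "s" optional, ([a-z]+\.)? allows at most one lowercase-letters-only subdomain, the backslashes make the dots literal, and /$ insists that the URL end right after the slash.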
Randal L. Schwartz commented on Mon Dec 5 11:05:13 2005:
The O’Reilly book “Mastering Regular Expressions” is highly recommended. We also spend three chapters on regex in “Learning Perl”, and a few more in the sequel, “Learning Perl Objects References and Modules”.
Paul commented on Mon Dec 5 11:14:55 2005:
Thanks for the comment, Randal – I meant to mention the O’Reilly books, particularly the “Regular Expression Pocket Reference” that I keep on my desk.
Kiran Kumar commented on Tue Jan 3 23:08:48 2006:
The title "You really should learn regular expressions" is quite ironic. I began my undergraduate studies with regexps and experimenting with them. I always wondered how anyone can conceive of programming without knowing what regexes are - and asking ‘programmers’ to learn regexps makes one feel like they are retrofitted appendices to a programmer’s knowledge. I believe they are not!
Paul commented on Wed Jan 4 15:49:37 2006:
I agree it sounds odd, but remember that not everyone comes to programming by way of formal CS education. In the web world especially, there are people who start with HTML, then learn some Javascript, then a little PHP, then some Java or Perl or Python or Ruby. Such a person can get pretty far in ignorance of seemingly obscure things like regular expressions. I wrote this post with the hopefulness of an evangelist. If I can bring even one lost soul into the light…
KoRnouille commented on Thu May 11 13:15:41 2006:
Someone from #django pointed me to this article. I’m a Python fan and taught myself the Linux world and Python programming (OO). I was always “scared” of regexes (especially coming from the clean Python syntax environment), and until now I’ve just been getting around the problem with some ugly hacks. But now I’m kinda stuck with Django’s URL mapping scheme. So I am one of these lost souls in the light you’ve been talking about. :)
Paul commented on Thu May 11 13:50:04 2006:
Well, welcome, lost soul!
Perl-style regexes do have a very unpythonic flavor. But for the reasons I outline above they’re unlikely to go away soon.
(I was that someone on #django, by the way!)