Tools of the Effective Developer: Regular Expressions
Whenever I suggest using regular expressions to solve a string parsing problem, more often than not I’m met with skepticism and frowning faces. Regular expressions have a bad reputation among many of my fellow developers.
(Yes, they are mostly Windows developers; *nix users don’t seem to have this problem.)
But can you blame them? I mean, have a look at this regular expression for validating email addresses.
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
That’s enough to send any normally functioning individual out the door screaming. Fortunately, regular expressions aren’t always this messy. In fact, the simpler the problem, the more effective and beautiful they get.
Let’s look at an example that shows the power and density of regular expressions.
Suppose we have a chunk of text. We know that somewhere in it there’s a social security number. Our job is to extract it.
This is one of those tasks that take a decent amount of code unless your language of choice supports regular expressions. In Ruby, on the other hand, the match itself is a single line of code.
text = "Test data that 123-45-6789 contains a social security number."
if text =~ /\d\d\d-\d\d-\d\d\d\d/
  # $~ holds the MatchData of the last successful match
  puts $~
else
  puts "No match"
end
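The pattern above can also be written more compactly with counted quantifiers. The following is only a sketch: the helper name extract_ssn is made up for illustration, and the \b word boundaries are my own addition so that the pattern doesn’t match inside a longer run of digits.

```ruby
# Hypothetical helper (not part of any library): returns the first
# SSN-shaped substring in the text, or nil if there is none.
def extract_ssn(text)
  # \d{3} means "exactly three digits"; \b is a word boundary, so the
  # pattern does not match inside a longer run of digits or letters.
  text[/\b\d{3}-\d{2}-\d{4}\b/]
end

text = "Test data that 123-45-6789 contains a social security number."
puts extract_ssn(text) || "No match"   # prints "123-45-6789"
```

String#[] with a regular expression returns the matched substring directly, so there is no need to touch the match operator or any special variables.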
If the match operator (=~) and the magical match result variable ($~) put you off, here’s how it’s done in C#, which doesn’t have a special notation for regular expressions but supports them through the .NET Framework.
// Requires: using System; using System.Text.RegularExpressions;
String text = "Test data that 123-45-6789 contains a social security number.";
Regex ssnReg = new Regex(@"\d\d\d-\d\d-\d\d\d\d");
Match match = ssnReg.Match(text);
if ( match.Success ) {
    Console.WriteLine(match);
} else {
    Console.WriteLine("No match");
}
A beautiful thing about regular expressions is that it’s really simple to extract the parts of a match. For instance, if we need to extract the area number (the first three digits), all we need to do is put parentheses around the part we’re interested in. Then we can easily extract that information, in C# by using the match.Groups property.
String text = "Test data that 123-45-6789 contains a social security number.";
// The parentheses capture the first three digits as group 1
Regex ssnReg = new Regex(@"(\d\d\d)-\d\d-\d\d\d\d");
Match match = ssnReg.Match(text);
if ( match.Success ) {
    Console.WriteLine(match);
    Console.WriteLine("Area Number: " + match.Groups[1]);
} else {
    Console.WriteLine("No match");
}
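For comparison, here is roughly how the same extraction might look in Ruby using named groups instead of numbered ones. Treat it as a sketch: the helper ssn_parts and the group names area, group and serial are my own choices for illustration.

```ruby
# Hypothetical helper: splits the first SSN-shaped match into its parts.
# Returns a hash with the three parts, or nil if nothing matches.
def ssn_parts(text)
  # (?<name>...) is a named capture group; String#match returns
  # a MatchData object on success and nil otherwise.
  match = text.match(/(?<area>\d{3})-(?<group>\d{2})-(?<serial>\d{4})/)
  return nil unless match

  { area: match[:area], group: match[:group], serial: match[:serial] }
end

parts = ssn_parts("Test data that 123-45-6789 contains a social security number.")
puts "Area Number: #{parts[:area]}" if parts   # prints "Area Number: 123"
```

Named groups cost a few extra characters in the pattern, but the code that reads the match no longer depends on counting parentheses.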
As intimidating as they may seem, the payoff of using regular expressions is huge. And, since efficiency is what we strive for, they have a natural place in our bag of tools. So, learn the basics of regular expressions. You’ll be happy you did.
Finally, some advice drawn from my own experience with regular expressions.
• Get your regular expressions right first. Use Rubular, or an equivalent tool, for pain-free experimentation before you implement them in your code.
• Document your regular expressions. They are notoriously difficult to read, so a short description with example match data is almost always a good idea.
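As one possible way to follow the second piece of advice, Ruby’s extended mode (the x flag) ignores whitespace inside the pattern and allows comments, so the description and example data can live right next to the expression. This is a sketch only; the method name ssn_regex is hypothetical.

```ruby
# Matches a US social security number written as AAA-GG-SSSS.
# Example match:     "123-45-6789"
# Example non-match: "123456789" (no dashes)
def ssn_regex
  /
    \b
    (\d{3})  # area number, e.g. "123"
    -
    (\d{2})  # group number, e.g. "45"
    -
    (\d{4})  # serial number, e.g. "6789"
    \b
  /x
end

text = "Test data that 123-45-6789 contains a social security number."
puts text[ssn_regex]      # prints "123-45-6789"
puts text[ssn_regex, 1]   # prints "123" (the first capture group)
```

One thing to watch for in extended mode: a literal space in the pattern has to be escaped, since unescaped whitespace is ignored.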
Cheers!
Previous posts in the Tools of The Effective Developer series:
- Tools of The Effective Developer: Personal Logs
- Tools of The Effective Developer: Personal Planning
- Tools of The Effective Developer: Programming By Intention
- Tools of The Effective Developer: Customer View
- Tools of The Effective Developer: Fail Fast!
- Tools of The Effective Developer: Make It Work – First!
- Tools of The Effective Developer: Whetstones
- Tools of The Effective Developer: Rule of Three
- Tools of The Effective Developer: Touch Typing
- Tools of The Effective Developer: Error Handling Infrastructure