
Tools of the Effective Developer: Regular Expressions

Whenever I suggest using regular expressions to solve a string parsing problem, more often than not I’m met with skepticism and frowning faces. Regular expressions have a bad reputation among many of my fellow developers.

(Yes, they are mostly Windows developers; *nix users don’t seem to have this problem.)

But can you blame them? I mean, have a look at this regular expression for validating email addresses.

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?

That’s enough to send any normally functioning individual out the door screaming. Fortunately, regular expressions aren’t always this messy. In fact, the simpler the problem, the more effective and beautiful they get.

Let’s look at an example that shows the power and density of regular expressions.
Suppose we have a chunk of text. We know that somewhere in it there’s a social security number. Our job is to extract it.

This is one of those tasks that takes a decent amount of code unless your language of choice supports regular expressions. In Ruby, the match itself is a single line of code.


text = "Test data that 123-45-6789 contains a social security number."

# =~ returns the match position (or nil); $~ holds the MatchData of the last match.
if text =~ /\d\d\d-\d\d-\d\d\d\d/
  puts $~
else
  puts "No match"
end
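
The same pattern can be written more compactly with counted quantifiers, and String#[] hands back the matched text directly. A minimal sketch, assuming we also want word boundaries so that the pattern does not match inside a longer run of digits (the variable name ssn is mine, not part of the original example):

```ruby
text = "Test data that 123-45-6789 contains a social security number."

# \d{3} means "exactly three digits"; \b keeps the match from starting
# or ending in the middle of a longer digit sequence.
ssn = text[/\b\d{3}-\d{2}-\d{4}\b/]

# String#[] with a regexp returns the matched substring, or nil if nothing matched.
puts ssn || "No match"
```

Whether the word boundaries are appropriate depends on the data, so treat them as an assumption to check against real input.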

If the match operator (=~) and the magical match result variable ($~) put you off, here’s how it’s done in C#, which doesn’t have a special notation for regular expressions but supports them through the .NET Framework.


using System;
using System.Text.RegularExpressions;

String text = "Test data that 123-45-6789 contains a social security number.";

Regex ssnReg = new Regex(@"\d\d\d-\d\d-\d\d\d\d");
Match match = ssnReg.Match(text);

if ( match.Success ) {
  Console.WriteLine(match);   // writes the matched text
} else {
  Console.WriteLine("No match");
}

A beautiful thing about regular expressions is that it’s really simple to extract the parts of a match. For instance, if we need to extract the area number (the first three digits), all we need to do is put parentheses around the part we’re interested in. Then we can easily extract that information, in C# by using the match.Groups property.


String text = "Test data that 123-45-6789 contains a social security number.";

// The parentheses capture the first three digits as group 1.
Regex ssnReg = new Regex(@"(\d\d\d)-\d\d-\d\d\d\d");
Match match = ssnReg.Match(text);

if ( match.Success ) {
  Console.WriteLine(match);
  Console.WriteLine("Area Number: " + match.Groups[1]);
} else {
  Console.WriteLine("No match");
}
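
For comparison, the same extraction can be sketched in Ruby. This version uses named groups instead of numbered ones, which is my own choice rather than something the C# example does; the group names area, group and serial are illustrative labels.

```ruby
text = "Test data that 123-45-6789 contains a social security number."

# (?<name>...) captures a part of the match under a name instead of a number.
SSN_PATTERN = /(?<area>\d{3})-(?<group>\d{2})-(?<serial>\d{4})/

# Regexp#match returns a MatchData object, or nil when there is no match.
match = SSN_PATTERN.match(text)

if match
  puts match[0]                          # the whole match
  puts "Area Number: #{match[:area]}"    # just the named part
else
  puts "No match"
end
```

Named groups cost a few extra characters, but they keep working when another group is later added in front of them.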

As intimidating as they may seem, the payoff from using regular expressions is huge. And, since efficiency is what we strive for, they have a natural place in our bag of tools. So, learn the basics of regular expressions. You’ll be happy you did.

Finally, some advice from my own experience with regular expressions.

•   Get your regular expressions right first. Use Rubular, or an equivalent tool, for pain-free experimentation before you implement them in your code.
•   Document your regular expressions. They are notoriously difficult to read, so a short description with example match data is almost always a good idea.
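
As one way of following the second piece of advice, here is a sketch that uses Ruby's extended mode (the x flag), which ignores whitespace inside the pattern and allows comments next to each part. The constant name SSN and the example data are my own; the part names (area, group, serial) are the usual terms for the three sections of a social security number.

```ruby
# Matches a US social security number written with dashes, such as "123-45-6789".
# Does not match "123456789" (no dashes) or "12-345-6789" (wrong grouping).
SSN = /
  \b
  (\d{3})   # area number, three digits
  -
  (\d{2})   # group number, two digits
  -
  (\d{4})   # serial number, four digits
  \b
/x

puts(SSN.match?("123-45-6789") ? "match" : "no match")
```

One thing to watch for in extended mode: a literal space in the pattern has to be escaped, since unescaped whitespace is ignored.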

Cheers!

Previous posts in the Tools of The Effective Developer series:

  1. Tools of The Effective Developer: Personal Logs
  2. Tools of The Effective Developer: Personal Planning
  3. Tools of The Effective Developer: Programming By Intention
  4. Tools of The Effective Developer: Customer View
  5. Tools of The Effective Developer: Fail Fast!
  6. Tools of The Effective Developer: Make It Work – First!
  7. Tools of The Effective Developer: Whetstones
  8. Tools of The Effective Developer: Rule of Three
  9. Tools of The Effective Developer: Touch Typing
  10. Tools of The Effective Developer: Error Handling Infrastructure
Categories: C#, ruby, software development
  1. RJ
    May 13th, 2010 at 01:28 | #1

    The thing that drives me nuts about regular expressions is that I hardly ever use them. I just don’t need to in 99% of my programming.

    However, in that 1% when I do need them, it invariably results in me going to one of the various “Introduction to Regex” sites, re-learning what I need to know, and spending an hour or so on regular expressions.

    Maybe someone needs to create a paid for site where you submit what you need and get a regex back ;~)

  2. May 13th, 2010 at 08:49 | #2

    @RJ
    Ah yes, I know what you mean. However, I tend to use them often enough that at least the basics stick.
    A regex expert for hire site? That could fly 🙂

  3. cowardlydragon
    May 13th, 2010 at 20:49 | #3

    The #1 thing to remember about regular expressions is that they can solve the obvious use case of many problems, but once edge cases crop up, regular expressions lack the computational flexibility to handle them.

    They are also self-obfuscating.

    However, they are undeniably useful.

  4. May 13th, 2010 at 21:35 | #4

    @cowardlydragon
    Totally agree. Validating e-mail addresses is a good example of regexps failing to provide a perfect solution. And being “self-obfuscating” is the reason why it’s important to document the behavior, i.e. with comments.

  5. Joep
    May 14th, 2010 at 07:18 | #5

    I prefer to think of Regexes as state-machines, or more specifically nondeterministic state machines. http://en.wikipedia.org/wiki/Nondeterministic_finite_state_machine

    If you understand what an NDFA IS, I think regex becomes far easier to comprehend and use effectively.

  6. May 15th, 2010 at 08:27 | #6

    @Joep
    Absolutely, regular expressions originate from automata theory, so if you want to master them you need to go to the NDFA level. However, for the problems I usually deal with, thinking in terms of string patterns is quite enough. But then again, in the face of a more complicated expression I’m usually lost. Hm, next time that happens to me I think I’ll take your advice and visualize the problem with a state diagram 🙂
    Thank you for your comment.

  7. deeringc
    May 18th, 2010 at 14:12 | #7

    It’s actually much much worse for email matching if you want full compliance with RFC822.

    Beware: http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.

  8. Patrick
    May 20th, 2010 at 21:05 | #8

    @RJ ever seen http://txt2re.com/ ?
