Readable Regular Expressions
Introduction
At a previous job, I was asked to participate in a technical interview of a developer candidate. Midway through the interview, our manager joined. I was new to the company myself, and when it was my turn to ask questions, I asked the candidate if he was comfortable using regular expressions, and if he could write one to validate an email address. I thought this was a reasonable question, because I use regular expressions just about every day in my development work. Sometimes I use them in program code, and sometimes I use them to search or edit code or data. I use them in editors like vim, in text processing tools like awk and sed, and in programming languages like Ruby and JavaScript.
The candidate had a little trouble with the email validation, so I tried to assist him when he was struggling. (I understand how someone can get flustered during an interview and not demonstrate their true abilities, just because they were put on the spot or were a bit nervous.) After the interview, one of my coworkers chided me for asking questions about regular expressions, as though that was an unreasonable skill to expect a developer to have. But my manager thanked me for asking good questions, and a day or so later, all the developers, including me, received an email from our manager stating that going forward, I would be the lead interviewer for all developer positions. I had been at the company only a month or so at that time, and was the newest developer on the team.
So I wondered, “why did my manager like my interview questions, and my coworker dislike them?” It led me to think that maybe this coworker, who was a very capable, senior-level developer, wasn’t comfortable using regular expressions herself. I understand why some developers avoid regular expressions. They can be challenging to learn, and all but the simplest ones tend to be hard to read, because they are terse and difficult to parse. But they also provide the most elegant solutions to some programming problems.
Understanding Regular Expressions
If you haven’t seen them before, you might be wondering what regular expressions are. The phrase “regular expression” is derived from automata theory, but the name doesn’t describe what they do very well, and they are commonly just called “regexes” for short. Exploring the origin of the name won’t really help much in understanding them, but for a working definition, just think of regular expressions as “character sequences that describe text patterns we want to search for”. In several languages, including Ruby, JavaScript, and Perl, the character sequences are written between forward slashes. For example, the regular expression /at/ will match the “at” in “bat”, “cat”, “latter”, and “ate”, but we can do much more than simple matches like these. Let’s say we wanted to find every <h2> tag that contains content with more than five words. This will do it, matching the enclosing h2 tags and their content:
/<h2>(?:\s*\b\w+\b\s*){6,}.*?<\/h2>/
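To see that regex in action, here is a minimal Ruby sketch. The sample markup is made up for illustration, and the variable names are my own. Note that the pattern expects the first six words to be separated only by whitespace, so a heading with punctuation between its early words would be skipped.

```ruby
# Sample markup, invented for this illustration.
html = <<~HTML
  <h2>Short title</h2>
  <h2>A much longer heading with many words</h2>
HTML

long_heading = /<h2>(?:\s*\b\w+\b\s*){6,}.*?<\/h2>/

# The regex has no capture groups, so scan returns each full match as a string.
matches = html.scan(long_heading)
puts matches
```

Only the second heading is returned, because the first one contains just two words.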
Practical Uses of Regular Expressions
There are many things we can do with the results of a regex match. We might just want a count of the matches or we might want to replace the matched text with different text. For example, we could change all <h2> tags to <h3> or maybe add a class to the <h2> tags like this:
<h2 class='section_title'>
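Here is a rough Ruby sketch of those three operations (counting, renaming the tag, and adding a class). The sample markup and variable names are assumptions for illustration only.

```ruby
# Sample markup, invented for this illustration.
html = "<h2>Intro</h2><p>Text</p><h2>Details</h2>"

# Count the opening <h2> tags.
count = html.scan(/<h2>/).length

# Change <h2> and </h2> to <h3> and </h3>; \1 reinserts the optional slash.
as_h3 = html.gsub(/<(\/?)h2>/, '<\1h3>')

# Add a class to every opening <h2> tag.
with_class = html.gsub(/<h2>/, "<h2 class='section_title'>")

puts count
puts as_h3
puts with_class
```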
Or, what if we wanted to parse a file and retrieve all the currency references, like $14.23? Here’s a regular expression that will find them:
/(?<!\S)\$(?:0|[1-9]\d{0,2})(?:,\d{3})*(?:\.\d{2})?(?!\S)/
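As a quick illustration, this Ruby sketch runs the currency regex over a made-up sentence (the text and variable names are assumptions). The lookarounds require whitespace or the string boundary on both sides, so US$5 is skipped, and the leading-zero amount $012 is rejected as well.

```ruby
currency = /(?<!\S)\$(?:0|[1-9]\d{0,2})(?:,\d{3})*(?:\.\d{2})?(?!\S)/

# Sample text, invented for this illustration.
text = 'Lunch was $14.23 and rent is $1,250.00 but US$5 and $012 are skipped.'

# No capture groups, so scan returns the full matches.
amounts = text.scan(currency)
p amounts
```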
The Readability Challenge
As you can see, once we go beyond simple literal matches, regular expressions can become difficult to read. Sometimes we can simplify long regular expressions. If we don't mind also matching amounts with leading zeros, such as $007, our currency matcher above can be rewritten like this:
/(?<!\S)\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?(?!\S)/
but that might not be immediately evident by just scanning the original expression, and it's easy to make mistakes doing this refactoring. Subtle changes to a regex can drastically change its behavior. In this post I will address the readability problem.
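To make the point about subtle changes concrete, here is a small Ruby sketch comparing the two currency regexes on a few made-up samples. The names strict and relaxed are my own labels. The two agree on ordinary amounts, but the shorter regex also accepts leading zeros.

```ruby
strict  = /(?<!\S)\$(?:0|[1-9]\d{0,2})(?:,\d{3})*(?:\.\d{2})?(?!\S)/
relaxed = /(?<!\S)\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?(?!\S)/

# Sample amounts, invented for this illustration.
samples = ['$1,234.56', '$0.99', '$007', '$012,345']

samples.each do |sample|
  puts format('%-10s strict=%-5s relaxed=%s',
              sample, strict.match?(sample), relaxed.match?(sample))
end

# The samples on which the two regexes disagree.
differing = samples.select { |s| strict.match?(s) != relaxed.match?(s) }
p differing
```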
Learning Regular Expressions
I had intended to include an introduction to regular expressions in this post, but I found an excellent YouTube video series for learning them, so I will just include that link here instead:
Simplified Regular Expressions
Addressing the Readability Problem
To address the readability problem, some regex engines support a free-spacing mode (sometimes called verbose or extended mode) that will permit adding spaces for readability. Here's our currency regex with free-spacing mode turned on (by the x after the closing forward slash), and spaces added between different parts of the regex.
/(?<!\S) \$ \d{1,3} (?:,\d{3})* (?:\.\d{2})? (?!\S)/x
These spaces are ignored by the regex engine. Ruby, Perl, and PHP use the x flag at the end of the regex to enable free-spacing mode. Other languages use different syntaxes. I will continue to use Ruby syntax throughout this post for its simplicity. You can easily look up the syntax for other languages you are interested in. JavaScript doesn't support free-spacing mode natively, but you can add it by importing the XRegExp library.
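A short Ruby sketch can confirm that the spaced-out version behaves like the compact one. It also shows one pitfall to keep in mind: because whitespace in the pattern is ignored under the x flag, a space you actually want to match has to be written another way, such as \s. The sample strings and variable names below are made up.

```ruby
compact = /(?<!\S)\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?(?!\S)/
spaced  = /(?<!\S) \$ \d{1,3} (?:,\d{3})* (?:\.\d{2})? (?!\S)/x

# Both versions should give the same answer for every sample.
same = ['$14.23', '$1,000', 'x$5'].all? do |s|
  compact.match?(s) == spaced.match?(s)
end

# Pitfall: the space after "total:" is ignored in free-spacing mode,
# so this pattern really means "total:" immediately followed by "$".
ignores_space = /total: \$\d+/x.match?('total: $5')

# Using \s matches the whitespace explicitly.
matches_space = /total: \s \$\d+/x.match?('total: $5')

p same, ignores_space, matches_space
```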
We can take further advantage of free-spacing mode by using newlines instead of single space characters to transform our regex into this:
/
(?<!\S)
\$
\d{1,3}
(?:,\d{3})*
(?:\.\d{2})?
(?!\S)
/x
and we can add comments, like this:
/
(?<!\S) # Negative lookbehind: ensure no non-whitespace character before
\$ # Match a literal dollar sign
\d{1,3} # Match 1 to 3 digits
(?:,\d{3})* # Match 0 or more groups of comma followed by exactly 3 digits
(?:\.\d{2})? # Optionally match a decimal point followed by exactly 2 digits
(?!\S) # Negative lookahead: ensure no non-whitespace character after
/x # Enable free-spacing mode
I think you can see where I'm going with this. Our regex is now much easier to understand, and if we need to make a modification, it's pretty clear where to do it. For example, if I wanted to also match currency amounts using the € symbol for euros, I would change the regex to this:
/
(?<!\S) # Negative lookbehind: ensure no non-whitespace character before
[$€] # Match either a dollar sign or a euro sign
\d{1,3} # Match 1 to 3 digits
(?:,\d{3})* # Match 0 or more groups of comma followed by exactly 3 digits
(?:\.\d{2})? # Optionally match a decimal point followed by exactly 2 digits
(?!\S) # Negative lookahead: ensure no non-whitespace character after
/x # Enable free-spacing mode
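Here is a sketch of the commented regex used in runnable Ruby code, assuming the file is saved as UTF-8 (Ruby's default source encoding). The sample sentence and the variable names are made up for illustration.

```ruby
currency = /
  (?<!\S)       # Negative lookbehind: ensure no non-whitespace character before
  [$€]          # Match either a dollar sign or a euro sign
  \d{1,3}       # Match 1 to 3 digits
  (?:,\d{3})*   # Match 0 or more groups of comma followed by exactly 3 digits
  (?:\.\d{2})?  # Optionally match a decimal point followed by exactly 2 digits
  (?!\S)        # Negative lookahead: ensure no non-whitespace character after
/x

# Sample text, invented for this illustration.
text = 'Tickets cost €1,250.00 or $14.23 but not £9.99 today.'

found = text.scan(currency)
p found
```

The pound amount is skipped because £ is not in the character set.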
Creating Regular Expressions in Code
In most programming languages you can create a regular expression from a string. For example, in Ruby, we could do this for our earlier example:
regex_string = '(?<!\S) \$ \d{1,3} (?:,\d{3})* (?:\.\d{2})? (?!\S)'
regex = Regexp.new(regex_string, Regexp::EXTENDED)
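As a sanity check, this sketch compares the string-built regex with the equivalent literal on a few made-up samples; the names literal, samples, and agree are my own.

```ruby
regex_string = '(?<!\S) \$ \d{1,3} (?:,\d{3})* (?:\.\d{2})? (?!\S)'
regex = Regexp.new(regex_string, Regexp::EXTENDED)

# The same pattern written as a literal with the x flag.
literal = /(?<!\S) \$ \d{1,3} (?:,\d{3})* (?:\.\d{2})? (?!\S)/x

# Sample strings, invented for this illustration.
samples = ['$14.23', '$1,000.00', 'US$5', '$12.3']
agree = samples.all? { |s| regex.match?(s) == literal.match?(s) }

p agree
p regex.source == regex_string # the regex keeps the original string as its source
```

Single quotes matter here: they keep the backslashes intact, so the string reaches Regexp.new exactly as written.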
This is convenient when you want to build a regex programmatically. You might be wondering why you would ever need to do this, so here is an example use case.
Use Case: Generating Numbers with Distinct Odd-Even Digit Patterns
Assume we have a number generator that produces 10-digit numbers for us. It's not important how the numbers are generated; we just need to validate each newly generated candidate number against all previously saved numbers. We need numbers that are unique in terms of their odd-even digit pattern, which is determined by the number's sequence of odd and even digits. For example, the number 5234556780 has the pattern OEOEOOEOEE, where 'O' represents an odd digit and 'E' represents an even digit.
The goal is to save only those newly generated numbers whose odd-even pattern differs from every existing number's pattern. When a number is validated, that is, its odd-even pattern of digits is unique, we save it to our database. This criterion means that numbers with different digits but identical odd-even patterns are considered incompatible. So, 5234556780 would be incompatible with 7090174522, because they both have the odd-even digit pattern OEOEOOEOEE. This example might seem contrived, but it's actually a simplified use case from a real production application.
Conceptually, building the regex consists of:
mapping each newly generated number's digits to 'O' for odd or 'E' for even, then
replacing 'O' with [13579] and 'E' with [02468]
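The two conceptual steps could be sketched in Ruby like this. It is an illustration only; the variable names pattern and character_sets are my own.

```ruby
number = 5234556780

# Step 1: map each digit to 'O' (odd) or 'E' (even).
# tr pads the one-character replacement to cover every listed digit.
pattern = number.to_s.tr('13579', 'O').tr('02468', 'E')

# Step 2: replace each letter with the matching character set.
character_sets = { 'O' => '[13579]', 'E' => '[02468]' }
regex_string = pattern.gsub(/[OE]/, character_sets)

puts pattern
puts regex_string
```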
Coding this algorithm, we can omit the step of mapping the digits to 'O' or 'E', and map them directly to the appropriate character set. In the Ruby code below, number is our new candidate number:
number = 5234556780
regex_string = number.to_s.gsub(/\d/) { |d| d.to_i.odd? ? '[13579]' : '[02468]' }
odd_even_regex = Regexp.new(regex_string)
In case you're not familiar with Ruby, here's a breakdown of the line of code that defines regex_string:
convert the number to a string using the to_s method, so we can use the String method gsub
using gsub, test each digit for oddness:
if it's odd, replace it with the character set [13579],
otherwise replace it with the character set [02468]
For 5234556780, the above code would assign odd_even_regex like this:
odd_even_regex = /[13579][02468][13579][02468][13579][13579][02468][13579][02468][02468]/
We then check all the existing numbers for a match. If a match is found, we discard this number and move on to the next. Here's what the regex test looks like in Ruby:
existing_numbers.grep(odd_even_regex).empty?
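where existing_numbers is an array containing all the previously saved numbers, stored as strings. (grep compares each element with Regexp#===, which returns false for Integers, so an array of integers would never produce a match.) If the above line of code returns true, meaning the array of matches is empty, we save the new number.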
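Putting the pieces together, here is a minimal end-to-end sketch. The helper names odd_even_regex and unique_pattern? and the sample data are hypothetical, not taken from the production application. I've also anchored the regex with \A and \z, which is an extra precaution on my part in case the saved values are ever longer than 10 digits.

```ruby
# Hypothetical helper: builds the anchored odd-even regex for a number.
def odd_even_regex(number)
  body = number.to_s.gsub(/\d/) { |d| d.to_i.odd? ? '[13579]' : '[02468]' }
  Regexp.new("\\A#{body}\\z")
end

# Hypothetical helper: true when no saved number shares the candidate's pattern.
def unique_pattern?(candidate, existing_numbers)
  existing_numbers.grep(odd_even_regex(candidate)).empty?
end

# Saved numbers are kept as strings so that grep can match them.
existing_numbers = %w[7090174522 1234567890]

p unique_pattern?(5234556780, existing_numbers) # pattern already used by 7090174522
p unique_pattern?(2468024680, existing_numbers) # all-even pattern, not saved yet
```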
This technique of building regexes from strings can also be used as an alternative to free-spacing mode for languages that don't support it. You can build the multiline regex string, formatted any way you like, then just process out the comments and spaces before giving it to the regex constructor function. Here's a brief JavaScript example for matching three-digit numbers (from 100 to 999):
const regexString = `
[1-9] // Hundreds place, no leading zero
\\d // Tens place, any digit
\\d // Ones place, any digit
`.replace(/\/\/.*$/gm, '').replace(/\s+/g, ''); // remove comments and unnecessary whitespace
const regex = new RegExp(regexString);
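Notice that we used a regex in each of the replace() calls, first to remove comments and second to remove whitespace from regexString, then we used regexString to instantiate our desired regular expression, /[1-9]\d\d/. Because this simple cleanup strips every // sequence and every whitespace character, it only suits patterns that contain neither as literal text.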
Tools for Refactoring and Testing Regular Expressions
So, now I've addressed how to write readable regular expressions, but what if you encounter an existing one that is not formatted nicely? You could refactor it, but this is somewhat prone to error, because it still requires you to manually parse the regex so you know where to add spaces and comments. My preferred solution is to let an AI tool refactor it. If you don't already have a favorite AI tool, you can use Perplexity (https://www.perplexity.ai) without an account. Here is an example prompt you can use:
Refactor this regular expression to make it more readable by using the free-spacing mode:
<put your regex here>
By default Perplexity will add comments to the refactored regex, but you can tell it to forgo that if you prefer.
It's important to test the AI-refactored regex. You can test your regexes using the Regular Expressions 101 site (https://regex101.com). Note that sometimes regex101 shows an error if you paste in the slashes that delimit the start and end of the regex. You can also use Perplexity to help you debug regexes.
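Alongside an online tester, a few lines of Ruby can check that a refactored regex still agrees with the original on a set of samples. This is only a sketch: the helper name disagreements and the sample strings are my own, and a real check should use inputs drawn from your actual data.

```ruby
# Hypothetical helper: returns the samples on which the two regexes disagree.
def disagreements(original, refactored, samples)
  samples.reject { |s| original.match?(s) == refactored.match?(s) }
end

original = /(?<!\S)\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?(?!\S)/

refactored = /
  (?<!\S)       # no non-whitespace character before
  \$            # literal dollar sign
  \d{1,3}       # 1 to 3 digits
  (?:,\d{3})*   # optional thousands groups
  (?:\.\d{2})?  # optional cents
  (?!\S)        # no non-whitespace character after
/x

# Sample inputs, invented for this illustration.
samples = ['$14.23', '$1,000', '$1,00', 'US$5', 'price: $3.50 each', '14.23']

p disagreements(original, refactored, samples)
```

An empty array means the two versions agree on every sample. That doesn't prove they are equivalent, but any sample that does show up in the result points straight at a behavior change.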
Conclusion
Whether you are new to software development or have been doing it for a long time, I encourage you to learn about regular expressions. The important thing is to learn something new about them and begin using what you’ve learned right away. After you’ve mastered one feature, learn another feature and begin using it. Once you are comfortable with them, I think you’ll find many uses for regular expressions.
Written by
Brian Ness
I began my career as a software engineer with Cray Research, primarily focused on software tools and processes. There, I developed a passion for elegant design. If our tools and processes aren't fun to use, there's probably a better way to work. I like to streamline things that aren't efficient, and find better ways to develop software.