Regex, the good bits.
There are two types of developers: those who fear regex because they don't understand it and those who abuse regex to flex on their millennial teammates.
The purpose of this blog is to get you somewhere in between. Know the bits that will be super useful without being dangerous.
Wait, regex can be dangerous?
Regex can do some spectacular things. You can write entire programs in it. But just because you can doesn't mean you should. A giant regex pattern tends to use all the powerful bits at once: recursive patterns, conditional patterns, lookaheads and lookbehinds, and side effects introduced with a replace.
I mean look at this:
(function(a,b){if(/(android|bb\d+|meego).+mobile|avantgo|bada\/|blackberry|blazer|compal|elaine|fennec|hiptop|iemobile|ip(hone|od)|iris|kindle|lge |maemo|midp|mmp|mobile.+firefox|netfront|opera m(ob|in)i|palm( os)?|phone|p(ixi|re)\/|plucker|pocket|psp|series(4|6)0|symbian|treo|up\.(browser|link)|vodafone|wap|windows ce|xda|xiino/i.test(a)||/1207|6310|6590|3gso|4thp|50[1-6]i|770s|802s|a wa|abac|ac(er|oo|s\-)|ai(ko|rn)|al(av|ca|co)|amoi|an(ex|ny|yw)|aptu|ar(ch|go)|as(te|us)|attw|au(di|\-m|r |s )|avan|be(ck|ll|nq)|bi(lb|rd)|bl(ac|az)|br(e|v)w|bumb|bw\-(n|u)|c55\/|capi|ccwa|cdm\-|cell|chtm|cldc|cmd\-|co(mp|nd)|craw|da(it|ll|ng)|dbte|dc\-s|devi|dica|dmob|do(c|p)o|ds(12|\-d)|el(49|ai)|em(l2|ul)|er(ic|k0)|esl8|ez([4-7]0|os|wa|ze)|fetc|fly(\-|_)|g1 u|g560|gene|gf\-5|g\-mo|go(\.w|od)|gr(ad|un)|haie|hcit|hd\-(m|p|t)|hei\-|hi(pt|ta)|hp( i|ip)|hs\-c|ht(c(\-| |_|a|g|p|s|t)|tp)|hu(aw|tc)|i\-(20|go|ma)|i230|iac( |\-|\/)|ibro|idea|ig01|ikom|im1k|inno|ipaq|iris|ja(t|v)a|jbro|jemu|jigs|kddi|keji|kgt( |\/)|klon|kpt |kwc\-|kyo(c|k)|le(no|xi)|lg( g|\/(k|l|u)|50|54|\-[a-w])|libw|lynx|m1\-w|m3ga|m50\/|ma(te|ui|xo)|mc(01|21|ca)|m\-cr|me(rc|ri)|mi(o8|oa|ts)|mmef|mo(01|02|bi|de|do|t(\-| |o|v)|zz)|mt(50|p1|v )|mwbp|mywa|n10[0-2]|n20[2-3]|n30(0|2)|n50(0|2|5)|n7(0(0|1)|10)|ne((c|m)\-|on|tf|wf|wg|wt)|nok(6|i)|nzph|o2im|op(ti|wv)|oran|owg1|p800|pan(a|d|t)|pdxg|pg(13|\-([1-8]|c))|phil|pire|pl(ay|uc)|pn\-2|po(ck|rt|se)|prox|psio|pt\-g|qa\-a|qc(07|12|21|32|60|\-[2-7]|i\-)|qtek|r380|r600|raks|rim9|ro(ve|zo)|s55\/|sa(ge|ma|mm|ms|ny|va)|sc(01|h\-|oo|p\-)|sdk\/|se(c(\-|0|1)|47|mc|nd|ri)|sgh\-|shar|sie(\-|m)|sk\-0|sl(45|id)|sm(al|ar|b3|it|t5)|so(ft|ny)|sp(01|h\-|v\-|v )|sy(01|mb)|t2(18|50)|t6(00|10|18)|ta(gt|lk)|tcl\-|tdg\-|tel(i|m)|tim\-|t\-mo|to(pl|sh)|ts(70|m\-|m3|m5)|tx\-9|up(\.b|g1|si)|utst|v400|v750|veri|vi(rg|te)|vk(40|5[0-3]|\-v)|vm40|voda|vulc|vx(52|53|60|61|70|80|81|83|85|98)|w3c(\-| )|webc|whit|wi(g 
|nc|nw)|wmlb|wonu|x700|yas\-|your|zeto|zte\-/i.test(a.substr(0,4)))window.location=b})(navigator.userAgent||navigator.vendor||window.opera,'http://detectmobilebrowser.com/mobile');
This was a somewhat commonly used pattern to detect mobile browsers at some point.
My problem with regex is this:
Regex syntax is extremely concise, which means a lot of mental overhead to read and write.
Regex has lots of exceptions. Its grammar and rules are inconsistent at best.
It's really hard to split a pattern into multiple lines without ending up with a huge cursed string.
Updating a pattern to accept new behavior, or reusing bits of its logic, is hard.
When it gets large enough, everyone's afraid to touch it.
You will end up with impossible to read and maintain code if you go too far with regex.
On to the good bits
At its core, regex is a powerful way to search and match text based on rules, and to extract information into variables. It can also be used to manipulate strings, but I'm going to avoid that here. Most people do not expect regex to have side effects.
Stuff like pulling out an HTML tag with certain class names, formatting phone numbers, and log parsing are great examples of good places to use regex.
Basic patterns
Take this example of using regex in JS.
const words = [
"Hello world",
"This is a short! message that says \"Hello world\"",
"I love regular expressions"
];
// this is the pattern
const re = new RegExp("ello");
// the pattern can be used to "test" for matches
console.log(words.filter(value => re.test(value)))
The regular expression pattern ello used in re.test() will match any string containing the pattern as a substring. This is the simplest type of pattern.
It will match the following lines:
[ 'Hello world', 'This is a short! message that says "Hello world"' ]
These patterns are case sensitive by default. You can make a pattern case insensitive by passing the i flag: new RegExp("ello", "i").
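Here's a small sketch of the difference the i flag makes. The sample strings are made up for illustration, and the regex literal /ello/i is just a shorter way of writing the same pattern.

```javascript
// Sample strings invented for this illustration.
const greetings = ["HELLO WORLD", "Hello world", "Goodbye"];

// Default: case sensitive, so only a lowercase "ello" matches.
const sensitive = new RegExp("ello");
// With the "i" flag, "ELLO" matches too.
const insensitive = new RegExp("ello", "i");
// A regex literal is an equivalent, shorter way to write the same pattern.
const literal = /ello/i;

const sensitiveMatches = greetings.filter(value => sensitive.test(value));
const insensitiveMatches = greetings.filter(value => insensitive.test(value));
const literalMatches = greetings.filter(value => literal.test(value));

console.log(sensitiveMatches);   // [ 'Hello world' ]
console.log(insensitiveMatches); // [ 'HELLO WORLD', 'Hello world' ]
console.log(literalMatches);     // [ 'HELLO WORLD', 'Hello world' ]
```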
Start and end of text
Regular expressions have "meta characters" that define logical rules for your pattern to match against.
The character ^ means beginning of text and $ means end of text.
For example:
// This will match "Hello, world!" but not "Message: Hello, world!".
const reStart = new RegExp("^Hello");
// This will match "Hello, world" but not "Hello, world!".
const reEnd = new RegExp("world$");
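The two anchors can also be combined. As a rough sketch (the sample inputs are invented), wrapping a pattern in ^ and $ means the whole string has to match, not just a piece of it:

```javascript
// Sample inputs invented for this illustration.
const inputs = ["Hello", "Hello, world!", "Oh, Hello"];

// ^ and $ together: the whole string must be exactly "Hello".
const exact = new RegExp("^Hello$");
// Only ^: the string must start with "Hello", anything may follow.
const startsWith = new RegExp("^Hello");

const exactMatches = inputs.filter(value => exact.test(value));
const prefixMatches = inputs.filter(value => startsWith.test(value));

console.log(exactMatches);  // [ 'Hello' ]
console.log(prefixMatches); // [ 'Hello', 'Hello, world!' ]
```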
Match one of variations
Sometimes you want to match variations of a similar pattern. In the most basic cases, variations of words like fine, pine, and line. In these cases, you can define a group of options in brackets like this: [].
const words = [
"fine",
"pine",
"line!"
];
// This will match all the words above.
const re = new RegExp("[fpl]ine");
console.log(words.filter(value => re.test(value)))
You can use ranges of ASCII characters in these any-of groups like this: [a-zA-Z0-9].
const words = [
"1ine",
"Pine",
"zine!"
];
// still match all the words.
const re = new RegExp("[a-zA-Z0-9]ine");
console.log(words.filter(value => re.test(value)))
An alternative approach is to use |, which represents a logical or to match alternatives.
const words = [
"color",
"colour",
];
const re = new RegExp("color|colour");
console.log(words.filter(value => re.test(value)))
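One thing to watch for: | splits the entire pattern, not just the word next to it. The sketch below (sample words are made up) shows what happens when you mix it with ^ and $, and how parentheses keep the alternatives contained. Parentheses are covered properly in the section on extracting values.

```javascript
// Sample words invented for this illustration.
const samples = ["color", "colour", "colorful", "watercolour"];

// Without parentheses this reads as (^color) OR (colour$).
const loose = new RegExp("^color|colour$");
// Parentheses limit the alternatives, so both anchors apply to each option.
const strict = new RegExp("^(color|colour)$");

const looseMatches = samples.filter(value => loose.test(value));
const strictMatches = samples.filter(value => strict.test(value));

console.log(looseMatches);  // [ 'color', 'colour', 'colorful', 'watercolour' ]
console.log(strictMatches); // [ 'color', 'colour' ]
```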
Wildcard
Sometimes you don't want to specify options, you want to match every variation imaginable. We can use the . character as a wildcard. It matches any single character except a line break.
const words = [
"%ine",
"}ine",
"`ine!"
];
const re = new RegExp(".ine");
console.log(words.filter(value => re.test(value)))
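Because . is a wildcard, it's easy to use it by accident when you meant a real dot. A minimal sketch, with invented file names, of the difference between the wildcard and an escaped literal dot:

```javascript
// File names invented for this illustration.
const files = ["notes.txt", "notes-txt", "notesxtxt"];

// Unescaped: "." stands for any single character.
const wildcard = new RegExp("notes.txt");
// Escaped: "\\." in a string (or \. in a regex literal) matches only a real dot.
const literalDot = new RegExp("notes\\.txt");

const wildcardMatches = files.filter(value => wildcard.test(value));
const literalDotMatches = files.filter(value => literalDot.test(value));

console.log(wildcardMatches);   // [ 'notes.txt', 'notes-txt', 'notesxtxt' ]
console.log(literalDotMatches); // [ 'notes.txt' ]

// By default the wildcard does not match a line break.
const matchesNewline = new RegExp(".ine").test("\nine");
console.log(matchesNewline); // false
```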
Repeating patterns
Sometimes we want to match a character repeatedly. For example, matching every variation of yeet, like yeeeeeeeeeet or yeeeeeeeeeeeeeeeeeeeeeeeet.
We can use + or * for this: * matches the preceding element zero or more times, and + matches the preceding element one or more times.
const words = [
"yeet",
"yeeet",
"yeeeeeeeeet"
];
// matches [ 'yeeet', 'yeeeeeeeeet' ]
const re1 = new RegExp("yeee+t");
// matches [ 'yeet', 'yeeet', 'yeeeeeeeeet' ]
const re2 = new RegExp("yeee*t");
console.log(words.filter(value => re1.test(value)))
console.log(words.filter(value => re2.test(value)))
Use + if you want to match the character at least once. Use * when the character is optional but you still want to match as many repeats as possible.
An interesting side effect is that both can be combined with the wildcard . to match runs of arbitrary characters. Try .* and .+ in your patterns, but be careful: .* will match literally anything, including nothing at all, which makes it very error prone.
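To see why .* bites people, here's a quick sketch with a made-up HTML snippet. The intent is to grab the first tag, but .* keeps going as far as it can. A narrower pattern built from a character range and + behaves the way you'd expect.

```javascript
// HTML snippet invented for this illustration.
const html = "<b>bold</b> and <i>italic</i>";

// ".*" takes as much as it can, so the match runs all the way to the last ">".
const greedy = new RegExp("<.*>");
// A specific range only accepts lowercase letters, so it stops at the first ">".
const specific = new RegExp("<[a-z]+>");

const greedyMatch = greedy.exec(html)[0];
const specificMatch = specific.exec(html)[0];

console.log(greedyMatch);   // <b>bold</b> and <i>italic</i>
console.log(specificMatch); // <b>
```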
Another useful piece of syntax is {}, which specifies the number of times a character or part of a pattern is repeated. For example:
const words = [
"100",
"1011",
"222222"
];
// matches [ '1011' ]
const re = new RegExp("^[0-9]{4}$");
console.log(words.filter(value => re.test(value)))
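The braces also accept a range. A short sketch, using invented sample values, of the three forms {n}, {n,m}, and {n,}:

```javascript
// Sample values invented for this illustration.
const codes = ["7", "42", "123", "12345"];

// {2}: exactly two digits.
const exactlyTwo = new RegExp("^[0-9]{2}$");
// {2,3}: at least two and at most three digits.
const twoToThree = new RegExp("^[0-9]{2,3}$");
// {3,}: three or more digits, with no upper limit.
const threeOrMore = new RegExp("^[0-9]{3,}$");

const exactlyTwoMatches = codes.filter(value => exactlyTwo.test(value));
const twoToThreeMatches = codes.filter(value => twoToThree.test(value));
const threeOrMoreMatches = codes.filter(value => threeOrMore.test(value));

console.log(exactlyTwoMatches);  // [ '42' ]
console.log(twoToThreeMatches);  // [ '42', '123' ]
console.log(threeOrMoreMatches); // [ '123', '12345' ]
```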
Useful macros
There are some metacharacters that behave kinda like macros. These metacharacters are fundamental in constructing regex patterns to match specific text patterns in strings.
Metacharacter | Description | Example Match |
\d | Digit (0-9) | 4, 9, 0 |
\D | Non-digit | a, Z, % |
\w | Word character | a, A, 1, _ |
\W | Non-word character | !, @, # |
\s | Whitespace | space, \t, \n |
\S | Non-whitespace | a, 1, % |
\b | Word boundary | \bword\b, \b123\b |
\B | Non-word boundary | \Bword\B, \B123\B |
\d, \w, and \s are pretty self-explanatory. What I wanna focus on is the \b and \B metacharacters. These are extremely useful when parsing prose, because they respect natural word boundaries. For example, in "hello, boss", hello is an independent word, but it sits at the start of the string and is followed by a comma, which means if you naively matched the pattern \shello\s, the word would be missed. Similarly, naively matching for hello will also match words like phelloplastics.
const words = [
"hello, boss",
"galvanized square steel.",
"dave has saved up a looooonnnggg time for his new prison-esque house."
];
// matches [ 'hello, boss' ]
const re1 = new RegExp("\\bhello\\b");
// matches only the "dave has saved up..." sentence: the "sq" in "prison-esque"
// sits inside a word, while the "sq" in "square" starts one
const re2 = new RegExp("\\Bsq\\B");
console.log(words.filter(value => re1.test(value)))
console.log(words.filter(value => re2.test(value)))
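You may have noticed the doubled backslashes in the example above. That's a quirk of building a pattern from a string, not part of regex itself. Here's a rough sketch using \s, \d, and \w with invented sample lines, written both ways:

```javascript
// Sample lines invented for this illustration.
const lines = ["order 66 shipped", "order sixty-six shipped", "order66shipped"];

// In a string, each backslash must be doubled so it survives string escaping.
const fromString = new RegExp("order\\s\\d+\\s\\w+");
// A regex literal takes the metacharacters exactly as written.
const fromLiteral = /order\s\d+\s\w+/;

const stringMatches = lines.filter(value => fromString.test(value));
const literalMatches = lines.filter(value => fromLiteral.test(value));

console.log(stringMatches);  // [ 'order 66 shipped' ]
console.log(literalMatches); // [ 'order 66 shipped' ]
// Both forms compile to the same pattern.
console.log(fromString.source === fromLiteral.source); // true
```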
Extracting values
Capture groups in regex let you extract values with a pattern. You define a capture group with (<subpattern>), and whatever the subpattern enclosed in () matches is returned alongside the full match.
For example, parsing an email:
let email = "example.user123@example.com";
let r = /^([\w\.-]+)@([\w\.-]+)\.([a-zA-Z]{2,6})$/;
let match = r.exec(email);
if (!match)
throw Error("invalid email")
// exec returns an array: index 0 is the whole match, and each capture group follows in order.
let username = match[1] ?? '';
let domain = match[2] ?? '';
let tld = match[3] ?? '';
console.log("Username:", username);
console.log("Domain:", domain);
console.log("Top-Level Domain (TLD):", tld);
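The same idea works for the phone number case mentioned earlier. This is only a sketch: formatPhone is a hypothetical helper, and it assumes a 10-digit US-style number with optional dashes, dots, or spaces between the parts. It builds a new string from the captured pieces instead of using a replace.

```javascript
// Hypothetical helper: assumes a 10-digit US-style number with optional
// "-", ".", or " " separators. Returns null when the input doesn't fit.
function formatPhone(raw) {
  const phonePattern = /^(\d{3})[-. ]*(\d{3})[-. ]*(\d{4})$/;
  const match = phonePattern.exec(raw);
  if (!match) {
    return null;
  }
  // Skip index 0 (the whole match) and take the three capture groups.
  const [, area, prefix, line] = match;
  return `(${area}) ${prefix}-${line}`;
}

console.log(formatPhone("555-867-5309")); // (555) 867-5309
console.log(formatPhone("5558675309"));   // (555) 867-5309
console.log(formatPhone("12345"));        // null
```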
What's even cooler is that capture groups can be named for more readable patterns, and with the g flag a pattern can match multiple times. For example, here we can extract multiple emails:
let emails = "example.user123@example.com example.user123@example.com example.user123@example.com";
// the g flag is required to find every match instead of only the first one
let pattern = /(?<username>[\w\.-]+)@(?<domain>[\w\.-]+)\.(?<tld>[a-zA-Z]{2,6})/g;
let matches = [...emails.matchAll(pattern)];
if (matches.length === 0)
throw Error("no emails found")
// loop over all matches
for (const match of matches) {
// named groups are available by name on match.groups
let { username, domain, tld } = match.groups;
console.log("Username:", username);
console.log("Domain:", domain);
console.log("Top-Level Domain (TLD):", tld);
}
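Log parsing, mentioned earlier as a good fit for regex, follows the same shape. The sketch below assumes a made-up log format of date, level, and message on each line. It adds the m flag so that ^ and $ apply to every line instead of the whole text.

```javascript
// Hypothetical log format: "<date> <LEVEL> <message>", one entry per line.
const log = [
  "2024-01-15 ERROR disk full",
  "2024-01-15 INFO backup started",
  "not a log line",
  "2024-01-16 WARN cpu at 95%",
].join("\n");

// g: find every match. m: let ^ and $ match at each line break.
const linePattern = /^(?<date>\d{4}-\d{2}-\d{2}) (?<level>[A-Z]+) (?<message>.+)$/gm;

// Turn each match into a plain object built from the named groups.
const entries = [...log.matchAll(linePattern)].map(match => ({
  date: match.groups.date,
  level: match.groups.level,
  message: match.groups.message,
}));

console.log(entries.length);     // 3
console.log(entries[0].level);   // ERROR
console.log(entries[2].message); // cpu at 95%
```

Lines that don't fit the format are simply skipped, which is usually what you want when scanning noisy logs.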
Wait, you missed this cool thing!
If you already know a lot of regex, great! I know everyone has their favorite little tricks with regex. The point of this post is to remove the fear many developers feel when they see regex in code. I think the subset of regex introduced in this post gives you more than enough to be powerful and literate in regex, but not enough to become abusive.
Of course, if you feel like there's something cool others should know that I missed, leave it in the comments!
Cool stuff you can do
I've seen engineers do crazy things with grep. At an old job where we wrote real-time operating systems (ancient, 30-year-old code bases), I saw entire chunks of our build process written in sed and awk, which rely heavily on regex.
Another cool thing you can do is write your own linters. I work at Trunk, and we make a thing called Trunk Check where you can write grep linters in less than a minute if you know your regex patterns.
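To give a feel for the idea, here's a tiny grep-style linter in plain JS. This is not how Trunk Check is configured. The rules list and the lint function are hypothetical, just a sketch of pairing patterns with messages and reporting the lines that match.

```javascript
// Hypothetical rules: each one pairs a regex with a message.
const rules = [
  { pattern: /\bconsole\.log\b/, message: "leftover console.log" },
  { pattern: /\bTODO\b/, message: "unresolved TODO" },
];

// Check every line of the source against every rule.
function lint(source) {
  const findings = [];
  source.split("\n").forEach((text, index) => {
    for (const rule of rules) {
      if (rule.pattern.test(text)) {
        // Report 1-based line numbers, like most tools do.
        findings.push({ line: index + 1, message: rule.message });
      }
    }
  });
  return findings;
}

const report = lint("const x = 1;\nconsole.log(x); // TODO remove\nreturn x;");
console.log(report);
// [ { line: 2, message: 'leftover console.log' },
//   { line: 2, message: 'unresolved TODO' } ]
```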
Play around and share the cool/crazy stuff you make with the internet.
Subscribe to my newsletter
Read articles from Vincent Ge directly inside your inbox. Subscribe to the newsletter, and don't miss out.