Open Source Adventures: Episode 81: Exploring Raku Regular Expression API
In the previous three episodes I explored the regular expression APIs of Ruby, Crystal, and Python, so let's finish by doing the same exercise in Raku.
The problem is the same: there are multiple date formats, and we want to extract information from whichever one matches.
I'm doing it with just 3 regular expressions, but in the real world there could be hundreds. Doing it naively with a list of regular expressions would require massive code duplication and a lot of calls to the regular expression engine, which is generally dramatically slower than matching a|b|c|... once.
The Problem
#!/usr/bin/env raku
use JSON::Fast;
for qw[2015-05-25 2016/06/26 27/07/2017] {
  say to-json(parse_date($_), :!pretty)
}
And the expected output is:
[2015,5,25]
[2016,6,26]
[2017,7,27]
Solution 1
sub parse_date($s) {
  if $s ~~ /(\d\d\d\d)\-(\d\d)\-(\d\d)/ {
    [+$0, +$1, +$2]
  } elsif $s ~~ /(\d\d\d\d)\/(\d\d)\/(\d\d)/ {
    [+$0, +$1, +$2]
  } elsif $s ~~ /(\d\d)\/(\d\d)\/(\d\d\d\d)/ {
    [+$2, +$1, +$0]
  }
}
The first way is to just list every possible regular expression sequentially.
In case you're not familiar with Raku, here are a few minor things to notice:
- match groups are numbered from $0, not from $1
- these variables aren't String type like in other languages, they're Match
- we can't to-json a Match object without some kind of conversion, either to a string or to a number, so returning [$0, $1, $2] would just crash
- + converts to a number
- match operator is ~~, not =~
- \d is any Unicode digit, not just 0 to 9
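The notes above can be checked with a short sketch. The sample strings, including the Arabic-Indic digit, are illustrative choices and not part of the original examples.

```raku
# Captures are Match objects; ~ and + convert them explicitly.
if "2015-05-25" ~~ /(\d\d\d\d)\-(\d\d)\-(\d\d)/ {
    say $0.^name;      # Match
    say (~$0).^name;   # Str
    say (+$1).^name;   # Int
    say +$1;           # 5
}

# \d accepts any Unicode decimal digit; <[0..9]> restricts to ASCII digits.
say so "٣" ~~ /\d/;         # True
say so "٣" ~~ /<[0..9]>/;   # False
```

If only ASCII digits should be accepted, a character class like <[0..9]> is one way to say so.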
There's also a bigger thing to consider: in other languages, unknown punctuation like / or - can be used in a regular expression literally. In retrospect this was a mistake, as it prevents adding new regular expression syntax without breaking backwards compatibility. Raku forces escaping every such punctuation character, even ones currently unused, so at some point in the future it can give - or / some meaning.
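Backslashes are not the only way to satisfy this rule: Raku regexes also match quoted text literally. A minimal sketch, where the helper name parse_ymd is made up for illustration:

```raku
# Quoted strings inside a regex match literally, so no backslash is needed.
# Whitespace inside the regex is ignored by default.
sub parse_ymd($s) {
    if $s ~~ / (\d\d\d\d) '-' (\d\d) '-' (\d\d) / {
        [+$0, +$1, +$2]
    }
}

say parse_ymd("2015-05-25");   # [2015 5 25]
```

Whether quoting or escaping reads better is a matter of taste; the rest of this post sticks with backslashes.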
Anyway, just like solution 1 in other languages, this suffers from both problems - code duplication, and poor performance due to sequential match.
Solution 2
sub parse_date($_) {
  if /(\d\d\d\d)\-(\d\d)\-(\d\d)/ {
    [+$0, +$1, +$2]
  } elsif /(\d\d\d\d)\/(\d\d)\/(\d\d)/ {
    [+$0, +$1, +$2]
  } elsif /(\d\d)\/(\d\d)\/(\d\d\d\d)/ {
    [+$2, +$1, +$0]
  }
}
Just like in Perl, we don't need to use ~~: if we use a regular expression in boolean context, it will automatically match against $_.
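The same rule applies anywhere $_ is set, not only in a sub that names its parameter $_. A small sketch using a for loop over arbitrary sample strings:

```raku
my @years;
for "2016/06/26", "hello", "2015/05/25" {
    # No ~~ needed: `for` sets $_, and the bare regex matches against it.
    if /(\d\d\d\d)\/(\d\d)\/(\d\d)/ {
        @years.push(+$0);
    }
}
say @years;   # [2016 2015]
```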
Solution 3
sub parse_date($_) {
  if /(\d\d\d\d)\-(\d\d)\-(\d\d)/ or /(\d\d\d\d)\/(\d\d)\/(\d\d)/ {
    [+$0, +$1, +$2]
  } elsif /(\d\d)\/(\d\d)\/(\d\d\d\d)/ {
    [+$2, +$1, +$0]
  }
}
We can reduce code duplication if groups are in the same order.
Solution 4
sub parse_date($_) {
  if /(\d\d\d\d)\-(\d\d)\-(\d\d) | (\d\d\d\d)\/(\d\d)\/(\d\d)/ {
    [+$0, +$1, +$2]
  } elsif /(\d\d)\/(\d\d)\/(\d\d\d\d)/ {
    [+$2, +$1, +$0]
  }
}
What's going on here? In Raku $0, $1, $2 don't mean the Nth match group in the whole expression; they mean the Nth match group in the branch that actually matched, because capture numbering restarts in each | alternative (the full story is more complicated)!
This is great, as we can put in as many expressions as we want without any nonsense like +($0 or $3 or $6).
On the other hand, this means we can't fold in alternatives whose groups are in a different order.
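One way to see the renumbering is to count the positional captures of a successful match. In this sketch (the helper name count_groups is hypothetical), both branches report three groups even though the pattern contains six pairs of parentheses:

```raku
sub count_groups($_) {
    if /(\d\d\d\d)\-(\d\d)\-(\d\d) | (\d\d\d\d)\/(\d\d)\/(\d\d)/ {
        # $/ is the match object; .list holds its positional captures.
        $/.list.elems
    }
}

say count_groups("2015-05-25");   # 3
say count_groups("2016/06/26");   # 3
```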
Solution 5
sub parse_date($_) {
  if /(\d\d\d\d)\-(\d\d)\-(\d\d) |
      (\d\d\d\d)\/(\d\d)\/(\d\d) |
      (\d\d)\/(\d\d)\/(\d\d\d\d)/ {
    [+$0, +$1, +$2]
  }
}
This doesn't actually work. The block can't tell which branch matched, so it doesn't know if the groups are in YMD or DMY order.
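To make the failure concrete, here is the same function under a different name (parse_date_broken is used only for this illustration), run on all three inputs. The last date comes back in day, month, year order:

```raku
sub parse_date_broken($_) {
    if /(\d\d\d\d)\-(\d\d)\-(\d\d) |
        (\d\d\d\d)\/(\d\d)\/(\d\d) |
        (\d\d)\/(\d\d)\/(\d\d\d\d)/ {
        # Every branch fills $0..$2, so the order information is lost here.
        [+$0, +$1, +$2]
    }
}

say parse_date_broken("2015-05-25");   # [2015 5 25]
say parse_date_broken("2016/06/26");   # [2016 6 26]
say parse_date_broken("27/07/2017");   # [27 7 2017]
```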
Solution 6
sub parse_date($_) {
  if /
    $<y>=(\d\d\d\d) \- $<m>=(\d\d) \- $<d>=(\d\d) |
    $<y>=(\d\d\d\d) \/ $<m>=(\d\d) \/ $<d>=(\d\d) |
    $<d>=(\d\d) \/ $<m>=(\d\d) \/ $<y>=(\d\d\d\d)
  / {
    [+$<y>, +$<m>, +$<d>]
  }
}
We can solve this with named captures, and it works great.
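Named captures can be looked up like hash entries on the match object. A small sketch using just the day-first branch on its own (the input string is only an example):

```raku
if "27/07/2017" ~~ / $<d>=(\d\d) \/ $<m>=(\d\d) \/ $<y>=(\d\d\d\d) / {
    say $<y>.^name;          # Match
    say +$<y>;               # 2017
    say $/.hash.keys.sort;   # (d m y)
}
```

Because the lookup is by name, the block no longer cares where in the string each part appeared.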
Story so far
The syntax and API are both very different from traditional regular expressions, but in the end we got everything we needed.
Coming next
I was planning to write another post on how we could improve regular expression APIs, but it turns out the APIs we explored (except Python's disappointingly limited one) actually have features that cover most of what I wanted to say, even if they're rarely used in the real world yet.
So the next episode will be about something completely different.
Written by
Tomasz Węgrzanowski
I know literally all programming languages.