100 Languages Speedrun: Episode 47: Raku (Perl 6) Regular Expressions

What gets to be included in languages and what gets pushed into some third party library is a result of history more than reason.

For example, pretty much every language comes with every possible trigonometric function included and preloaded, so you can run Math.asinh(69.0) without even requiring anything. When was the last time you wrote a program that needed Math.asinh?

Meanwhile string matching and processing is about the most common thing programs do, but before Perl I don't think any general purpose language included regular expressions. It was either something text processing languages like sed and awk did, or something left to third party libraries. Perl fully embraced them, made them far more powerful, and in the post-Perl era including regular expressions is just a normal natural thing languages do.

There's a similar story with package managers. First they didn't exist. Then they existed as poorly integrated third party tools like Ruby's rubygems and JavaScript's npm. Nowadays, we expect every new language to simply have a fully functioning package manager built in.

Anyway, since the days of Perl, Perl Compatible Regular Expressions, or a very close variant, have been the default regular expression engine for new languages. Nobody serious considers using any of the pre-Perl regular expression systems, with their limitations and irregularities (but if you want to try some, the UNIX grep command is still using stupid 1970s style regexps by default - even a|b is broken!). You can see some comparison chart here, the pre-Perl and post-Perl divide is really apparent, even if they disagree on a few issues.

The only exception seems to be Raku (originally known as Perl 6)(taw.hashnode.dev/100-languages-speedrun-epi..), which decided to just design its own regular expression syntax, and this sublanguage is what this episode is about.

I found a massive bug in Raku Regular Expressions

Before I start, let's just get it out of the way - Raku has a massive design bug in its regular expressions:

#!/usr/bin/env raku

sub is_small_int($n) {
  !!($n ~~ /^ \d ** {1..6} $ /)
}

my @examples = (
  # Correctly True
  '0',
  '0001',
  '12345',
  # Correctly False
  '-17',
  '1234567',
  '3.14',
  # Not ASCII digits
  '๓๓๓',
  '௫๓௫๓',
  '១๑໑',
);

for @examples -> $n {
  say "is_small_int($n) = ", is_small_int($n);
}

What it does:

./raku_bug.raku
is_small_int(0) = True
is_small_int(0001) = True
is_small_int(12345) = True
is_small_int(-17) = False
is_small_int(1234567) = False
is_small_int(3.14) = False
is_small_int(๓๓๓) = True
is_small_int(௫๓௫๓) = True
is_small_int(១๑໑) = True

The last three examples are just plain incorrect.

The bug didn't take long to find - documentation for Raku regular expressions literally says this broken way is how \d works in Raku.

So why am I saying it's a bug? Because regular expressions need to be able to process computer data, and matching a digit (ASCII 0 to 9) is about the second most common thing to do after matching literal characters. In the entire history of regular expressions, I don't know if there was even one case when someone actually wanted to match Unicode digits, and it's not like their job was hard, Raku has zero problems matching by Unicode properties:

#!/usr/bin/env raku

sub is_unicode_digits($n) {
  !!($n ~~ /^ <:N> ** {1..6} $ /)
}

my @examples = (
  # Correctly True
  '0',
  '0001',
  '12345',
  # Correctly False
  '-17',
  '1234567',
  '3.14',
  # Non ASCII digits
  '๓๓๓',
  '௫๓௫๓',
  '១๑໑',
);

for @examples -> $n {
  say "is_unicode_digits($n) = ", is_unicode_digits($n);
}

If I said that wanting to match [0-9] is approximately a MILLION times more common than wanting to match <:N>, I'd be massively understating my case. We might be dividing by zero and getting an infinity here.

And just to show that I'm right, the very same document that defines how \d works then proceeds to casually assume \d will match ASCII digits 0 to 9, with examples such as:

/ <[\d] - [13579]> /;
s/ (\d+)\-(\d+)\-(\d+) /today/;
s/ (\d+)\-(\d+)\-(\d+) /$1-$2-$0/;
my regex ipv4-octet { \d ** 1..3 <?{ True }> }
my regex number { \d+ [ \. \d+ ]? }

So without any doubt, Raku \d and \D are completely 100% broken, and hopefully they fix it. A broken \d means that either basically every regular expression will be incorrect and potentially introduce security vulnerabilities, or people will learn to avoid \d and use the extremely verbose <[0..9]> instead.
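
If you go the "avoid \d" route, one way to make it less painful is to give the ASCII class a name once and reuse it. This is just a sketch of that workaround, not something the Raku documentation recommends; the names ascii-digit and is-small-ascii-int are made up for this example.

```raku
#!/usr/bin/env raku

# Hypothetical helper: a named regex wrapping the ASCII-only class <[0..9]>
my regex ascii-digit { <[0..9]> }

# Same check as is_small_int above, but built on the ASCII-only class
sub is-small-ascii-int($n) {
  so $n ~~ / ^ <ascii-digit> ** 1..6 $ /
}

for '0', '12345', '1234567', '๓๓๓' -> $n {
  say "is-small-ascii-int($n) = ", is-small-ascii-int($n);
}
```

With this version the Thai digits should be rejected, the same as '1234567' is.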

This is not a trivial problem. From a quick grep for regexps in a few codebases in a few languages, \d is indeed the most ubiquitous regexp escape code, and it's supposed to mean ASCII digits 0 to 9 every single time.

Raku didn't even come up with this bug, another language was doing the same broken thing before. It's still 100% unquestionably broken.

Regular expression basics

Anyway, now that we got it out of the way, let's talk about Raku regular expressions basics.

Traditional regular expressions really overloaded a few special characters and their combinations to mean so many different things, so when a new feature was added, it had to use more and more nasty combinations of the same few special characters. Raku does a big restart, making some common regular expressions more verbose, but now it has a lot more syntax to work with.

As expected, regular expressions go between slashes //. You can match them with ~~. A few common operations like substitution s/// have extra syntax too.

#!/usr/bin/env raku

my $s = "Hello, World!";
say "We are saying Hello" if $s ~~ /Hello/;

$_ = "Hello, World!";
say "We are saying Hello" if m/Hello/;

# Spaces are ignored by default on the regexp side
# but not on substitution side
$_ = "Hello, World!";
s/ World /Alice/;
say $_;

# :i for case insensitive
$_ = "Hello, World!";
s:i/ world /Alice/;
say $_;

my $n = "Alice";
say "It is Alice" if $n ~~ regex {
  ^       # start of string
  (A | a) # lower or upper case A
  l       # lower case l
  i       # lower case i
  c       # lower case c
  e       # lower case e
  $       # end of string
}

There are a few obvious changes:

  • spaces are ignored by default, so you can make regular expressions a lot more readable, with spacing, comments, and so on
  • switches go at the beginning, not the end
  • ^ and $ are start and end of string, with no line stuff, and that's honestly a much more sensible default than the complex rules traditional regular expressions had
  • many of the common things like | and () work just the same
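
A small sketch of the anchor and switch points from the list above. The ^^ and $$ line anchors are not used elsewhere in this episode, I'm adding them here only to show where the "line stuff" went; the sample strings are made up.

```raku
#!/usr/bin/env raku

my $text = "first line\nsecond line";

# ^ only matches at the very start of the string, so this fails
say so $text ~~ / ^ second /;

# ^^ and $$ are the explicit start-of-line and end-of-line anchors
say so $text ~~ / ^^ second /;
say so $text ~~ / first \s line $$ /;

# Switches like :i go in front of the pattern
say so "HELLO" ~~ m:i/ hello /;
```

This should print False for the first check and True for the other three.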

Character classes

Raku decided that the very common task of non-capturing grouping should get [foo] instead of (?:foo). This meant that character classes now needed something more verbose, so [0-9] is now <[0..9]>.

#!/usr/bin/env raku

my $number_regexp = rx/
  ^
  '-'?
  <[0..9]>+
  [
    '.'
    <[0..9]>+
  ]?
  $
/;

my @examples = (
  # Numbers
  '0004',
  '-123',
  '1234.5678',
  '-3.14',
  # Not numbers
  '1.2.3',
  '.8',
  '-5.',
  '๓๓๓',
  '௫๓௫๓',
  '១๑໑',
);

for @examples -> $n {
  say $n, ($n ~~ $number_regexp) ?? " is a number" !! " is NOT a number";
}
$ ./classes.raku
0004 is a number
-123 is a number
1234.5678 is a number
-3.14 is a number
1.2.3 is NOT a number
.8 is NOT a number
-5. is NOT a number
๓๓๓ is NOT a number
௫๓௫๓ is NOT a number
១๑໑ is NOT a number
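
To see the difference between ( ) and [ ] grouping, here's a minimal sketch. The input string and the variable name $m are just for illustration.

```raku
#!/usr/bin/env raku

# ( ) captures the year, [ ] only groups the month without capturing it
my $m = '2021-11' ~~ / ^ (<[0..9]>+) '-' [ <[0..9]>+ ] $ /;

say ~$m[0];          # text of the first (and only) positional capture
say $m.list.elems;   # number of positional captures
```

Only the parenthesized part ends up as a positional capture, so this should print 2021 and then 1.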

Character class operations

This makes it possible to do some operations on character classes, like + (already possible with traditional regexp with just concatenation) and - (not directly doable).

#!/usr/bin/env raku

# Some letters are too easy to confuse with numbers, filter them out
my $nice_letter_rx = rx/ ^ <[A..Z] + [a..z] - [lIO] > $/;

my @examples = ('a'..'z', 'A'..'Z', '0'..'9').flat;

for @examples -> $c {
  say $c, " is not a nice letter" unless $c ~~ $nice_letter_rx;
}
./classes_math.raku
l is not a nice letter
I is not a nice letter
O is not a nice letter
0 is not a nice letter
1 is not a nice letter
2 is not a nice letter
3 is not a nice letter
4 is not a nice letter
5 is not a nice letter
6 is not a nice letter
7 is not a nice letter
8 is not a nice letter
9 is not a nice letter

Repetition

Traditionally repetition of A to B times used {A,B} syntax. Raku syntax is more verbose but it has more features. Let's start with the basic case. Also notice how special characters generally need to be quoted if you want to use them literally.

#!/usr/bin/env raku

my $rx = rx/
  ^
  <[0..9]> ** {1..3}
  '.'
  <[0..9]> ** {1..3}
  '.'
  <[0..9]> ** {1..3}
  '.'
  <[0..9]> ** {1..3}
  $
/;

my @examples = (
  '127.0.1',
  '8.8.8.8',
  '127.0.0.420',
  '127.0.0.9001',
);

for @examples -> $n {
  say $n, ($n ~~ $rx) ?? " looks like IP address" !! " does NOT look like IP address";
}
$ ./repetition.raku
127.0.1 does NOT look like IP address
8.8.8.8 looks like IP address
127.0.0.420 looks like IP address
127.0.0.9001 does NOT look like IP address
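
The ** quantifier also works without the braces when the count is a literal. A quick sketch of the forms I'd expect to reach for most often (exact count, range, and open-ended range); the test strings are made up.

```raku
#!/usr/bin/env raku

# Exactly 3 digits
say so '123'   ~~ / ^ <[0..9]> ** 3 $ /;
# Between 2 and 4 digits
say so '1234'  ~~ / ^ <[0..9]> ** 2..4 $ /;
say so '12345' ~~ / ^ <[0..9]> ** 2..4 $ /;
# 2 or more digits
say so '12345' ~~ / ^ <[0..9]> ** 2..* $ /;
```

Only the third check (five digits against 2..4) should come out False.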

Raku supports "repetition with separator" syntax: X ** {2..4} % Y means 2-4 Xs, with Ys in between them:

#!/usr/bin/env raku

my $rx = rx/
  ^
  [ <[0..9]> ** {1..3} ] ** 4 % '.'
  $
/;

my @examples = (
  '127.0.1',
  '8.8.8.8',
  '127.0.0.420',
  '127.0.0.9001',
);

for @examples -> $n {
  say $n, ($n ~~ $rx) ?? " looks like IP address" !! " does NOT look like IP address";
}

This is especially useful if the thing matched is more complex. How many times have you wished you were able to do something like this?

#!/usr/bin/env raku

my $rx = rx/
  ^
  [
  | <[0..9]>            # 0-9
  | <[1..9]> <[0..9]>   # 10-99
  | 1 <[0..9]> ** 2     # 100-199
  | 2 <[0..4]> <[0..9]> # 200-249
  | 25 <[0..5]>         # 250-255
  ] ** 4 % '.'
  $
/;

my @examples = (
  '127.0.1',
  '8.8.8.8',
  '127.0.0.420',
  '127.0.0.9001',
);

for @examples -> $n {
  say $n, ($n ~~ $rx) ?? " looks like IP address" !! " does NOT look like IP address";
}

Notice extra validation:

$ ./ipv4.raku
127.0.1 does NOT look like IP address
8.8.8.8 looks like IP address
127.0.0.420 does NOT look like IP address
127.0.0.9001 does NOT look like IP address

There's also %% which allows for an optional trailing delimiter.
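
A minimal sketch of the difference between % and %%, using a comma separated list of numbers. The variable names $strict and $loose are made up for this example.

```raku
#!/usr/bin/env raku

# % requires the separator strictly between items
my $strict = rx/ ^ [ <[0..9]>+ ]+ % ',' $ /;
# %% also accepts one trailing separator
my $loose  = rx/ ^ [ <[0..9]>+ ]+ %% ',' $ /;

for '1,2,3', '1,2,3,' -> $s {
  say $s, " strict: ", so($s ~~ $strict), " loose: ", so($s ~~ $loose);
}
```

Both accept '1,2,3', but only the %% version should accept '1,2,3,'.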

In ( a | b | c ) or [ a | b | c ] alternation you can put an extra initial | for formatting and it is ignored (it does not match empty).

Divides by 3

Regular expressions can be recursive with <~~>.

Let's do something that's a lot more difficult with traditional regexps, checking if a number divides by 3:

#!/usr/bin/env raku

my $divides_by_three_rx_part = rx/
  [
  | <[0369]>                              # 0
  | <[147]> <~~>? <[258]>                 # 1+2
  | <[147]> <~~>? <[147]> <~~>? <[147]>   # 1+1+1
  | <[258]> <~~>? <[147]>                 # 2+1
  | <[258]> <~~>? <[258]> <~~>? <[258]>   # 2+2+2
  ]
  <~~>?
/;
my $divides_by_three_rx = /^ $divides_by_three_rx_part $/;

for 1234560..1234579  {
  say $_, ($_ ~~ $divides_by_three_rx) ?? " divides by 3" !! " does NOT divide by 3";
}
$ ./divisible_by_three.raku
1234560 divides by 3
1234561 does NOT divide by 3
1234562 does NOT divide by 3
1234563 divides by 3
1234564 does NOT divide by 3
1234565 does NOT divide by 3
1234566 divides by 3
1234567 does NOT divide by 3
1234568 does NOT divide by 3
1234569 divides by 3
1234570 does NOT divide by 3
1234571 does NOT divide by 3
1234572 divides by 3
1234573 does NOT divide by 3
1234574 does NOT divide by 3
1234575 divides by 3
1234576 does NOT divide by 3
1234577 does NOT divide by 3
1234578 divides by 3
1234579 does NOT divide by 3

We still needed to do that in two parts, as the anchors can't be part of the recursion. I'm not sure if it's possible to do it with some : modifier, none of them seem to match.

FizzBuzz

This lets us do the holy grail of regular expressions, the FizzBuzz regexp! For comparison, we did it with traditional regexp back in the Sed episode(taw.hashnode.dev/100-languages-speedrun-epi..), and it was far more complex and completely unreadable. This one makes a lot of sense.

We just need one really useful feature - a way to require that two regexps both match. / A && B / matches if both A and B match. In this case we have regexps for divisible by 3 and a very simple one for divisible by 5. Thanks to && it's really easy to get divisibility by 15 from them.

#!/usr/bin/env raku

my $rx3_part = rx/
  [
  | <[0369]>                              # 0
  | <[147]> <~~>? <[258]>                 # 1+2
  | <[147]> <~~>? <[147]> <~~>? <[147]>   # 1+1+1
  | <[258]> <~~>? <[147]>                 # 2+1
  | <[258]> <~~>? <[258]> <~~>? <[258]>   # 2+2+2
  ]
  <~~>?
/;
my $rx3 = /^ $rx3_part $/;
my $rx5 = /^ <[0..9]>* <[05]> $/;
my $rx15 = / $rx3 && $rx5 /;

for 1..100 -> $n {
  # In Raku we need to convert Int to Str, otherwise can't s/// it
  # In Perl it would magically change type for us
  $_ = "$n";
  s/^ $rx15 $/FizzBuzz/;
  s/^ $rx5 $/Buzz/;
  s/^ $rx3 $/Fizz/;
  say $_;
}
$ ./fizzbuzz.raku
1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
16
17
Fizz
19
Buzz
...
Fizz
97
98
Fizz
Buzz
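
The && trick isn't limited to FizzBuzz. Here's a sketch using the same pattern as above (two fully anchored regexps combined with &&) for a made-up rule: 6 to 10 lowercase letters or digits, with at least one digit. The variable names and the rule itself are just for illustration.

```raku
#!/usr/bin/env raku

my $rx_length = /^ <[a..z 0..9]> ** 6..10 $/;
my $rx_digit  = /^ .* <[0..9]> .* $/;
# Both anchored regexps must match the same string
my $rx_both   = / $rx_length && $rx_digit /;

for 'abc123', 'abcdefg', 'a1' -> $s {
  say $s, ($s ~~ $rx_both) ?? " is OK" !! " is NOT OK";
}
```

Only 'abc123' should pass: 'abcdefg' has no digit, and 'a1' is too short.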

Should you use Raku Regular Expressions?

Regular expressions have a lot of features, so I could keep going, but these are likely the features you'd use the most.

I think most of the changes are sensible. Allowing free spacing and comments by default was much needed (most languages have some kind of support for it with //x etc., but because x goes at the end this causes a lot of parser confusion). Changing ^$ to be just the start and the end of the string with no special logic was a great change. && was much needed, ** % and ** %% are very clever shortcuts for something very common, recursion can simplify a lot of regexps, [] for non-capturing grouping is quite nice, and so on.

Of course all this needs to be balanced by \d being completely broken, and \d is about the most commonly used regex feature. The good thing is that it can be fixed in a 100% compatible way! Just make \d match 0 to 9 and nothing else. Not only will it not break any software, as nobody in history ever relied on this broken \d behavior, but it will likely fix a lot of bugs, and likely many security vulnerabilities as well.

It either gets fixed, or you'd need to keep telling people to never ever use \d, and good luck with that.

So if you're designing a new language and its regular expression system, you should definitely consider making changes similar to what Raku did. But keep \d correct please.

Also, this is likely not going to be the final Raku episode, as Raku Grammars are another sublanguage I want to cover in this series.

Code

All code examples for the series will be in this repository.

Code for the Raku Regular Expressions episode is available here.