Open Source Adventures: Episode 77: Improving Regular Expressions API

The goal of parsing is turning unstructured text input into some useful data.

The most popular parsing system is regular expressions. Unfortunately APIs to integrate regular expression libraries with their host programming languages are stuck in the past, and the best you can get from them is a fixed list of matched substrings (or match indexes).

I'll be using Ruby for code examples, but the idea applies to most programming languages.

How it works currently

Let's start with a simple regular expression for matching a list of numbers separated by commas:

/(\d+)(?:,(\d+))*/

There are three groups here. The first group matches a number and goes into $1. The second group is non-matching (marked with ugly (?:...) - that's one of the things Raku changed). The third one matches a number and goes into $2.

If we try to use it "1,2,3,4,5" =~ /(\d+)(?:,(\d+))*/ what we'd is $1 being "1" and $2 being "5".

That $2 group actually matched four times - matching "2", "3", "4", and "5", however with current API we only get the last match, and all others are thrown away. We'd like to use this information.

Current workarounds

This kind of parsing happens all the time, so there are some well established workarounds.

The most common workaround is parsing in multiple stages. So first we'd extract whole groups like /(\S+): (\d+(?:,\d+)*)/ to turn "Bob: 10,20,30" into "Bob" and "10,20,30", then process that second part to turn it into ["10", "20", "30"].

Then once we only have the relevant part, we can extract list of matches with a loop, with something like .scan to get all matches (or its //g equivalent in many other languages), or using a special purpose functions like .split for it.

Wouldn't it be nice if we didn't need any such workarounds?

Return list of all matches

The first change we could try is just return every match for every () group. If first () matched once, and () matched four times, then first one gets returned as $1, second as $2, $3, $4, and $5.

In this version instead of this ($~.to_a returns full match, then all numbered matches):

"1,2,3,4,5" =~ /(\d+)(?:,(\d+))*/
p $1 # => "1"
p $2 # => "5"
p $~.to_a # => ["1,2,3,4,5", "1", "5"]

We'd have this:

"1,2,3,4,5" =~ /(\d+)(?:,(\d+))*/
p $1 # => "1"
p $2 # => "2"
p $3 # => "3"
p $4 # => "4"
p $5 # => "5"
p $~.to_a # => ["1,2,3,4,5", "1", "2", "3", "4", "5"]

This has an advantage of having the same API, and it covers at least some additional cases.

Unfortunately we'd also lose some functionality. The current API lets us know which alternative matched:

"hello" =~ /([a-z]+)|(\d+)/
$~.to_a #=> ["hello", "hello", nil]
"420" =~ /([a-z]+)|(\d+)/
$~.to_a # => ["420", nil, "420"]

Changing matches to be only the ones that actually happened would not let us distinguish between these cases.

This API also wouldn't be able to handle more complex cases. If we had two such lists in one regexp, we would have no way of knowing which of the returned matches comes from which list.

Return all matches at any given location

OK, so what if we abandoned current API, and instead of each match variable being a String or a nil, we'd have it be a list of all matches, if it matched multiple times.

"1,2,3,4,5" =~ /(\d+)(?:,(\d+))*/
p $1 # => "1"
p $2 # => ["2", "3", "4", "5"]
p $~.to_a # => ["1,2,3,4,5", "1", ["2", "3", "4", "5"]]

This API would cover a lot of extra cases, including any number of such lists, and it wouldn't break current | use case.

There's a bit of a type issue here. What should these return? For 2+ elements it's obviously an array, but there are two options for 0 and for 1 elements:

"1" =~ /(\d+)(?:,(\d+))*/ returns ["1", nil] or ["1", []]?
"1,2" =~ /(\d+)(?:,(\d+))*/ returns ["1", "2"] or ["1", ["2"]]?
"1,2,3" =~ /(\d+)(?:,(\d+))*/ returns ["1", ["2", "3"]]

To preserve current behavior for alternatives and for ()?, we'd need to return nil, "2", ["2", "3"], but that's really awkward API to use as you need to deal with 3 types.

Doing some static analysis of the regex can determine if group can have multiple matches or not, in that case, we'd return [], ["2"], and ["2", "3"].

But exact analysis is not trivial. For example (\d+)? would return nil or String, but since (\d+){0,4} would return array of Strings, what would (\d){0,1} return? And these qualifiers can be quite deeply nested. I think it's a solvable problem, but it would definitely add some additional complexity.

New variables for multiple matches

We could also have separate variables for $2 (always nil or String as it is now) vs $~2 (always array of Strings). That would actually be the most backwards compatible solution, the downside is that we'd need a bunch of awkward new special variables:

"1,2,3,4,5" =~ /(\d+)(?:,(\d+))*/
p $1 # => "1"
p $2 # => "5"
p $~1 # => ["1"]
p $~2 # => ["2", "3", "4", "5"]

Back references

Either way, back references like \1 need to match only the most recent contents of given match group, even in nested situation.

A simple example of back reference is a regular expression for a quoted string, which can use either single or double quotes, but it needs same quote at start and end: (['"])(.*?)\1.

Putting the whole thing in a ()* group needs to still work:

%q["one",'two',"three"] =~ /((?:(["'])(.*?)\2)(?:,|$))*/

Right now $3 would return just "three", but with this change $3 or $~3 would return all matches ["one", "two", "three"]. And while $2 or $~2 would return an array of matches, \2 must mean only the most recent one, otherwise we have no way to express this case.

Are these still regular expressions?

Yes. Nothing about this proposal affects their regularity. They match exactly the same strings they already do - they just provide some additional information about the match.

Coming next

In the next episode I'll describe a more radical proposal. And after that we'll take a look at how Raku does it.