Open Source Adventures: Episode 78: Exploring Ruby Regular Expression API

It's easy to check if a string matches a regular expression, it's easy s =~ /rx/.

To check if it matches any of multiple regular expressions, it's also really easy - s =~ /rx1|rx2|rx3/. Alternatives are a basic feature of every regular expression engine.

Extracting information about a match is also really easy. If we want to extract year, month, and date from a year we can just do s =~ /(\d\d\d\d)-(\d\d)-(\d\d)/.

OK, what if you want to do both - there's multiple date formats, and you just want to extract information from whichever one matches. That's surprisingly difficult without serious code duplication.

The Problem

As I'm just exploring the API not solving a real problem, I picked a toy example - three date formats with very simple patterns, and the function needs to return array of matches, converted into numbers.

Here's the test case:

%W[
  2015-05-25
  2016/06/26
  27/07/2017
].each do |s|
  p parse_date(s)
end

And the expected output of this program is:

[2015, 5, 25]
[2016, 6, 26]
[2017, 7, 27]

This problem is actually real when parsing, and there's usually a lot more than 3 regular expressions, and they're a lot less symmetric than here.

For sake of presentation I skipped anchors \A \z these regular expressions would normally have.

To avoid backslashes I'm using %r[] syntax rather than // syntax.

Solution 1

def parse_date(s)
  case s
  when %r[(\d\d\d\d)-(\d\d)-(\d\d)]
    [$1.to_i, $2.to_i, $3.to_i]
  when %r[(\d\d\d\d)/(\d\d)/(\d\d)]
    [$1.to_i, $2.to_i, $3.to_i]
  when %r[(\d\d)/(\d\d)/(\d\d\d\d)]
    [$3.to_i, $2.to_i, $1.to_i]
  end
end

The simplest thing to do is just list all the cases. We can do case s when /rx/ instead of chaining elsifs, that's usually cleaner.

There are two problems with this.

First, we really want to use just one regexp, as there could potentially be hundreds of patterns, and doing hundreds of checks is slow, while doing one regexp check with long list of alternatives is usually fast. Regular expression performance is not very predictable (that's why regular expression performance issues come up constantly with DoS CVEs), but single regular expression could be fast, while trying a hundred of them is guaranteed to be slow.

Second, we have a lot of code duplication here. The first two branches are identical, the other is flipped backwards, but the code is doing basically the same thing.

Solution 2

def parse_date(s)
  case s
  when %r[(\d\d\d\d)-(\d\d)-(\d\d)], %r[(\d\d\d\d)/(\d\d)/(\d\d)]
    [$1.to_i, $2.to_i, $3.to_i]
  when %r[(\d\d)/(\d\d)/(\d\d\d\d)]
    [$3.to_i, $2.to_i, $1.to_i]
  end
end

This one doesn't have anything to do with regular expressions, Ruby just lets us list multiple regular expressions (or any matchable objects) under single when, so we reduce code duplication.

This code is not too bad, and if it's not performance critical, this is where most Ruby developers would likely stop.

Solution 3

def parse_date(s)
  case s
  when %r[(\d\d\d\d)([/-])(\d\d)\2(\d\d)]
    [$1.to_i, $3.to_i, $4.to_i]
  when %r[(\d\d)/(\d\d)/(\d\d\d\d)]
    [$3.to_i, $2.to_i, $1.to_i]
  end
end

We can replace the first two regular expressions by one, using back-references. This only works because these two regular expressions are really similar, and not in general case.

Back-references also compromise regular expression performance. "Regular expression" with back-references is actually no longer "regular" in computer science sense, and many regular expression engines switch to slower strategies if any back-references are present. I don't think Ruby engine does that.

Back-references are also a lot less readable than alternatives. For this one it's not too bad, but for something more complicated I'd rather see the list of alternatives.

So this is pretty much a dead end.

Solution 4

def parse_date(s)
  case s
  when %r[(\d\d\d\d)-(\d\d)-(\d\d)|(\d\d\d\d)/(\d\d)/(\d\d)]
    [($1 || $4).to_i, ($2 || $5).to_i, ($3 || $6).to_i]
  when %r[(\d\d)/(\d\d)/(\d\d\d\d)]
    [$3.to_i, $2.to_i, $1.to_i]
  end
end

Well, why don't we just use alternative /rx1|rx2/ like we'd do with matching?

The problem is that matches are numbered from start of the regular expression, so either $1,$2,$3 or $4,$5,$6 will match, the others being nil. This is a technique I use occasionally if there's just two alternatives, but it scales poorly to more complex scenarios.

Solution 5

def parse_date(s)
  case s
  when %r[(\d\d\d\d)-(\d\d)-(\d\d)|(\d\d\d\d)/(\d\d)/(\d\d)|(\d\d)/(\d\d)/(\d\d\d\d)]
    [($1 || $4 || $9).to_i, ($2 || $5 || $8).to_i, ($3 || $6 || $7).to_i]
  end
end

An interesting variant of this is that we can have one code even if alternatives have different order! We could also use if s =~ %r[...] instead of case s when %r[...].

The main problem is that this really doesn't scale. Trying to figure out what's $7 is quite hard, and this is only a few expressions. If you edit expression in the middle, you need to renumber every match variable for every regular expression following it.

Solution 6

def parse_date(s)
  case s
  when
    %r[(?<year>\d\d\d\d)-(?<month>\d\d)-(?<day>\d\d)],
    %r[(?<year>\d\d\d\d)/(?<month>\d\d)/(?<day>\d\d)],
    %r[(?<day>\d\d)/(?<month>\d\d)/(?<year>\d\d\d\d)]
    [$~["year"].to_i, $~["month"].to_i, $~["day"].to_i]
  end
end

In addition to numbered matches, Ruby also supports named matches with (?<name>...). They're available with $~["name"].

This solution has three separate expressions, so it will likely not perform too great, but there's no duplication in the match code - order changes are free.

Solution 7

def parse_date(s)
  case s
  when %r[(?<year>\d\d\d\d)-(?<month>\d\d)-(?<day>\d\d)|(?<year>\d\d\d\d)/(?<month>\d\d)/(?<day>\d\d)|(?<day>\d\d)/(?<month>\d\d)/(?<year>\d\d\d\d)]
    [$~["year"].to_i, $~["month"].to_i, $~["day"].to_i]
  end
end

It turns out that you can have multiple groups share the same name, so one regular expression can deal with all cases.

The problem with this solution is that it results in very long lines with no spacing.

Solution 8

def parse_date(s)
  case s
  when %r[
      (?<year>\d\d\d\d)-(?<month>\d\d)-(?<day>\d\d) |
      (?<year>\d\d\d\d)/(?<month>\d\d)/(?<day>\d\d) |
      (?<day>\d\d)/(?<month>\d\d)/(?<year>\d\d\d\d)
    ]x
    [$~["year"].to_i, $~["month"].to_i, $~["day"].to_i]
  end
end

Regular expressions have problems with readability, which could be easily solved with some spacing and comments, but space character is taken to mean literal space, and # to mean literal #.

Fortunately most languages support //x or equivalent, which just allows spaces and comments inside regular expressions, greatly improving their readability.

This might be the nicest solution so far.

Solution 9

def parse_date(s)
  if %r[
      (?<year>\d\d\d\d)-(?<month>\d\d)-(?<day>\d\d) |
      (?<year>\d\d\d\d)/(?<month>\d\d)/(?<day>\d\d) |
      (?<day>\d\d)/(?<month>\d\d)/(?<year>\d\d\d\d)
    ]x =~ s
    [year.to_i, month.to_i, day.to_i]
  end
end

And finally, Ruby has a solution, which I feel rather uncomfortable with, of making regular expression directly assign results to local variables. Regular expressions can come from users, and overwriting random local variables is a recipe for disaster. In this case we need to explicitly opt-into this by doing /rx/ =~ s instead of s =~ /rx/ or case s when /rx/, but this is in no way obvious, and it really wouldn't surprise me if there were some serious undiscovered security vulnerabilities exploiting this.

Because we need explicit opt-in, we can't do case s when /rx/, so if we had multiple branches we couldn't unify, we'd need to repeat elsif /.../ =~ s a few times.

Other than that small problem, this looks really nice.

Story so far

All the code is on GitHub.

Coming next

In the next episode we'll see how other languages handle this problem.