Tomasz Węgrzanowski
Tomasz Węgrzanowski's Blog

Tomasz Węgrzanowski's Blog

Open Source Adventures: Episode 80: Exploring Python Regular Expression API

Tomasz Węgrzanowski's photo
Tomasz Węgrzanowski
·Sep 12, 2022·

5 min read

In previous two episodes I explored regular expression APIs of Ruby, and Crystal, so let's do the same exercise in Python.

The problem is the same - there's multiple date formats, and we want to extract information from whichever one matches.

I'm doing it with just 3 regular expressions, but in real world there could be hundreds. Doing it naively with a list of regular expressions would require massive code duplication, and a lot of calls to regular expression engine, which is generally dramatically slower than just matching a|b|c|... once.

The Problem

Here's the test case:

#!/usr/bin/env python3

import re

def parse_date(s):
  ...

for date in ["2015-05-25", "2016/06/26", "27/07/2017"]:
  print(parse_date(date))

And expected output:

[2015, 5, 25]
[2016, 6, 26]
[2017, 7, 27]

Solution 1

def parse_date(s):
  m = re.search(r'(\d\d\d\d)-(\d\d)-(\d\d)', s)
  if m:
    return [int(m[1]), int(m[2]), int(m[3])]
  else:
    m = re.search(r'(\d\d\d\d)/(\d\d)/(\d\d)', s)
    if m:
      return [int(m[1]), int(m[2]), int(m[3])]
    else:
      m = re.search('(\d\d)/(\d\d)/(\d\d\d\d)', s)
      if m:
        return [int(m[3]), int(m[2]), int(m[1])]

The naive solution in Python is just awful. Python doesn't have any kind of $~, $1, etc. so we need to assign to a match variable. This isn't a big deal, most languages don't have $~, so we'd normally be able to do a chain of if m = re.search(...) followed by elif m = re.search(...) etc.

Python however, is not like most languages, and doesn't treat assignment as expression, so if m = re.search(...) is not allowed. So porting code from other languages to Python just results in truly disgusting code with ever increasing indentation. It already looks bad with 3, just imagine 100 (even with 2 space indentation, I'm not doing 4 spaces on the blog, as it looks unreadable on narrow screens).

Solution 2

def parse_date(s):
  if m := re.search(r'(\d\d\d\d)-(\d\d)-(\d\d)', s):
    return [int(m[1]), int(m[2]), int(m[3])]
  elif m := re.search(r'(\d\d\d\d)/(\d\d)/(\d\d)', s):
    return [int(m[1]), int(m[2]), int(m[3])]
  elif m := re.search('(\d\d)/(\d\d)/(\d\d\d\d)', s):
    return [int(m[3]), int(m[2]), int(m[1])]

Python 3.8 finally added a feature which all other languages had, but instead of making = assignment an expression, it introduced a new and very limited :=.

We still have considerable code duplication here, but it's much better than the pre-3.8 solution.

Solution 3

def parse_date(s):
  m = re.search(r'(\d\d\d\d)-(\d\d)-(\d\d)', s)
  if m:
    return [int(m[1]), int(m[2]), int(m[3])]
  m = re.search(r'(\d\d\d\d)/(\d\d)/(\d\d)', s)
  if m:
    return [int(m[1]), int(m[2]), int(m[3])]
  m = re.search('(\d\d)/(\d\d)/(\d\d\d\d)', s)
  if m:
    return [int(m[3]), int(m[2]), int(m[1])]

Because we're returning from every branch anyway, we don't actually need to do else. I think this is actually a more realistic pre-3.8 solution, but it only works because we extracted the logic to top level of a function or method, it's not always possible.

Solution 4

def parse_date(s):
  if m := re.search(r'(\d\d\d\d)-(\d\d)-(\d\d)', s) or re.search(r'(\d\d\d\d)/(\d\d)/(\d\d)', s):
    return [int(m[1]), int(m[2]), int(m[3])]
  elif m := re.search('(\d\d)/(\d\d)/(\d\d\d\d)', s):
    return [int(m[3]), int(m[2]), int(m[1])]

We can reduce code duplication by doing m := rx1 or rx2.

Solution 5:

def parse_date(s):
  if m := re.search(r'(\d\d\d\d)-(\d\d)-(\d\d)|(\d\d\d\d)/(\d\d)/(\d\d)', s):
    return [int(m[1] or m[4]), int(m[2] or m[5]), int(m[3] or m[6])]
  elif m := re.search('(\d\d)/(\d\d)/(\d\d\d\d)', s):
    return [int(m[3]), int(m[2]), int(m[1])]

We can reduce number of calls to the regular expression engine, and extract whichever group matches with m[num] or m[num].

There's a Python specific gotcha here. In Python empty string in falsey, so (m[1] or m[4]) will evaluate to m[4] not only when m[1] is None, but also when m[1] matches an empty string "". So you can only use this technique if you're sure no group is ever empty.

I'm not even sure if Python has any a or b that only evaluates b if a is specifically None.

Solution 6:

def parse_date(s):
  if m := re.search(r'(\d\d\d\d)-(\d\d)-(\d\d)|(\d\d\d\d)/(\d\d)/(\d\d)|(\d\d)/(\d\d)/(\d\d\d\d)', s):
    return [
      int(m[1] or m[4] or m[9]),
      int(m[2] or m[5] or m[8]),
      int(m[3] or m[6] or m[7]),
    ]

This same technique, with the same gotcha, can be used even if different patterns have different order. Of course it really doesn't scale, and even this is too many indexes.

Solution 7

def parse_date(s):
  if m := re.search(r'''(?x)
    (\d\d\d\d)-(\d\d)-(\d\d) |
    (\d\d\d\d)/(\d\d)/(\d\d) |
    (\d\d)/(\d\d)/(\d\d\d\d)
    ''', s):
    return [
      int(m[1] or m[4] or m[9]),
      int(m[2] or m[5] or m[8]),
      int(m[3] or m[6] or m[7]),
    ]

Python equivalent of //x is (?x) which must be placed immediately at the start of the regular expression, without any preceding whitespace.

Solution 8

def parse_date(s):
  if m := (
      re.search(r'(?P<year>\d\d\d\d)-(?P<month>\d\d)-(?P<day>\d\d)', s) or
      re.search(r'(?P<year>\d\d\d\d)/(?P<month>\d\d)/(?P<day>\d\d)', s) or
      re.search(r'(?P<day>\d\d)/(?P<month>\d\d)/(?P<year>\d\d\d\d)', s)
    ):
    return [int(m["year"]), int(m["month"]), int(m["day"])]

Python has a slightly different syntax for named groups, (?P<name>...) compared with Ruby's (?<name>...).

Solution 9

def parse_date(s):
  if m := re.search(r'''(?x)
    (?P<year>\d\d\d\d)-(?P<month>\d\d)-(?P<day>\d\d) |
    (?P<year>\d\d\d\d)/(?P<month>\d\d)/(?P<day>\d\d) |
    (?P<day>\d\d)/(?P<month>\d\d)/(?P<year>\d\d\d\d)
    ''', s):
    return [int(m["year"]), int(m["month"]), int(m["day"])]

And we can finally reach the solution we wanted - symbolic names at appropriate positions. One regular expression engine call, no code duplication, perfection.

Except it doesn't work in Python at all, as Python requires every group name to be unique.

As far as I know, there's no way to express it in Python as clearly as we did with Ruby/Crystal.

Solution 10

def parse_date(s):
  if m := re.search(r'''(?x)
    (?P<year1>\d\d\d\d)-(?P<month1>\d\d)-(?P<day1>\d\d) |
    (?P<year2>\d\d\d\d)/(?P<month2>\d\d)/(?P<day2>\d\d) |
    (?P<day3>\d\d)/(?P<month3>\d\d)/(?P<year3>\d\d\d\d)
    ''', s):
    return [
      int(m["year1"] or m["year2"] or m["year3"]),
      int(m["month1"] or m["month2"] or m["month3"]),
      int(m["day1"] or m["day2"] or m["day3"]),
    ]

But before we give up, let's try one more thing. Every group name is unique. This looks extremely verbose - and runs into the gotcha that any group matching empty substring would crash it - but it's probably more readable and scalable than using very high numbers.

Story so far

Python definitely disappointed a bit here. There are some solutions that are not too awful, but there's no way to fully avoid code duplication (like solution 9 would if it worked), and all the solutions using or chains only work if none of the match group can be empty, limitations Ruby and Crystal didn't have.

All the code is on GitHub.

Coming next

In the next episode we'll see how other languages handle this problem.

 
Share this