100 Languages Speedrun: Episode 74: Python ANTLR 4

Parser generators existed since the late 1960s, but even today they still see very little use for parsing programming languages. Almost every parser out there is either hand-coded, or uses a parser generator with so many hacks that it might just as well be (Ruby).

The primary reason is that parser generators just aren't good enough. ANTLR is doing its best to be that good enough parser generator.

In terms of its ability to parse things, it's likely the best one out there. ANTLR is built upon different principles (LL) than other parser generators (historically LR, nowadays also occasionally PEG), but we won't be getting into any theory here today. None of the parser generators strictly adheres to these categories anyway.

The worst thing about ANTLR 4 is Java. The generator itself is Java-based, and is supports a short list of languages - with many previously supported languages for ANTLR 1-3 like Ruby no longer there. But worst of all, its API for the usually nice languages like Python is still painfully Java-like, full of boilerplate code, and very unpleasant to use.

Math Language

For this episode I'll implement a language for doing simple math calculations. It can ask for values it doesn't know, and it prints the final results. Here's a few programs for it.

a.math - just testing that all the rules are correct:

300 + 50 * 4 + 80 / 4 - (80 - 30) * 2

miles_to_km.math - very simple unit conversion program:

miles * 1.60934

circle_area.math - a program to check that we'll be asked about r only once, not twice:

3.14159265359 * r * r

A language like that is nothing special, but it's at least tiny bit useful.

Math Language In Action

I'll start a bit backwards, and show our goal first:

$ ./listener/Math.py examples/a.math
420.0
$ ./listener/Math.py examples/miles_to_km.math
Enter value for miles: 69
111.04446
$ ./listener/Math.py examples/circle_area.math
Enter value for r: 20
1256.6370614359998

That's the interactions we want. For every variable in the program, it asks us for its value. Everything is a float, all the usual math rules are followed, and result is automatically printed out.

Dependencies

There are two things we need - antlr itself, which we'll only need to compile .g4 grammar file into a bunch of .py Python modules. And antlr-python3-runtime, which our program will need to run. The antlr-python3-runtime should be added to your program's dependencies.

$ brew install antlr
$ pip3 install antlr4-python3-runtime

`Math.g4`

So here's the grammar file. It will be identical for both versions that follow:

grammar Math;

expr : expr ('+' | '-') term
     | term;

term : term ('*' | '/') factor
     | factor;

factor : '(' expr ')'
       | number
       | identifier;

number: NUM;
identifier: ID;

ID : [a-zA-Z_] [a-zA-Z0-9_]* ;
NUM : '-'? [0-9]+ ( '.' [0-9]*)?;

WS : [ \t\r\n]+ -> skip; // ignore all whitespace

To compile it, run antlr -Dlanguage=Python3 Math.g4. It will then generate three files MathLexer.py, MathParser.py, and MathListener.py, which you can import like any Python modules. Also a few extra debug files that you won't need for anything, I'm not really sure why it bothers generating them by default.

So let's explain the grammar:

grammar Name; - should generally match the file name
rule : definition; - how various parts of the grammar are defined, parser rules are lower case, lexer rules are upper case
expr : expr ('+' | '-') term | term; - this means expr is one of: expr + term, expr - term, or term
term : term ('*' | '/') factor | factor; - this means term is one of: term * factor, term / factor, or factor
factor : '(' expr ')' | number | identifier - this means is one of: ( expr ), number, or identifier
number is a float, by regular expression
identifier is any name, by regular expression
WS is whitespace, we do -> skip as special rule to tell ANTLR to just ignore it between tokens

Now if you somehow remember anything about parsing theory, you might have noticed that expr and term definitions are "left recursive", that is they start by referring to themselves. Traditional LL parsers can't deal with it, because they'd work as this:

is what follows expr? let's find out! First alternative is expr ('+' | '-') term, so lets check this
is what follows expr ('+' | '-') term? Let's do this one at a time, starting with expr
is what follows expr? Oh wait, we're in an infinite loop

So traditionally, if you wanted to use LL grammar, you'd need to rewrite such rules, for example to expr: term (('+'|'-') term)*. And internally it's sort of true, but ANTLR 4 does all of that behind the scenes, supporting every directly left recursive grammar.

It still doesn't automatically support indirect left recursion (a starts with b ... and b starts with a ...) - so for such grammars you'd need to give it a hand, but directly left recursive grammars are extremely common (like, everything that includes math expressions), while indirectly left recursive grammars are rare. And all the usual tricks for dealing with them still work.

How ANTLR works

We are ready to parse things, but to write a program to use the parser we need to do some extra work. ANTLR 4 supports three main ways:

First, we can put code directly inside .g4 files - this is how most parser generators work, but ANTLR 4 hates this idea. Python target for ANTLR 4 hates it even more, as it just refuses to handle Python indentation in any sensible way. It's weird just how much they try to make you not do this, especially since this is how previous versions of ANTLR worked, it's how most other parser generators work, and it's generally the most concise way. They give stupid arguments that then you can't share your grammar between different languages, but that's almost never relevant. Anyway, I'm not going to force ANTLR to do it this way, at least not for this episode.

Second, we can construct a tree, and then use a "visitor" pattern. A visitor is passed top level node corresponding to the whole program, and we manually go down the parse tree whichever way we want. The best thing about it is that the visitor naturally lends itself to a series of methods like parseSomeType, each of which parses node of someType and returns whatever is appropriate. The downside is that it's a lot of boilerplate code for the recursion.

Third, we can use a "listener", and that's the third file ANTLR generated. With a listener, we still construct a parse tree, but instead of doing recursing ourselves, we have a walker walk the tree and call enterNodeType and exitNodeType for us - with all of these defaulting to empty functions, so we only need to override the ones we want. The downside is that we cannot return anything - it turns the whole program into a series of enter and exit events, we need to pass the data by some side channel.

There are some other ways, but let's stick to that.

Visitor

Here's implementation of our language following the "visitor" pattern:

#!/usr/bin/env python3

from antlr4 import *
from MathLexer import MathLexer
from MathParser import MathParser
import sys

class MathProgram:
  def __init__(self, program):
    self.program = program

  def evalExpr(self, node):
    children = node.children
    if len(children) == 3:
      a = self.evalExpr(children[0])
      b = children[1].getText()
      c = self.evalTerm(children[2])
      if b == "+":
        return a + c
      else:
        return a - c
    else:
      return self.evalTerm(children[0])

  def evalTerm(self, node):
    children = node.children
    if len(children) == 3:
      a = self.evalTerm(children[0])
      b = children[1].getText()
      c = self.evalFactor(children[2])
      if b == "*":
        return a * c
      else:
        return a / c
    else:
      return self.evalFactor(children[0])

  def evalFactor(self, node):
    children = node.children
    if len(children) == 3:
      return self.evalExpr(children[1])
    elif node.number():
      return float(node.number().getText())
    else:
      return self.getVar((node.identifier().getText()))

  def getVar(self, name):
    if name not in self.vars:
      self.vars[name] = float(input(f"Enter value for {name}: "))
    return self.vars[name]

  def run(self):
    self.vars = {}
    result = self.evalExpr(self.program)
    print(result)

def parseFile(path):
  lexer = MathLexer(FileStream(path))
  stream = CommonTokenStream(lexer)
  parser = MathParser(stream)
  tree = parser.expr()
  MathProgram(tree).run()

if __name__ == "__main__":
  path = sys.argv[1]
  parseFile(path)

Step by step:

there's considerable boilerplate for what should just be MathParser.parseFile(path).expr()
in ANTLR, there's no concept of "top level" grammar rule, you can start with anything, that's why we explicitly call expr - we can just as well ask for number or anything else. In some other kinds of parser generators, there's a single special top level rule (for example Raku's builtin PEG parsers call it TOP).
the API is really ugly - for most obvious example whoever wrote it decided to go with Java-level cringe .getText() instead of declaring .text as @property. .children only works because it's directly an attribute, officially they'd rather you use the same Java cringe .getChildren(), .getChildCount() etc.
if we could put code directly in the grammar, we could do expr: expr '+' term { code } | expr '-' term { code } | term { code }, so we wouldn't need to figure out which branch we followed the second time, even though the parser already did that
all the self.evalSomething(children[i]) feels like annoying boilerplate code

Listener

The alternative to this is "listener" pattern. Which sounds sort of like we're heading for a SAX-style XML parser, but it's not really true. We have access to the whole tree available at all times. Here's the implementation:

#!/usr/bin/env python3

from antlr4 import *
from MathLexer import MathLexer
from MathParser import MathParser
from MathListener import MathListener
import sys

class MathProgram(MathListener):
  def exitNumber(self, node):
    value = float(node.getText())
    self.stack.append(value)

  def exitIdentifier(self, node):
    value = self.getVar(node.getText())
    self.stack.append(value)

  def exitTerm(self, node):
    if len(node.children) == 3:
      b = self.stack.pop()
      a = self.stack.pop()
      if node.children[1].getText() == "*":
        self.stack.append(a * b)
      else:
        self.stack.append(a / b)

  def exitExpr(self, node):
    if len(node.children) == 3:
      b = self.stack.pop()
      a = self.stack.pop()
      if node.children[1].getText() == "+":
        self.stack.append(a + b)
      else:
        self.stack.append(a - b)

  def getVar(self, name):
    if name not in self.vars:
      self.vars[name] = float(input(f"Enter value for {name}: "))
    return self.vars[name]

  def run(self, node):
    self.stack = []
    self.vars = {}
    ParseTreeWalker().walk(self, node)
    result = self.stack[0]
    print(result)

def parseFile(path):
  lexer = MathLexer(FileStream(path))
  stream = CommonTokenStream(lexer)
  parser = MathParser(stream)
  tree = parser.expr()
  MathProgram().run(tree)

if __name__ == "__main__":
  path = sys.argv[1]
  parseFile(path)

You need to figure out how you'll handle the data. Because how simple our language is, we could get away with just self.stack, and pushing and popping it.

Step by step:

it's even more boilerplate to get started, additional ParseTreeWalker().walk(self, node) is needed, even though we inherit from a base class that really could have that logic itself
the API reeks of Java just much as the previous one
we managed to skip all the enter nodes, and most of the exit nodes - we can also skip the rules like expr: term, term: factor, and factor: ( expr ) as numerical value isn't changed there - so at least for this simple program it's less verbose than the visitor version
you need to figure out how you'll be passing data around on your own - there's no universal solution

These are just two basic design patterns recommended by ANTLR. You can absolutely use another pattern, like for example add extra methods directly to each node type and run that, put code directly into the grammar, or some even different way.

Should you use Python ANTLR 4?

ANTLR 4 is probably the most powerful parser generator out there, so in this sense it's a good choice, even if the API is just nasty.

Usually parsing will be only a small part of your program, so you can pinch your nose and get over the foul stench of Java. It might very well be a better choice than writing a recursive descent parser yourself.

As for improving ANTLR 4 itself, there are two ways. From what I've seen, porting ANTLR 4 to a new language would require some amount of effort, but very little difficulty. It's basically around 10k lines of code to rewrite class-by-class, to implement ANTLR runtime in your language, and far less to make code generator work. You won't need to understand any of the underlying theory. It seems like a reasonable side project for a few weeks.

The other way, and something I really hope someone does, is to figure out some design patterns that fit non-Java languages better, and pull request those.

For Python, it's just crazy that these aren't supported:

node.text
SomeParser().parseFile(path).aNodeType()

But it would take a lot more than that to make it feel like it fits the language.

Code

All code examples for the series will be in this repository.

Code for the Python ANTLR 4 episode is available here.