100 Languages Speedrun: Episode 68: Raku (Perl 6) Grammars

I covered Raku and Raku regular expressions before, now it's time to complete this with an episode about Raku grammars.

Processing text is one of the most common tasks in programming, so most languages since Perl come with some kind of regular expression system. Generally the basic regexps are the same, but each language comes with some unique extensions. Raku notably does not keep the traditional regexp syntax even for the common things, but conceptually they're mostly the same.

A regular expression can extract information from a string based on some pattern, but often we need something more advanced. So a lot of languages also have some way to define tokenizers, which chop the string into pieces based on multiple regular expressions. Notably Perl's /gc flag, and Ruby StringScanner.

Raku takes it one step further, and supports defining grammars as part of the language. Pretty much every other language needs external tools like Yacc or ANTLR for that.

Named regular expressions

Before we get to grammars, let's take a tiny detour to regular expressions. In addition to defining them in one go, we can also define them in pieces:

#!/usr/bin/env raku

my regex digit { <[0..9]> };
my regex nonzero_digit { <[1..9]> };
my regex byte {
  [
  | <digit>                 # 0-9
  | <nonzero_digit> <digit> # 10-99
  | 1 <digit> <digit>       # 100-199
  | 2 <[0..4]> <digit>      # 200-249
  | 25 <[0..5]>             # 250-255
  ]
};
my regex ipv4 { ^ <byte> ** 4 % "." $ };

my @examples = (
  "1.2.3.4",
  "127.0.0.1",
  "100.200.300.400",
  "02.03.04.05",
  "1.2.3.4.5",
  "it's a kitty!",
  "69.420.69.420",
  "10.20.30.40 ",
  "127.0.0.1.",
  "๓.๓.๓.๓", # \d bug still there
);
for @examples {
  say "VALID: '$_'" if $_ ~~ &ipv4;
  say "NOT VALID: '$_'" if $_ !~~ &ipv4;
}

Initially I thought I could even define some mutually recursive regular expressions, and get simple parsers this way, but that is not actually supported.

Notice unusual sigils we need to use in this case - <digit> inside regular expression, and &ipv4 outside it.

Key Value Grammar

Let's start with extremely simple grammar, one that matches a single key value pair, with a single word on both sides:

#!/usr/bin/env raku

grammar KV {
  rule TOP { ^ <key> "=" <value> $ };
  regex id { <[a..zA..Z0..9]> + };
  regex key { <id> };
  regex value { <id> };
};

my @examples = (
  "cat=cute",
  "4=5",
  "cat = fluffy",
  "cat",
  "cat=dog=ferret",
  "c a t = d o g",
);
for @examples {
  if KV.parse($_) {
    say "PARSED: $_";
    say "MATCH: $/";
    say "KEY IS: $/.<key>";
    say "VALUE IS: $/.<value>";
    say "PARSE TREE:";
    say $/;
  } else {
    say "FAILED: $_";
  }
  say "";
}

$ ./kv.raku
PARSED: cat=cute
MATCH: cat=cute
KEY IS: cat
VALUE IS: cute
PARSE TREE:
｢cat=cute｣
 key => ｢cat｣
  id => ｢cat｣
 value => ｢cute｣
  id => ｢cute｣

PARSED: 4=5
MATCH: 4=5
KEY IS: 4
VALUE IS: 5
PARSE TREE:
｢4=5｣
 key => ｢4｣
  id => ｢4｣
 value => ｢5｣
  id => ｢5｣

PARSED: cat = fluffy
MATCH: cat = fluffy
KEY IS: cat
VALUE IS: fluffy
PARSE TREE:
｢cat = fluffy｣
 key => ｢cat｣
  id => ｢cat｣
 value => ｢fluffy｣
  id => ｢fluffy｣

FAILED: cat

FAILED: cat=dog=ferret

FAILED: c a t = d o g

Let's cover everything that's going on here, and it's a lot:

we can define grammar sort of like we declare a class
tokens are declared with regex (or token, which has slightly different rules)
parse rules are defined with rule, and can refer to other rules, named regexps, or just plain strings
parse rule TOP is the default rule for the whole grammar
by default, whitespace is allowed between tokens, notice cat = fluffy parsed even though we didn't say anything about optional spaces
this does not happen within the regex, notice how c a t = d o g failed to parse
we can override this whitespace parsing by redefining ws regexp
we can use KV.parse($string) or KV.parsefile($filename) to parse a string or a file
parse result is returned from KV.parse, and also saved to $/, the same one used for regexp matching
we can navigate the parse tree with .<name>, we can also stringify any part of it, it will return the matching part

If we use grammars this way, we don't really gain much over just using regular expressions.

Actions

The second part of parsing is actions associated with each rule. In many parsing languages, grammars and their actions are defined together. In Raku they're defined separately, which makes it easy to have multiple sets actions for the same grammar.

#!/usr/bin/env raku

grammar KV {
  rule TOP { ^ <key> "=" <value> $ };
  regex id { <[a..zA..Z0..9]> + };
  regex key { <id> };
  regex value { <id> };
};

class KVActions {
  method TOP($/) {
    make { $/.<key>.made => $/.<value>.made }
  }
  method key($/) { make $/.<id>.made }
  method value($/) { make $/.<id>.made }
  method id($/) { make $/.Str }
};

my @examples = (
  "cat=cute",
  "4=5",
  "cat = fluffy",
  "cat",
  "cat=dog=ferret",
  "c a t = d o g",
);
for @examples {
  if KV.parse($_, actions => KVActions) {
    say "PARSED: $_";
    say "RESULT: ", $/.made;
  } else {
    say "FAILED: $_";
  }
  say "";
}

$ ./actions.raku
PARSED: cat=cute
RESULT: {cat => cute}

PARSED: 4=5
RESULT: {4 => 5}

PARSED: cat = fluffy
RESULT: {cat => fluffy}

FAILED: cat

FAILED: cat=dog=ferret

FAILED: c a t = d o g

What's going on:

we define actions with class KVActions { } and each action with method name($/) { ... }, name corresponding to a rule in the grammar
by the way this is regular class, so you can do regular OO stuff like inheriting from it, override specific actions, metaprogram etc.
there are many ways to use it, but generally to create a token you do make somedata, and it will return match object with somedata associated with given rule
you can access subtrees and their associated data with $/.<subrule>.made
you can access text of what was matched by different subtrees with $/.<subrule>.Str
once you parse the whole tree, you can do $/.made to access result of the whole parsing
this is just the usual way of doing things, there are of course many other ways

RPN Calculator

RPN Calculator grammar is very easy - it's just a list of tokens, with optional spaces between them ignored. We could just use a tokenizer for this, but let's use a Raku grammar.

#!/usr/bin/env raku

grammar RPN {
  rule TOP { ^ <tok>* $ };
  rule tok { <number> | <op> };
  regex number { "-"? <[0..9]>+ ["." <[0..9]>*]? };
  regex op { "+" | "-" | "*" | "/" | "in" | "out" };
}

class RPNActions {
  method TOP($/) { make [$<tok>.map(*.made)] }
  method tok($/) { make $<number> ?? $<number>.made !! $<op>.made }
  method number($/) { make {type => "number", value => +$/.Str} }
  method op($/) { make {type => $/.Str} }
}

sub run($m) {
  my @stack = ();
  for @$m {
    my $type = $_{'type'};
    my $value = $_{'value'};
    given $type {
      when "number" { @stack.push($value) }
      when "in" { @stack.push(+$*IN.get) }
      when "/" { my $a = @stack.pop; my $b = @stack.pop; @stack.push($b / $a) }
      when "*" { my $a = @stack.pop; my $b = @stack.pop; @stack.push($b * $a) }
      when "+" { my $a = @stack.pop; my $b = @stack.pop; @stack.push($b + $a) }
      when "-" { my $a = @stack.pop; my $b = @stack.pop; @stack.push($b - $a) }
      when "out" { say @stack.pop }
      default { die "Unknown operator $type" }
    }
  }
}

unless @*ARGS == 1 {
  note "Usage: $*PROGRAM-NAME <file.rpn>";
  exit 1;
}
my $path = @*ARGS[0];

if RPN.parsefile($path, actions => RPNActions) {
  run $/.made
} else {
  note "Parse error";
}

$ ./rpn.raku rpn/temperature_c_to_f.rpn
0
32
$ ./rpn.raku rpn/temperature_c_to_f.rpn
100
212
$ ./rpn.raku rpn/temperature_c_to_f.rpn
25.7
78.26
$ ./rpn.raku rpn/mile_to_km.rpn
1000
1609.34

What's going on:

grammar has very few simple rules - optional whitespaces between tokens are implied because we didn't turn them off. I don't think this should really be the default, as it makes grammar ambiguous (what is whitespace? just spaces and tabs? newlines? any exotic Unicode stuff?), but it's useful for these toy cases
$<tok> is the same as $/.<tok>, for both normal regexps and grammars
actions for op and number are simple enough, + operator in Raku converts to a number
action for tok already looks bad with just two cases, and it would look awful if we listed every kind of op separately. Raku has better syntax to deal with this alternatives, because it absolutely needs it.
action for TOP is interesting - as we had <tok>*, $<tok> is already an array - as we only care about tokens made, we just .map(*.made) it. Due to Raku having Perl-style scalar vs list context distinction, we need to wrap that in extra []s.
the whole process returns single $/, with array of tokens as that's what we constructed
it's not really part of the grammar, but I added simple run function to run our RPN program

Parser Feedback

OK, let's do something more complicated, and also let's increase our expectations as to what we want the parser to do. Here's a simple grammar for sums expressions, and yes, it is ambiguous as 2+3+4 can be parsed as 2+(3+4) or (2+3)+4. But we won't be feeding it any such input:

#!/usr/bin/env raku

grammar Math {
  rule TOP { ^ <expr> $ };
  rule expr { <number> | <parens> | <operation> };
  rule parens { "(" <expr> ")" };
  rule operation { <expr> <op> <expr> };
  regex number { "-"? <[0..9]>+ ["." <[0..9]>*]? };
  regex op { "+" };
};

my @examples = (
  "100",
  "-4.20",
  "((5))",
  "2+3",
  "(2+3)",
);

for @examples {
  if Math.parse($_) {
    say $/
  } else {
    note "Parse error: $_"
  }
}

Let's give it a try:

$ ./math.raku
｢100｣
 expr => ｢100｣
  number => ｢100｣
｢-4.20｣
 expr => ｢-4.20｣
  number => ｢-4.20｣
｢((5))｣
 expr => ｢((5))｣
  parens => ｢((5))｣
   expr => ｢(5)｣
    parens => ｢(5)｣
     expr => ｢5｣
      number => ｢5｣
Parse error: 2+3
^C

It completely fails to parse 2+3, and it just hangs there on (2+3). What?

Grammar Debugging

I was really baffled by what was going on. I could make the grammar completely unambiguous in a few ways, but that didn't solve the problem at all - different grammars I'd write would result in different baffling issues.

Eventually it occurred to me that this is neither LL-style or LR-style parsing I was expecting, but PEG, which is highly sensitive to order of alternatives. It basically YOLOs any ambiguity by always picking the first matching alternative as soon as it can, even if it then can't match the rest. I only have experience with LR and LL parsing, not PEG, so I dropped my idea of writing any big grammars in Raku.

Anyway, the problem isn't that I run into such issues, or that it's PEG parsing. The problem is that traditional parser generator tools normally come with tools to debug grammar issues; and if you write recursive descend parser by hand, it's just functions calling other functions which you can debug the usual ways. Raku just lacks such tools. If you have problems with your grammar, tough luck.

And while this problem would apply to every kind of parsing, PEG grammars are far more sensitive to tiny changes than traditional LL and LR grammars, so you'll need debugging tools more.

Working Sums Grammar

I eventually figured out one way to make PEG grammars work for simple sums. The problem was that Raku was of no help and I had to figure it out from just the theory:

#!/usr/bin/env raku

grammar Math {
  rule TOP { ^ <expr> $ };
  rule expr { <value> + % <op> };
  rule value { <parens> | <number> };
  rule parens { "(" <expr> ")" };
  regex number { "-"? <[0..9]>+ ["." <[0..9]>*]? };
  regex op { "+" };
};

my @examples = (
  "100",
  "-4.20",
  "((5))",
  "2+3",
  "(2+3)",
  "2+3+4",
);

for @examples {
  if Math.parse($_) {
    say $/
  } else {
    note "Parse error: $_"
  }
}

$  ./math2.raku
｢100｣
 expr => ｢100｣
  value => ｢100｣
   number => ｢100｣
｢-4.20｣
 expr => ｢-4.20｣
  value => ｢-4.20｣
   number => ｢-4.20｣
｢((5))｣
 expr => ｢((5))｣
  value => ｢((5))｣
   parens => ｢((5))｣
    expr => ｢(5)｣
     value => ｢(5)｣
      parens => ｢(5)｣
       expr => ｢5｣
        value => ｢5｣
         number => ｢5｣
｢2+3｣
 expr => ｢2+3｣
  value => ｢2｣
   number => ｢2｣
  op => ｢+｣
  value => ｢3｣
   number => ｢3｣
｢(2+3)｣
 expr => ｢(2+3)｣
  value => ｢(2+3)｣
   parens => ｢(2+3)｣
    expr => ｢2+3｣
     value => ｢2｣
      number => ｢2｣
     op => ｢+｣
     value => ｢3｣
      number => ｢3｣
｢2+3+4｣
 expr => ｢2+3+4｣
  value => ｢2｣
   number => ｢2｣
  op => ｢+｣
  value => ｢3｣
   number => ｢3｣
  op => ｢+｣
  value => ｢4｣
   number => ｢4｣

This could be extended to other operators, and in the end my solution is somewhat similar to what an LL grammar would look like.

Should you use Raku Grammars?

I definitely support the idea of adding grammars to the language, and it's already useful enough for libraries like JSON::Tiny, but right now I don't think this is good enough execution yet.

You as grammar writer won't get any feedback about the issues. And users won't get any feedback about why their input isn't matching.

For anyone interested in implementing grammar parsing in their own language, I'd recommend going the LL(*) route like ANTLR does it, not PEG route. PEG grammars can be good enough for some things, but they're so much harder to write and debug, and a lot more restrictive. But either way, you really need to work on your tooling a lot more than Raku did.

Code

All code examples for the series will be in this repository.

Code for the Raku (Perl 6) Grammars episode is available here.