100 Languages Speedrun: Episode 55: Better Thue Interpreter in Crystal

When I started the series I said that some of the episodes will be about creating a new programming language, but so far it's just been RPN calculators.

So let's create a better Thue! I made an episode about Thue, so read that for some background.

Thue is a fun esoteric language, but suffers from two annoying issues:

  • newline handling requires special weird rules
  • a lot of nearly identical rules needed to be copied and pasted as there are no wildcard matches

I'll be using Crystal to code the interpreter(taw.hashnode.dev/100-languages-speedrun-epi..).

Basic Interpreter

Before I start adding new features, let's write a base interpreter.

class ThueRule
  getter :left

  def initialize(@left : String, @right : String)
  end

  def apply(str, idx)
    before = str[0, idx]
    after = str[idx+@left.size .. -1]

    if @right == "~"
      print "\n"
      replacement = ""
    elsif @right[0] == '~'
      print @right[1..-1]
      replacement = ""
    elsif @right == ":::"
      replacement = STDIN.gets.not_nil!.chomp
    else
      replacement = @right
    end

    before + replacement + after
  end

  def to_s(io)
    io << "Rule<#{@left}::=#{@right}>"
  end
end

class ThueProgram
  def initialize
    @rules = [] of ThueRule
    @initial = ""
    @state = ""
  end

  def load(path)
    lines = File.read_lines(path).map(&.chomp).zip(1..)

    while lines.size > 0
      line, line_no = lines.shift
      # Ignoring invalid rules, they are sometimes used as comments
      next unless line =~ /\A(.*)::=(.*)\z/
      break if $1 == ""
      @rules << ThueRule.new($1, $2)
    end

    @initial = lines.map(&.first).join("\n")
  end

  def random_match
    match = nil
    matches_count = 0
    @rules.each do |rule|
      idx = 0
      loop do
        match_idx = @state.index(rule.left, idx)
        break unless match_idx
        idx = match_idx + 1
        matches_count += 1
        next unless rand(matches_count) == 0
        match = {rule: rule, idx: match_idx}
      end
    end
    match
  end

  def run
    @state = @initial
    while match = random_match
      rule = match[:rule]
      idx = match[:idx]
      if ENV["DEBUG"]?
        STDERR.puts "Applying rule #{rule} at #{idx} to #{@state.inspect}"
      end
      @state = rule.apply(@state, idx)
    end
  end
end

unless ARGV[0]
  STDERR.puts "Usage: #{$0} <file.thue>"
  exit 1
end

prog = ThueProgram.new
prog.load(ARGV[0])

# Crystal doesn't handle SIGPIPE well and we want to support:
# crystal thue.cr examples/fizzbuzz.thue | head -n 100
begin
  prog.run
rescue e : IO::Error
  exit if e.os_error == Errno::EPIPE
  raise e
end

Step by step:

  • ThueRule represents a Thue rule like a::=b
  • ThueRule knows how to apply itself to a string at certain position, so it handles input and output itself
  • ThueProgram is defined as a collection of rules and initial state
  • ThueProgram#load knows how to read itself from Thue file
  • ThueProgram#random_match picks a random rule at random index according to fair algorithm, without building the whole list in memory
  • ThueProgram#run runs the program, starting from initial state
  • if we run it without passing Thue file name, we print short message and exit
  • if STDOUT gets closed, we catch an exception and check if it's the right error, then exit quietly. Crystal lacks any good way to quit on SIGPIPE the way Ruby's trap("PIPE", "EXIT") does it

We can confirm the program is working on a few test programs:

$ crystal thue.cr examples/fizzbuzz.thue | head -n 10
1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
$ crystal thue.cr examples/many_dice2.thue
6,4,2
$ DEBUG=1 crystal thue.cr examples/hello.thue
Applying rule Rule<_::=~Hello, World!> at 0 to "_"
Hello, World!

Better newline handling

The first issue with Thue I want to address is newline handling. A printing rule cannot print newlines, except the ~ rule which prints just the newline. It makes Thue I/O unnecessarily annoying. The easy way to fix it is by supporting escape codes in Rules. \n and \\ are all we need for now. But since we're fixing things we might just as well add one extra escape code \s for space. This is mainly because sometimes Thue code needs space at end of the line, and a lot of editors strip that automatically.

This only required rewriting ThueRule a little. I also added some more output triggered by ENV["DEBUG"].

class ThueRule
  @left : String
  @right : String
  getter :left

  def initialize(left, right)
    @left = process_backslashes(left)
    @right = process_backslashes(right)
    @right = "~\n" if @right == "~"
  end

  def apply(str, idx)
    before = str[0, idx]
    after = str[idx+@left.size .. -1]

    if @right[0] == '~'
      print @right[1..-1]
      replacement = ""
    elsif @right == ":::"
      replacement = STDIN.gets.not_nil!.chomp
    else
      replacement = @right
    end

    before + replacement + after
  end

  def to_s(io)
    io << "Rule<#{@left.inspect}::=#{@right.inspect}>"
  end

  def process_backslashes(s)
    s.gsub(/\\./) do |m|
      if m[1] == 'n'
        "\n"
      else
        m[1]
      end
    end
  end
end

Let's see how nice this is now. We can start with this dice rolling program in improved Thue:

_::=~How many dice do you want to roll (0-9)?\n
<A::=:::
<B::=~Here are your rolls:\n
0::=B
1::=B@
2::=B@@
3::=B@@@
4::=B@@@@
5::=B@@@@@
6::=B@@@@@@
7::=B@@@@@@@
8::=B@@@@@@@@
9::=B@@@@@@@@@
^@::=^%
%::=~1\n
%::=~2\n
%::=~3\n
%::=~4\n
%::=~5\n
%::=~6\n
^$::=~Thank you for playing ThueDice!\n
::=
^<<_A$
$ crystal thue_nl.cr examples_nl/many_dice.thue
How many dice do you want to roll (0-9)?
4
Here are your rolls:
2
5
6
3
Thank you for playing ThueDice!
$ crystal thue_nl.cr examples_nl/many_dice.thue
How many dice do you want to roll (0-9)?
1
Here are your rolls:
4
Thank you for playing ThueDice!
$ crystal thue_nl.cr examples_nl/many_dice.thue
How many dice do you want to roll (0-9)?
9
Here are your rolls:
3
6
6
2
3
6
3
3
1
Thank you for playing ThueDice!

We can also see the new debug output:

$ DEBUG=1 crystal thue_nl.cr examples_nl/many_dice.thue
Rule<"_"::="~How many dice do you want to roll (0-9)?\n">
Rule<"<A"::=":::">
Rule<"<B"::="~Here are your rolls:\n">
Rule<"0"::="B">
Rule<"1"::="B@">
Rule<"2"::="B@@">
Rule<"3"::="B@@@">
Rule<"4"::="B@@@@">
Rule<"5"::="B@@@@@">
Rule<"6"::="B@@@@@@">
Rule<"7"::="B@@@@@@@">
Rule<"8"::="B@@@@@@@@">
Rule<"9"::="B@@@@@@@@@">
Rule<"^@"::="^%">
Rule<"%"::="~1\n">
Rule<"%"::="~2\n">
Rule<"%"::="~3\n">
Rule<"%"::="~4\n">
Rule<"%"::="~5\n">
Rule<"%"::="~6\n">
Rule<"^$"::="~Thank you for playing ThueDice!\n">
Applying rule Rule<"_"::="~How many dice do you want to roll (0-9)?\n"> at 3 to "^<<_A$"
How many dice do you want to roll (0-9)?
Applying rule Rule<"<A"::=":::"> at 2 to "^<<A$"
2
Applying rule Rule<"2"::="B@@"> at 2 to "^<2$"
Applying rule Rule<"<B"::="~Here are your rolls:\n"> at 1 to "^<B@@$"
Here are your rolls:
Applying rule Rule<"^@"::="^%"> at 0 to "^@@$"
Applying rule Rule<"%"::="~4\n"> at 1 to "^%@$"
4
Applying rule Rule<"^@"::="^%"> at 0 to "^@$"
Applying rule Rule<"%"::="~5\n"> at 1 to "^%$"
5
Applying rule Rule<"^$"::="~Thank you for playing ThueDice!\n"> at 0 to "^$"
Thank you for playing ThueDice!
No more matchesn. Final state: ""

Interestingly this kind of debug output is much more difficult in the original Thue, as we'd be printing debug information between line contents and its newline.

And of course we can now have correct Hello, World! that prints that final newline

_::=~Hello, World!\n
::=
_
$ crystal thue_nl.cr examples_nl/hello.thue
Hello, World!

We can also fix the program to say hello to someone, these are updated rules, the other 166 are the same:

!<::=~Hello,\s
^*$::=~!\n
$ crystal thue_nl.cr examples_nl/hello2.thue
Alexander
Hello, Alexander!

Designing a better Thue

Now we face much bigger issue. Thue programs need crazy amount of repetition as every letter we pass needs its own rule. So for example the program that asks for your name (letters only), has rules like these 52 times each, once for every lowercase and uppercase letter:

a<::=<a
^*a::=^*>a
>a::=~a

FizzBuzz program likewise had rules like these, multiplied for every digit:

0✅::=0
🍙0::=0🍙
🖨0::=0🖨📄0
📄0::=~0

There were also 10 rules for adding 1 to a digit, which aren't quite the same, but there's a lot of redundancy:

0🧲::=✅1
1🧲::=✅2
2🧲::=✅3
3🧲::=✅4
4🧲::=✅5
5🧲::=✅6
6🧲::=✅7
7🧲::=✅8
8🧲::=✅9
9🧲::=🧲0

We'd really like to deal with this copypasta. The most obvious idea is to allow character ranges like [0-9] or [a-zA-Z]. But then would we also have to do parentheses () and backreferences \1 so we get rules like these?

([a-zA-Z])<::=<\1

So I had a different idea. The only thing we really need to match is character classes, so why not do this:

[a-zA-Z]<::=<[a-zA-Z]

Doing it like that has nice additional benefit that we can do shifted character classes very neatly, like these rules for addition:

[0-8]🧲::=✅[1-9]
9🧲::=🧲0

We can also repeat the same class multiple times:

🖨[0-9]::=[0-9]🖨📄[0-9]

This supports any N-to-N, and N-to-1 mapping. If you have something more complicated, then you'll need to be creative and use multiple rules. I think this is great common ground, where you don't get any crazy additional powers, but it also saves you from pointless copy&paste programming.

And to make it really clear, all such rules are expanded into their regular form. So 🖨[0-9]::=[0-9]🖨📄[0-9] actually becomes 10 regular Thue rules (well regular plus fixed newline handling).

An extra feature that I didn't really think about, is that you could have a range on the right side only, which in effect means random choice.

Some programs in Better Thue

Before I show the whole interpreter, let's just go through a few programs, as they're much more concise.

Here's the dice rolling program, it has 6 rules of what ROLL can turn into:

^roll::=^ROLL
^,::=^COMMA
^$::=~
ROLL::=~[1-6]
COMMA::=~,
::=
^roll,roll,roll$
$ crystal thue_rx.cr examples_rx/many_dice2.thue
1,5,3

Loop uses triple range rule (🖨[0-9]::=[0-9]🖨📄[0-9]) and different range rule ([0-8]🧲::=✅[1-9]):

# rules for adding one to a digit ::=
# only 9 is special ::=
[0-8]🧲::=✅[1-9]
9🧲::=🧲0
# if extra carry arrived, add leading 1 ::=
^🧲::=^✅1
# we're done, so just let ✅ pass through ::=
[0-9]✅::=✅[0-9]
# print the result ::=
# first activate each letter with 📄, but leave a copy behind ::=
^✅::=^🖨
🖨[0-9]::=[0-9]🖨📄[0-9]
# now print the activated letter ::=
📄[0-9]::=~[0-9]
# once printing finished, switch to increment mode 🧲
🖨$::=🏁🧲$
🏁::=~
::=
^🖨1$
$ crystal thue_rx.cr examples_rx/addone.thue
419
420

Hello says hello to you. It's the program that benefited the most:

_::=:::
# let the < pass through to the left ::=
[a-zA-Z]<::=<[a-zA-Z]
# if < reached ^ then we are ready to start printing ::=
# we need this step so it waits for user input before it starts to print ::=
^<::=^*!<
!<::=~Hello,\s
# activate print letter command ::=
^*[a-zA-Z]::=^*>[a-zA-Z]
# execute print letter ::=
>[a-zA-Z]::=~[a-zA-Z]
# we're done, print exclamation mark and newline ::=
^*$::=~!\n
::=
^_<$
$ crystal thue_rx.cr examples_rx/hello2.thue
Bitcoinella
Hello, Bitcoinella!

Loop loops forever:

# rules for adding one to a digit ::=
# only 9 is special ::=
[0-8]🧲::=✅[1-9]
9🧲::=🧲0
# if extra carry arrived, add leading 1 ::=
^🧲::=^✅1
# we're done, so just let ✅ pass through ::=
[0-9]✅::=✅[0-9]
# print the result ::=
# first activate each letter with 📄, but leave a copy behind ::=
^✅::=^🖨
🖨[0-9]::=[0-9]🖨📄[0-9]
# now print the activated letter ::=
📄[0-9]::=~[0-9]
# once printing finished, switch to increment mode 🧲
🖨$::=🏁🧲$
🏁::=~
::=
^🖨1$
$ crystal thue_rx.cr examples_rx/loop.thue | head -n 20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

And FizzBuzz obviously FizzBuzzes:

# rules for adding one to a digit ::=
# only 9 is special ::=
[0-8]🧲::=✅[1-9]
9🧲::=🧲0
# if number ended but extra carry arrived, add leading 1 ::=
^🧲::=^✅1
,🧲::=,✅1
# If we reach a comma that means another number starts ::=
# we also want to increment it so switch back to 🧲 mode ::=
,✅::=🧲,
# we're done, so just let ✅ pass through ::=
[0-9]✅::=✅[0-9]
# start the printer ::=
# it's simpler to look at both counter at once ::=
# we reset the counters, then start printer in proper mode ::=
^✅::=^📆
📆1,1,::=1,1,🖨
📆2,1,::=2,1,🖨
📆3,1,::=0,1,🍙📄F
📆1,2,::=1,2,🖨
📆2,2,::=2,2,🖨
📆3,2,::=0,2,🍙📄F
📆1,3,::=1,3,🖨
📆2,3,::=2,3,🖨
📆3,3,::=0,3,🍙📄F
📆1,4,::=1,4,🖨
📆2,4,::=2,4,🖨
📆3,4,::=0,4,🍙📄F
📆1,5,::=1,0,🍙📄B
📆2,5,::=2,0,🍙📄B
📆3,5,::=0,0,🍙📄X
# if we have 📄F, 📄B, or 📄X, we just print that ::=
# then we pass 🍙 to the right without printing number ::=
📄F::=~Fizz
📄B::=~Buzz
📄X::=~FizzBuzz
🍙[0-9]::=[0-9]🍙
# print the result if we're in number print mode ::=
# first activate each letter with 📄, but leave a copy behind ::=
🖨[0-9]::=[0-9]🖨📄[0-9]
# now print the activated letter ::=
📄[0-9]::=~[0-9]
# once printing finished, switch to increment mode 🧲
🖨$::=🏁🧲$
🍙$::=🏁🧲$
🏁::=~
::=
^0,0,0🧲$
$ crystal thue_rx.cr examples_rx/fizzbuzz.thue | head -n 20
1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
16
17
Fizz
19
Buzz

For FizzBuzz it looks like the list of 15 remainders combinations could use some more complex regexps to be made more readable, but I guess that's a puzzle you'll need to solve yourself.

Final Source Code

Here's the final source code:

class ThueRule
  getter :left

  def initialize(@left : String, @right : String)
    @right = "~\n" if @right == "~"
  end

  def apply(str, idx)
    before = str[0, idx]
    after = str[idx+@left.size .. -1]

    if @right[0] == '~'
      print @right[1..-1]
      replacement = ""
    elsif @right == ":::"
      replacement = STDIN.gets.not_nil!.chomp
    else
      replacement = @right
    end

    before + replacement + after
  end

  def to_s(io)
    io << "Rule<#{@left.inspect}::=#{@right.inspect}>"
  end
end

class ThueSideParser
  getter :results
  @toparse : Array(Char)

  def initialize(@str : String)
    @results = [""]
    @toparse = @str.chars
    parse
  end

  private def parse
    until @toparse.empty?
      case @toparse[0]
      when '['
        chars = parse_range
        if @results.size == 1
          @results = chars.map{|c| @results[0]+c}
        elsif @results.size == chars.size
          @results = @results.zip(chars).map{|s,c| s+c}
        else
          raise "Sizes of character classes mismatch in #{@str}"
        end
      else
        c = parse_character
        @results = @results.map{|s| s + c}
      end
    end
    @results
  end

  private def parse_character
    if @toparse[0] == '\\'
      @toparse.shift
      raise "Unmatched \\ in #{@str}" if eos?
      c = @toparse.shift
      case c
      when 'n'
        '\n'
      when 's'
        ' '
      else
        c
      end
    else
      @toparse.shift
    end
  end

  private def parse_range
    chars = [] of Char
    @toparse.shift
    loop do
      raise "Character range never closed in #{@str}" if eos?
      if @toparse[0] == ']'
        @toparse.shift
        return chars
      end
      c = parse_character
      raise "Character range never closed in #{@str}" if eos?
      if @toparse[0] == '-'
        @toparse.shift
        e = parse_character
        raise "Invalid character range in #{@str}" if e < c
        chars.concat(c..e)
      else
        chars << c
      end
    end
  end

  private def eos?
    @toparse.empty?
  end
end

class ThueRuleParser
  def initialize(@str : String)
    if @str =~ /\A(.*)::=(.*)\z/
      @valid = true
      @left = $1
      @right = $2
    else
      @left = ""
      @right = ""
      @valid = false
    end
  end

  def valid_rule?
    @valid
  end

  def empty_rule?
    @valid && @left.empty?
  end

  def call
    lefts = ThueSideParser.new(@left).results
    rights = ThueSideParser.new(@right).results

    # Support N-to-1 and 1-to-N rules
    lefts *= rights.size if lefts.size == 1
    rights *= lefts.size if rights.size == 1

    unless lefts.size == rights.size
      raise "Mismatched side of rule #{@str}"
    end

    lefts.zip(rights).map do |left, right|
      ThueRule.new(left, right)
    end
  end
end

class ThueProgram
  def initialize
    @rules = [] of ThueRule
    @initial = ""
    @state = ""
  end

  def load(path)
    lines = File.read_lines(path).map(&.chomp).zip(1..)

    while lines.size > 0
      line, line_no = lines.shift
      # Ignoring invalid rules, they are sometimes used as comments
      parser = ThueRuleParser.new(line)
      next unless parser.valid_rule?
      break if parser.empty_rule?
      @rules.concat parser.call
    end

    @initial = lines.map(&.first).join("\n")
  end

  def random_match
    match = nil
    matches_count = 0
    @rules.each do |rule|
      idx = 0
      loop do
        match_idx = @state.index(rule.left, idx)
        break unless match_idx
        idx = match_idx + 1
        matches_count += 1
        next unless rand(matches_count) == 0
        match = {rule: rule, idx: match_idx}
      end
    end
    match
  end

  def run(debug)
    @state = @initial
    if debug
      @rules.each do |rule|
        STDERR.puts rule
      end
    end

    while match = random_match
      rule = match[:rule]
      idx = match[:idx]
      if debug
        STDERR.puts "Applying rule #{rule} at #{idx} to #{@state.inspect}"
      end
      @state = rule.apply(@state, idx)
    end

    if debug
      STDERR.puts "No more matches. Final state: #{@state.inspect}"
    end
  end
end

unless ARGV[0]
  STDERR.puts "Usage: #{$0} <file.thue>"
  exit 1
end

prog = ThueProgram.new
prog.load(ARGV[0])

# Crystal doesn't handle SIGPIPE well and we want to support:
# crystal thue.cr examples/fizzbuzz.thue | head -n 100
begin
  prog.run(!!ENV["DEBUG"]?)
rescue e : IO::Error
  exit if e.os_error == Errno::EPIPE
  raise e
end

Next steps

The biggest issue with this interpreter is that every step checks every rule, so it's O(number of rules) * O(state size).

It would be fairly straightforward to compile list of states into a finite automaton without backtracking, even if specifics (uniformly random match) are a bit different from the usual matching algorithm, so it's technically not a DFA or an NFA.

Overall I think this language is a much better Thue than the original Thue. Writing actually programs would be just as much of a challenge, but amount of copying and pasting is dramatically lower.

Another Thue with regexps language is Rue, but it supports unlimited regexps, and that makes some coding challenges pretty much trivial.

As for coding it all in Crystal, it was generally a positive experience. The code contains very few pointless type annotations, and I didn't need to torture it into a different shape than I wanted just to make the type checker happy. I mostly chose it over Ruby, as I wanted to do the very fast finite automaton, but in the end I didn't quite get there.

Code

All code examples for the series will be in this repository.

Code for the Better Thue Interpreter in Crystal episode is available here.