100 Languages Speedrun: Episode 08: Reverse Polish Notation Calculator

I couldn't decide between writing a series about trying out 100 programming languages, and a series about creating 100 programming languages, so it ended up as a bit of both.

This post will be our first attempt at creating a new programming language. But first, let's explain a few things.

How programming languages work

In principle, programming language can look like anything, but overwhelming majority of programming languages follow very strict structure:

input is one or more plain text files
a "tokenizer" or "lexer" turns such plain text into a list of "tokens" like 69.420, fizzbuzz, (, or "hello"
this list of "tokens" is then "parsed" into some fancy data structures with all the functions, statements, expressions, and so on
an interpreter or compiler actually runs the program

This is not the only way to do things. If you want to experience some other ways to program, check out Scratch), or Smalltalk images (you can also use Smalltalk with plain text files as well).

The Computer Science trap

Resources about creating programming languages often devolve into a lot of mathy Computer Science stuff about parsing. As someone who knows all that Computer Science stuff very well, I'm going to say it's all a distraction. You don't need any of that. Avoid all that stuff unless it's on your exam, and almost nobody actually uses that in real life.

Almost every real tokenizer is just a bunch of regular expressions and some duct tape, and almost every real parser is a custom made recursive descent code with enormous pile of hacks. Essentially none of the real programming languages in wide use actually conforms to Computer Science ideas like "context-free language" and such.

That's not to say there are zero uses for this, and you can see in episode 7 how this Computer Science magic can be used to create pure Regular Expression FizzBuzz. It just isn't required for creating your own language, and I don't think it's even all that enlightening. I think this focus has been mostly very harmful, making programming languages appear more mysterious than they really are, and scaring people into thinking that they cannot create their own language. You absolutely can, and you should give it a try.

Reverse Polish Notation

As it's our first episode of that, we'll do something really simple - the Reverse Polish Notation Calculator. It has very simple tokenizer we'll build, it doesn't actually need any parser, and then it has very simple interpreter.

Reverse Polish Notation is a way to write math without any parentheses. Any operation is placed after all its arguments. So 3 * (4 + 5) in traditional notation is 3 4 5 + * in Reverse Polish Notation. Parentheses are never needed in RPN, and wouldn't even mean anything.

The best thing about RPN is that we can calculate the result by just going from first to last token, without any complicated backtracking or parsing.

Any number is pushed onto a stack. Any operation like + or * pops the top two numbers from the stack, performs the operation on them, then pushes the result back. So 3 4 5 + * means:

push 3 (stack is now 3)
push 4 (stack is now 3, 4)
push 5 (stack is now 3, 4, 5)
add top two numbers 4 + 5 and push the result (stack is now 3 9)
multiply top two numbers 3 * 9 and push the result (stack is now 27)

Beyond numbers and math, we'll just add two extra instructions in for reading a single number from the user, and out for printing out a number.

This will not be a fully functional programming language, but there are languages like Forth) which expand on this idea into a fully featured language.

Tokenizer

Tokenizer is a program, or part of a program, which walks though the input from start to end, breaking it into small chunks (tokens).

Let's take this program for converting miles to kilometers:

in 1.60934 * out

Which in more familiar notation would be: out(in() * 1.60934).

It would be turned into 4 tokens. Tokens carry three pieces of information:

type
optional value
position (used just for error reporting)

So for our tokenizer, we want to turn this code into something like:

{type: "in" }
{type: "number", value: 1.60934 }
{type: "*" }
{type: "out" }

Or if we also wanted position information:

{type: "in", position: {line_start: 1, char_start: 1, line_end: 1, char_end: 2}}
{type: "number", value: 1.60934, position: {line_start: 1, char_start: 4, line_end: 1, char_end: 10}}
{type: "*", position: {line_start: 1, char_start: 12, line_end: 1, char_end: 12}}
{type: "out", position: {line_start: 1, char_start: 14, line_end: 1, char_end: 16}}

How to write tokenizers

Unless your language is really complicated, a tokenizer is basically a list of regular expressions describing shape of a token, and small bit of code associated with each of them to specify token type and possibly value.

There are whole small languages dedicated for writing such tokenizers. For examples Flex) for C lets you specify such regexp and rule list, even though C language doesn't natively support regular expressions on its own:

%{
#include "y.tab.h"
%}

digit         [0-9]

%%
"+"                  { return PLUS; }
"-"                  { return MINUS; }
"*"                  { return TIMES; }
"/"                  { return SLASH; }
{digit}+             { yylval.num = atoi(yytext);
                       return NUMBER; }
[ \t\n\r]            /* skip whitespace */
.                    { printf("Unknown character [%c]\n",yytext[0]);
                       return UNKNOWN; }
%%

But if we work with a language like Ruby, JavaScript, or Python which has native support for regular expressions, we don't even need that.

Tokenizer with `StringScanner`

Ruby comes with StringScanner library, which lets us build tokenizers very easily. The main thing it does beyond basic Ruby functionality is manage where we currently are in the input, so we don't need to manually manage which parts of the input we scanned and which we didn't.

#!/usr/bin/env ruby

require "strscan"

class RPNTokenizer
  def call(string)
    scanner = StringScanner.new(string)
    tokens = []
    until scanner.eos?
      if scanner.scan(/\s+/)
        # do nothing
      elsif scanner.scan(/-?\d+(?:\.\d*)?/)
        tokens << { type: "number", value: scanner.matched.to_f }
      elsif scanner.scan(/[\+\-\*\/]|in|out/)
        tokens << { type: scanner.matched }
      else
        raise "Invalid character #{scanner.peek}"
      end
    end
    tokens
  end
end

pp RPNTokenizer.new.call("1 -2 3.5 + * in - out")

And it prints just what we expect:

[{:type=>"number", :value=>1.0},
 {:type=>"number", :value=>-2.0},
 {:type=>"number", :value=>3.5},
 {:type=>"+"},
 {:type=>"*"},
 {:type=>"in"},
 {:type=>"-"},
 {:type=>"out"}]

Notable thing is that -2 could be interpreted as either -2 token or two tokens - and 2. Our code does the check in order we put the regexps, so number check happen first, so it comes out as -2. Many tokenizer tools have a different logic and instead of having top to bottom priority, they always go for the longest match, which is generally also good default.

Tokenizer with plain Ruby

Just to show that StringScanner is nothing special, let's reimplement this in plain Ruby:

#!/usr/bin/env ruby

class RPNTokenizer
  def call(string)
    tokens = []
    until string.empty?
      if string =~ /\A\s+/
        string = string[$&.size..-1]
      elsif string =~ /\A-?\d+(?:\.\d*)?/
        tokens << { type: "number", value: $&.to_f }
        string = string[$&.size..-1]
      elsif string =~ /\A[\+\-\*\/]|in|out/
        tokens << { type: $& }
        string = string[$&.size..-1]
      else
        raise "Invalid character #{string[0]}"
      end
    end
    tokens
  end
end

pp RPNTokenizer.new.call("1 -2 3.5 + * in - out")

Not much changed:

we use $& regexp variable for "what matched" (we can also run $1, $2, etc.)
we cut the string after each token, so string only contains the part that wasn't parsed yet (with string = string[$&.size..-1])
every regexp must start with \A (match this at start of the string)

It's a bit more verbose, but there's nothing conceptually new here.

Run the program

For most languages there would be "parsing" step here, but RPN is simple enough we can run the lists of tokens directly, so let's just finish the language:

#!/usr/bin/env ruby

require "strscan"
require "pathname"

class RPNTokenizer
  def call(string)
    scanner = StringScanner.new(string)
    tokens = []
    until scanner.eos?
      if scanner.scan(/\s+/)
        # do nothing
      elsif scanner.scan(/-?\d+(?:\.\d*)?/)
        tokens << { type: "number", value: scanner.matched.to_f }
      elsif scanner.scan(/[\+\-\*\/]|in|out/)
        tokens << { type: scanner.matched }
      else
        raise "Invalid character #{scanner.peek}"
      end
    end
    tokens
  end
end

class RPN
  def initialize(program_path)
    @string = Pathname(program_path).read
    @program = RPNTokenizer.new.call(@string)
  end

  def call
    stack = []
    @program.each do |token|
      case token[:type]
      when "number"
        stack << token[:value]
      when "+"
        b, a = stack.pop, stack.pop
        stack << a + b
      when "*"
        b, a = stack.pop, stack.pop
        stack << a * b
      when "-"
        b, a = stack.pop, stack.pop
        stack << a - b
      when "/"
        b, a = stack.pop, stack.pop
        stack << a / b
      when "in"
        stack << STDIN.readline.to_f
      when "out"
        puts stack.pop
      else
        raise "Invalid token #{token}"
      end
    end
  end
end

RPN.new(ARGV[0]).call

And we can verify that it work:

$ cat mile_to_km.rpn
in 1.60934 * out
$ ./rpn.rb mile_to_km.rpn
20
32.1868
$ cat  temperature_c_to_f.rpn
in 9 5 / * 32 + out
$ ./rpn.rb temperature_c_to_f.rpn
20
68.0

By the way this is intentionally written in fairly verbose way for added clarity.

The reversed B, A order is important for subtraction and division. By traditional RPN convention, 9 5 / means 9/5 not 5/9, even though in principle either of these conventions would be fine.

What should you do next?

I did this in Ruby, but any language with regular expressions will do. Just pick your favorite language, find a regular expression library if you need one, and write some RPN calculator in it. That can be your first programming language!

It's very easy to customize it by adding more types of tokens. Add some extra math (**, %, sin), maybe think about how to add strings, or variables. For something more advanced, can you think of any ways to add loops or defining custom functions? How much would you need to make this language able to do Hello World, Fib, and FizzBuzz?

Code

All code examples for the series will be in this repository.

Code for the Reverse Polish Notation Calculator episode is available here.