100 Languages Speedrun: Episode 15: Awk

Awk is basically a proto-Perl. As Perl was one of the most influential languages of all time (JavaScript, Ruby, and PHP are all Perl's direct descendants), Awk is indirectly quite historically important.

There hasn't been any good reason to use Awk for decades now. As I keep saying over and over, if you write anything nontrivial, just use a real programming language like Ruby, Python, or Perl. But it's still interesting for historical reasons, so let's check what coding was like back in the 1980s.

Hello, World!

Awk scripts are a series of pattern { command } rules, where the pattern is most often a regular expression. If the script contains any such rule, it will be run against each line of input.

Here's one way to say Hello, World! in Awk:

#!/usr/bin/awk -f

/./ { print "Hello, " $1 "!" }

$ seq 1 5 | ./hello.awk
Hello, 1!
Hello, 2!
Hello, 3!
Hello, 4!
Hello, 5!
$ ./hello.awk
World
Hello, World!
Bob Ross
Hello, Bob!

So any non-empty line will result in a hello, since /./ matches any single character, whitespace included. String concatenation is done by just putting a few strings next to each other. "Hello, " $1 "!" is what would be "Hello, " + $1 + "!" or "Hello, " . $1 . "!" or such in a more reasonable language.

Each line is $0, and it's also automatically split into words, so $1 means the first word of the currently processed line, $2 means the second word, etc. Those special variables are used for a regular expression's first, second, etc. match in Perl, Ruby, and some other languages, and I think that's where they came from.
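
To make the field splitting and concatenation concrete, here's a small sketch wrapped in a shell function. The function name show_fields and the sample sentence are made up for illustration; NF (the number of fields) is a standard Awk variable that this episode doesn't otherwise use.

```shell
# Sketch: show how Awk splits one input line into fields.
# The function name show_fields and the sample text are hypothetical.
show_fields() {
  printf '%s\n' "Bob Ross paints" | awk '{
    print "whole line: " $0     # $0 is the full line
    print "first word: " $1     # $1 is the first whitespace-separated field
    print "field count: " NF    # NF is the number of fields
    print "last word: " $NF     # $NF is therefore the last field
  }'
}
show_fields
```

Each print builds its output by placing a string literal next to a variable, which is the same juxtaposition trick as in hello.awk.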

Sum numbers from STDIN

There are some other patterns like BEGIN and END, which run before and after processing lines. Here's a simple Awk program for adding all numbers, one per line:

#!/usr/bin/awk -f

BEGIN { x = 0 }
/[0-9]+/ { x += $1 }
END { print x }

Which works like this:

$ seq 10 20 | ./sum.awk
165

Awk has pre-Perl regular expressions, so things like \d don't work. That's another reason why it's better to use something more modern.

Awk's BEGIN { } and END { } blocks are still present in Perl, Ruby, and some other languages.
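
Since \d is not available, [0-9] is the portable way to match a digit. As a rough sketch, assuming we only want to count lines that consist entirely of digits, the pattern can be anchored with ^ and $. The function name sum_numbers and the sample input are my own, not part of the original script.

```shell
# Sketch: sum only the lines that consist entirely of digits.
# The function name sum_numbers and the sample input are hypothetical.
sum_numbers() {
  awk '
    BEGIN { total = 0 }            # runs once, before any input is read
    /^[0-9]+$/ { total += $0 }     # anchored, so "n/a" and "x9" are skipped
    END { print total }            # runs once, after the last line
  '
}
printf '%s\n' 10 20 "n/a" x9 30 | sum_numbers
```

With this input only 10, 20, and 30 pass the pattern, so the total is 60.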

FizzBuzz with regexp

One way to do FizzBuzz is to reuse our regular expressions from episode 7. At first you might think the command block would just do { print "FizzBuzz" } or such, but then all the other blocks would match too (a number divisible by 15 is also divisible by 3 and 5). An easy way around this is to modify the $0 variable (the current line) and print it at the end.

#!/usr/bin/awk -f

/^(([0369]*[147]([258][0369]*[147]|[0369])*([258][0369]*[258]|[147])|[0369]*[258])(([147][0369]*[147]|[258])([258][0369]*[147]|[0369])*([258][0369]*[258]|[147])|([147][0369]*[258]|[0369]?))*(([147][0369]*[147]|[258])([258][0369]*[147]|[0369])*[258][0369]*|[147][0369]*)|([0369]*[147]([258][0369]*[147]|[0369])*[258][0369]*|[0369]*))0$/ { $0="FizzBuzz" }
/^(([0369]*[147]([258][0369]*[147]|[0369])*([258][0369]*[258]|[147])|[0369]*[258])(([147][0369]*[147]|[258])([258][0369]*[147]|[0369])*([258][0369]*[258]|[147])|([147][0369]*[258]|[0369]?))*(([147][0369]*[147]|[258])([258][0369]*[147]|[0369])*([258][0369]*[147]|[0369]?)|([147][0369]*[147]|[258]))|([0369]*[147]([258][0369]*[147]|[0369])*([258][0369]*[147]|[0369]?)|[0369]*[147]))5$/ { $0="FizzBuzz" }
/^.*[05]$/ { $0="Buzz" }
/^(([0369]*[147]([258][0369]*[147]|[0369])*([258][0369]*[258]|[147])|[0369]*[258])(([147][0369]*[147]|[258])([258][0369]*[147]|[0369])*([258][0369]*[258]|[147])|([147][0369]*[258]|[0369]?))*(([147][0369]*[147]|[258])([258][0369]*[147]|[0369])*[258][0369]*|[147][0369]*)|([0369]*[147]([258][0369]*[147]|[0369])*[258][0369]*|[0369]*))$/ { $0="Fizz" }
/./ { print $1 }

To use it:

$ seq 1 20 | ./fizzbuzz.awk
1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
16
17
Fizz
19
Buzz

FizzBuzz

A less ridiculous version would be this:

#!/usr/bin/awk -f

$0 % 15 == 0 { print "FizzBuzz"; next }
$0 % 5 == 0 { print "Buzz"; next }
$0 % 3 == 0 { print "Fizz"; next }
{ print }

Any expression can be used as a pattern. next prevents all other pattern checks for the current line.
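
Here's another small sketch of the same idea, with comparisons as patterns and next to stop at the first rule that fires. The function name classify and its labels are invented for this example.

```shell
# Sketch: patterns can be arbitrary expressions, and next skips the remaining rules.
# The function name classify and its labels are hypothetical.
classify() {
  awk '
    $1 < 0      { print $1, "negative"; next }
    $1 == 0     { print $1, "zero"; next }
    $1 % 2 == 0 { print $1, "even"; next }
                { print $1, "odd" }
  '
}
printf '%s\n' -4 0 7 12 | classify
```

Without the next statements, -4 would be reported by the first rule and then again by the later ones, which is the same problem the regexp FizzBuzz had to work around.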

File output

Awk makes it really easy to print to files. This script sorts the input into odd.txt and even.txt:

#!/usr/bin/awk -f

/[13579]$/ { print >"odd.txt" }
/[02468]$/ { print >"even.txt" }

As in the shell, > means overwrite the file, and >> means append. But while it might look like it will keep reopening and overwriting so you only see the last line, each file will be opened just once:

$ seq 20 30 | ./file_output.awk
$ cat odd.txt
21
23
25
27
29
$ cat even.txt
20
22
24
26
28
30

And print without arguments is the same as print $0.
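
A slightly more defensive sketch of the same script, assuming we want the output directory to be a parameter. The function name split_parity and the variable dir are my own. Note the parentheses around the concatenated file name, which avoid any ambiguity about where the redirection target ends, and the explicit close() calls.

```shell
# Sketch: > truncates each file once per run, later prints go to the open file.
# The function name split_parity and the variable dir are hypothetical.
split_parity() {
  # $1 is the directory that receives odd.txt and even.txt
  awk -v dir="$1" '
    /[13579]$/ { print > (dir "/odd.txt") }
    /[02468]$/ { print > (dir "/even.txt") }
    END {
      # close() flushes and releases the files explicitly
      close(dir "/odd.txt")
      close(dir "/even.txt")
    }
  '
}
tmp=$(mktemp -d)
seq 20 30 | split_parity "$tmp"
cat "$tmp/odd.txt"
rm -rf "$tmp"
```

Running it a second time against the same directory starts both files from scratch, because > truncates on the first write of each run.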

Pipe output

Even nicer, we can do similar redirection with pipes:

#!/usr/bin/awk -f

/[13579]$/ { print | "tac" }

This matches all the lines with odd numbers and sends them to the tac program, which prints them in reverse order.

$ seq 10 30 | ./reverse_odds.awk
29
27
25
23
21
19
17
15
13
11
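
If you need to do something after the piped command has finished, close() on the same command string waits for it. Here's a sketch of that, with sort -rn standing in for tac (which isn't installed everywhere). The function name odds_descending is made up.

```shell
# Sketch: send matching lines to an external command through a pipe.
# The function name odds_descending is hypothetical; sort -rn stands in for tac.
odds_descending() {
  awk '
    BEGIN { cmd = "sort -rn" }
    /[13579]$/ { print | cmd }   # the pipe is opened once and reused
    END {
      close(cmd)                 # wait for sort to finish and flush its output
      print "done"
    }
  '
}
seq 10 20 | odds_descending
```

Keeping the command in a variable makes sure the print and the close() refer to exactly the same string, which is how Awk identifies the pipe.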

Fibonacci

Awk has normal function definitions. There's no distinction between number and string variables. If we put a command block without a pattern, it will match every line.

#!/usr/bin/awk -f

function fib(n) {
  if (n <= 2) {
    return 1;
  } else {
    return fib(n - 1) + fib(n - 2);
  }
}

{ print fib($1) }

Which does:

$ seq 1 20 | ./fib.awk
1
1
2
3
5
8
13
21
34
55
89
144
233
377
610
987
1597
2584
4181
6765
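
The naive recursion gets slow quickly. As a sketch of one possible fix, results can be cached in an associative array, relying on the fact that variables not listed as function parameters are global in Awk. The names fib_memo and memo are my own.

```shell
# Sketch: memoized Fibonacci, caching results in a global associative array.
# The names fib_memo and memo are hypothetical.
fib_memo() {
  awk '
    function fib(n) {
      if (n <= 2) return 1
      if (!(n in memo)) memo[n] = fib(n - 1) + fib(n - 2)
      return memo[n]
    }
    { print fib($1) }
  '
}
printf '%s\n' 10 20 30 | fib_memo
```

The cache survives between input lines, so later calls reuse everything computed for earlier ones.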

Rolling Dice

Awk has some trouble with command line arguments - it normally treats them as files to open. This code only works because we don't actually have any per-line patterns.

#!/usr/bin/awk -f

BEGIN {
  for(i=0; i<ARGV[2]; i++) {
    print int(rand() * ARGV[1]);
  }
}

We can use it to roll 5 100-sided dice:

$ ./dice.awk 100 5
84
39
78
79
91
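
Two things are worth knowing about the script above: it prints values from 0 to 99 rather than 1 to 100, and in several Awk implementations rand() repeats the same sequence on every run unless you call srand() first. Here's a sketch that addresses both and passes parameters with -v instead of ARGV. The function name roll and the variable names sides and count are assumptions of mine.

```shell
# Sketch: pass parameters with -v instead of ARGV and seed the generator.
# The function name roll and the variables sides and count are hypothetical.
roll() {
  # usage: roll SIDES COUNT
  awk -v sides="$1" -v count="$2" 'BEGIN {
    srand()                               # seed from the current time
    for (i = 0; i < count; i++) {
      print int(rand() * sides) + 1       # values from 1 to sides
    }
  }'
}
roll 6 5
```

Since srand() with no argument seeds from the time of day, two runs within the same second can still produce identical rolls.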

Tally

Awk has associative arrays (nowadays usually called hashes or dictionaries).

#!/usr/bin/awk -f

{ tally[$0]++ }

END {
  for(n in tally) {
    print n, tally[n]
  }
}

Awk has no way to print an entire array in one go. If you try to print tally it will give you an error. Associative arrays are another feature of modern programming languages that has its roots in the times of Awk, but is now done in much better ways.

$ ./dice.awk 6 100 | ./tally.awk
2 17
3 18
4 17
5 22
0 13
1 13
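
As the output above shows, for (n in tally) doesn't visit keys in any particular order. One way to get ordered output, sketched below, is to pipe the print statements to sort from inside the END block. The function name tally_sorted is invented for this example.

```shell
# Sketch: tally input lines and print the counts in key order.
# The function name tally_sorted is hypothetical; ordering is delegated to sort.
tally_sorted() {
  awk '
    { tally[$0]++ }
    END {
      for (n in tally) {
        print n, tally[n] | "sort -n"   # for-in order is unspecified, so sort it
      }
      close("sort -n")
    }
  '
}
printf '%s\n' 3 1 3 2 1 3 | tally_sorted
```

This reuses the pipe output feature from earlier, so the script stays self-contained instead of needing a separate sort step in the shell.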

Should you use Awk?

No.

Special purpose languages have their place, but what Awk is doing - processing text files - is no longer "special purpose". Pretty much every modern language excels at processing text files and matching regular expressions, and handles everything Awk does a lot better.

Awk made a lot of sense back when it originated, as C was godawful at text processing, and Unix shell was godawful at writing any kind of structured programs, so Awk was addressing an obvious need. In modern times, when every programmer is familiar with a language like Ruby, Python, Perl, or pretty much anything else that can process text, there's no place for Awk.

The language also definitely shows its age. Its regular expression engine is bad. It doesn't have a console.log equivalent. It can't handle common text formats like CSV or JSON. It doesn't have sufficient Unicode capabilities. And so on. It does quite decently on conciseness, but only if you write exactly the kind of programs it likes. Common requirements like parsing command line arguments will not work too well.

Awk is mainly of historical relevance, but it's not completely dead yet. If you work with a lot of Unix shell scripts, short Awk programs will occasionally be used there. I don't approve of this at all (seriously, just use a real programming language like Ruby, Python, or Perl), but it might be useful to learn the basics of Awk so you can read such shell code.

Code

All code examples for the series will be in this repository.

Code for the Awk episode is available here.