100 Languages Speedrun: Episode 44: RISC-V Assembly
The world of computing is dominated by two instruction set architectures, x86-64 (which I covered in episode 40) and ARM, with all other architectures irrelevant by now.
But a new contender recently emerged: RISC-V. RISC-V attempts to be an Open Source and architecturally neutral ISA. The goal isn't for the hardware itself to be Open Source; it is for any company or researcher to be able to make their own RISC-V-based processor, with whichever extra features and performance trade-offs they need, and to have them be interoperable thanks to the shared RISC-V standard.
So far it's a very distant third, but RISC-V based products keep showing up, especially in the embedded space, and it is obviously making ARM really uncomfortable.
How to run RISC-V code
As you probably don't have any RISC-V computers at hand, the best way is to use QEMU emulation and Docker. If you run docker run -it riscv64/ubuntu, you get a RISC-V environment. At least on OSX, Docker comes with everything preconfigured for it, and it shouldn't be too hard on other systems. This is of course much slower than the real thing would be.
So let's start a new container and install compiler tools on it:
$ docker run -it --platform linux/riscv64 -v $(pwd):/source riscv64/ubuntu
$ apt update
$ apt install -y build-essential
Simplest RISC-V program
Let's write and compile our first RISC-V program. It will just tell the operating system to exit, with exit code 7:
.text
.global _start
_start:
/* Tell the operating system to exit with code 7 */
li a7, 93
li a0, 7
ecall
The x86-64 version we had was very similar:
global start
section .text
start:
; Tell operating system to exit with code 7
mov rax, 0x2000001
mov rdi, 7
syscall
And let's run it:
$ as exit.s -o exit.o
$ ld -static exit.o -o exit
$ ./exit
$ echo $?
7
Step by step:
- these are different assemblers (GNU as vs NASM), different operating systems (Linux vs OSX), and different architectures (RISC-V vs x86-64), but it's so damn similar
- I'm not really sure why start vs _start is the default start symbol for statically linked programs
- ecall (x86-64 syscall) - call operating system function
- .text (or section .text in NASM) - where the code is
- li - load integer; on RISC-V a single instruction can only load a very small number (12-bit), and if you want to load a 64-bit number you need multiple instructions; on x86-64 you can load a number of up to 64 bits in one instruction, as x86-64 supports much more complex instructions
- a7 (x86-64 rax) is where we choose the operating system function to call, in this case exit
- numbers of system calls vary both by operating system and architecture; on x86-64 OSX it's 0x2000001, on RISC-V Linux it's 93
- a0 (x86-64 rdi) is where the first argument goes; exit has only one argument, where we pass 7
- weirdly, assembler comment syntax is different on different architectures, and ; comments don't work on the RISC-V version of GNU as
Hello, World!
Now that we know how to run RISC-V code, let's write a simple program that prints "Hello, World!" to the screen.
It needs to make two system calls:
- write call, passing the file descriptor number (1 for standard output), the address of the "Hello, World!\n" string, and the length of the string, 14
- exit call, passing 0 as exit code to indicate success
.text
.global _start
_start:
/* Tell the operating system to write "Hello, World!\n" to standard output */
li a7, 64
li a0, 1 /* standard output */
lla a1, hello /* address of thing to write */
li a2, 14 /* amount of data to write */
ecall
/* Tell the operating system to exit with code 0 */
li a7, 93
li a0, 0
ecall
.data
hello:
.ascii "Hello, World!\n"
It works just like we'd expect:
$ as hello.s -o hello.o
$ ld -static hello.o -o hello
$ ./hello
Hello, World!
But hang on, how does RISC-V load an address if addresses are 64-bit and it cannot load such big numbers? Let's disassemble it and take a peek:
$ objdump -d hello
hello: file format elf64-littleriscv
Disassembly of section .text:
00000000000100b0 <_start>:
100b0: 04000893 li a7,64
100b4: 00100513 li a0,1
100b8: 00001597 auipc a1,0x1
100bc: 01c58593 addi a1,a1,28 # 110d4 <__DATA_BEGIN__>
100c0: 00e00613 li a2,14
100c4: 00000073 ecall
100c8: 05d00893 li a7,93
100cc: 00000513 li a0,0
100d0: 00000073 ecall
As you can see, each instruction is exactly 32 bits, or 4 bytes. As some of those bits must identify which instruction we want, it's not possible to load a 32-bit number, let alone a 64-bit number, with one instruction.
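To make that bit budget concrete, here is a small Python sketch (not part of the original programs) that pulls apart the first word from the disassembly, 0x04000893, using the I-type field layout from the RISC-V base ISA. The function name decode_i_type is made up for this illustration.

```python
def decode_i_type(word):
    """Split a 32-bit RISC-V I-type instruction word into its fields."""
    opcode = word & 0x7F          # bits 6..0: which kind of instruction
    rd = (word >> 7) & 0x1F       # bits 11..7: destination register number
    funct3 = (word >> 12) & 0x7   # bits 14..12: operation within the opcode
    rs1 = (word >> 15) & 0x1F     # bits 19..15: source register number
    imm = (word >> 20) & 0xFFF    # bits 31..20: the 12-bit immediate
    if imm & 0x800:               # the immediate is signed, so sign-extend it
        imm -= 0x1000
    return {"opcode": opcode, "rd": rd, "funct3": funct3, "rs1": rs1, "imm": imm}

# "li a7,64" is really "addi x17, x0, 64" (a7 is register x17)
print(decode_i_type(0x04000893))
# {'opcode': 19, 'rd': 17, 'funct3': 0, 'rs1': 0, 'imm': 64}
```

Out of 32 bits, 20 go to the opcode and register numbers, which leaves only 12 for the constant itself.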
x86-64 on the other hand can absolutely do that, as instructions on x86-64 have variable width, and some of them can be very long. Apparently the longest valid instruction is 15 bytes, but you're not likely to see many such instructions, and most instructions are a lot shorter. The most common 10-byte instruction loads a 64-bit number into a register (2 bytes to select such an instruction, then 8 bytes of data) - but if the top half of that number is all zeroes or ones, it will be a lot shorter.
x86-64 style variable length instruction encoding tends to use a lot less memory, and due to limited size of CPU's innermost caches, it is generally a lot faster. RISC-V constant length instruction encoding tends to be a lot simpler to implement, but more instructions will be required, and less of a program will fit in a cache, so generally such choices result in poorer performance. Of course in practice a lot of other factors affect performance as well.
Anyway, how does RISC-V load that address? It translated our lla into two instructions:
- auipc a1,0x1 (Add Upper Immediate to Program Counter) checks the current "program counter", adds 0x1 * 4096 to it, and saves the result to a1
- addi a1,a1,28 (add immediate) adds 28 (numbers from -2048 to 2047 fit in the instruction) to a1 and saves the result in a1
- so as a result, we have a1 equal to pc + 4096 + 28, presumably covering the distance between where that instruction was and where "Hello, World!\n" is located in the program
- this pair of instructions can load any address within 32 bits of the PC; more instructions would be needed for 64 bits
- a very similar technique is used for loading 32-bit numbers - first lui to load 4096 * constant, then addi to add the low 12 bits to that
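The arithmetic can be checked with a short Python sketch, using the addresses from the disassembly above. This models the instructions rather than running them, and the helper names sign_extend_12 and split_hi_lo are made up for illustration. One detail it makes visible: because the addi immediate is signed, the upper part has to be rounded up whenever the low 12 bits are 2048 or more.

```python
def sign_extend_12(value):
    """Interpret the low 12 bits of value as a signed number (-2048..2047)."""
    value &= 0xFFF
    return value - 0x1000 if value & 0x800 else value

def split_hi_lo(offset):
    """Split a 32-bit value into (upper 20 bits, signed low 12 bits)."""
    lo = sign_extend_12(offset)
    hi = ((offset - lo) >> 12) & 0xFFFFF
    return hi, lo

# auipc a1,0x1 sits at address 0x100b8 in the disassembly
pc = 0x100B8
a1 = pc + (0x1 << 12)   # auipc a1,0x1
a1 = a1 + 28            # addi a1,a1,28
print(hex(a1))          # 0x110d4, the address objdump shows in its comment

# Going the other way: which immediates does the assembler need?
print(split_hi_lo(0x110D4 - pc))   # (1, 28)
print(split_hi_lo(12345678))       # (3014, 334), for a lui + addi pair
print(split_hi_lo(0x1FFF))         # (2, -1): upper part rounded up, then subtract 1
```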
This kind of relative loading is a pretty decent idea - memory might be huge, but programs tend to be small, so the distance between an instruction and the constant it refers to should generally fit in 32 bits.
This also illustrates one interesting idea RISC-V has - on the surface, it only has constant length instructions, and for simpler implementations that's it, but the idea is that more complex high-performance implementations could look at multiple instructions together, and treat such a common pair like auipc+addi or lui+addi as a single double-length instruction to load a 32-bit number, instead of following each step separately. How well that's going to work in practice is a big unknown.
Loop
The loop is very straightforward:
.text
.global _start
_start:
/* initialize iteration count to 5 */
li s0, 5
loop:
/* print one "Hello, World!\n" */
li a7, 64
li a0, 1 /* standard output */
lla a1, hello /* address of thing to write */
li a2, 14 /* amount of data to write */
ecall
/* subtract 1, check if s0 reached 0 */
addi s0, s0, -1
bnez s0, loop
done:
li a7, 93
li a0, 0
ecall
.data
hello:
.ascii "Hello, World!\n"
We use s0 to store how many loop iterations are remaining.
There are two interesting quirks of RISC-V. First, it doesn't have any "subtract constant" operation; instead it adds a negative number.
And second, there are no "CPU flags" - on x86-64 there was an operation to compare numbers that set some flags, then there was a conditional jump based on those flags. RISC-V doesn't have that design; instead it uses "compare and jump" (with all the comparison flavors) as a single instruction. CPU flags were very easy to implement on simple CPUs, but they really complicate things on modern high performance CPUs, where multiple operations can happen in parallel while the CPU needs to manage the results as if it was executing things one at a time.
Print numbers
On our way to FizzBuzz, we need to be able to print a number. This is a very similar algorithm to the one I wrote for x86-64. We start building the string from the back, one digit at a time. At every iteration we do number % 10, add 48 to get the corresponding ASCII code, and store that in a string. Then we divide number by 10, and repeat the process.
The GNU assembler equivalents of the NASM macros I used, buffer_last_byte: equ $ - 1 and buffer_after: equ $, don't work correctly here; it looks like some linker issue. I didn't investigate it further, and just added an extra operation to add 31 to the buffer address.
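Before the assembly, here is the same back-to-front algorithm as a Python sketch, with comments naming the registers it mirrors. The function name format_number is made up for illustration, and it returns the bytes instead of making a write system call.

```python
def format_number(number):
    """Build b"<digits>\n" from the back of a fixed buffer, like print_number."""
    buffer = bytearray(b"0123456789abcdef0123456789abcdef\n")
    pos = 31       # a1: index of the last character before the newline
    length = 2     # a2: one digit plus the newline
    while True:
        quotient, digit = divmod(number, 10)   # div a4 and rem a5
        buffer[pos] = digit + 48               # sb: store the ASCII digit
        if quotient == 0:                      # beqz a4: no digits left
            break
        number = quotient                      # mv a0, a4
        pos -= 1                               # move back one character
        length += 1                            # one more character to print
    return bytes(buffer[pos:pos + length])

print(format_number(12345678))   # b'12345678\n'
```

Like the assembly version, this sketch only handles non-negative numbers.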
RISC-V
.text
.global _start
print_number:
/* a1 = address of the last character of the buffer (excluding newline) */
/* a2 = number of characters to print (including newline) */
/* a3 = 10 */
lla a1, buffer
addi a1, a1, 31
li a2, 2
li a3, 10
print_number_loop_iteration:
/* split last digit out */
/* a4 = a0 / 10 */
div a4, a0, a3
/* a5 = a0 % 10 */
rem a5, a0, a3
/* store one character at the address */
addi a5, a5, 48
sb a5, (a1)
beqz a4, print_number_loop_done
mv a0, a4 /* a0 = a0/10, that is remove last digit */
addi a1, a1, -1 /* move buffer back one character */
addi a2, a2, 1 /* increase number of characters to print by one */
j print_number_loop_iteration
print_number_loop_done:
/* now we can tell the operating system to print string we built */
li a7, 64
li a0, 1
ecall
/* and return */
ret
_start:
/* set a0 to be argument and call print_number function */
li a0, 12345678
call print_number
/* Tell the operating system to exit with code 0 */
li a7, 93
li a0, 0
ecall
.data
/* 32 characters + \n; initial contents do not matter */
buffer:
.ascii "0123456789abcdef0123456789abcdef\n"
Print numbers with loop
It takes very little extra work to loop all numbers from 1 to 20 instead of printing just one.
.text
.global _start
print_number:
/* a1 = address of the last character of the buffer (excluding newline) */
/* a2 = number of characters to print (including newline) */
/* a3 = 10 */
lla a1, buffer
addi a1, a1, 31
li a2, 2
li a3, 10
print_number_loop_iteration:
/* split last digit out */
/* a4 = a0 / 10 */
div a4, a0, a3
/* a5 = a0 % 10 */
rem a5, a0, a3
/* store one character at the address */
addi a5, a5, 48
sb a5, (a1)
beqz a4, print_number_loop_done
mv a0, a4 /* a0 = a0/10, that is remove last digit */
addi a1, a1, -1 /* move buffer back one character */
addi a2, a2, 1 /* increase number of characters to print by one */
j print_number_loop_iteration
print_number_loop_done:
/* now we can tell the operating system to print string we built */
li a7, 64
li a0, 1
ecall
/* and return */
ret
_start:
/* loop over the numbers from s0 = 1 to s1 = 20, printing each one */
li s0, 1
li s1, 20
loop:
mv a0, s0
call print_number
addi s0, s0, 1
ble s0, s1, loop
/* Tell the operating system to exit with code 0 */
li a7, 93
li a0, 0
ecall
.data
/* 32 characters + \n; initial contents do not matter */
buffer:
.ascii "0123456789abcdef0123456789abcdef\n"
RISC-V has a lot of registers, and by convention some of them (a0-a7) are used to pass arguments to functions, and can be overwritten by the function. Others (s0-s11) should be saved and restored if a function wants to use them. This is just a convention, not any hard requirement, but we follow it here, as we store the data we don't want overwritten in s0 and s1.
FizzBuzz
I then wrote a perfectly fine FizzBuzz program, but the linker really hated it. I think this is a linker bug, as lla is supposed to always use relative addressing. Or the assembler incorrectly informs the linker about this. Or maybe it's supposed to not work, and needs some extra flags; I'm not really sure.
Anyway, it all worked when I changed the command from static to dynamic linking:
$ as fizzbuzz.s -o fizzbuzz.o
$ ld -fPIC -shared fizzbuzz.o -o fizzbuzz
$ ./fizzbuzz
1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
16
17
Fizz
19
Buzz
...
Fizz
97
98
Fizz
Buzz
And here's the program:
.option pic
.text
.global _start
print_number:
/* a1 = address of the last character of the buffer (excluding newline) */
/* a2 = number of characters to print (including newline) */
/* a3 = 10 */
lla a1, .buffer
addi a1, a1, 31
li a2, 2
li a3, 10
print_number_loop_iteration:
/* split last digit out */
/* a4 = a0 / 10 */
div a4, a0, a3
/* a5 = a0 % 10 */
rem a5, a0, a3
/* store one character at the address */
addi a5, a5, 48
sb a5, (a1)
beqz a4, print_number_loop_done
mv a0, a4 /* a0 = a0/10, that is remove last digit */
addi a1, a1, -1 /* move buffer back one character */
addi a2, a2, 1 /* increase number of characters to print by one */
j print_number_loop_iteration
print_number_loop_done:
/* now we can tell the operating system to print string we built */
li a7, 64
li a0, 1
ecall
/* and return */
ret
_start:
/* loop over the numbers from s0 = 1 to s1 = 100; s3 and s5 hold the divisors 3 and 5 */
li s0, 1
li s1, 100
li s3, 3
li s5, 5
loop:
rem a3, s0, s3
rem a5, s0, s5
beqz a3, divides_by_three
beqz a5, divides_by_five_only
divides_neither:
mv a0, s0
call print_number
j continue_loop
divides_by_three:
beqz a5, divides_by_three_and_five
divides_by_three_only:
li a7, 64
li a0, 1
lla a1, .fizz
li a2, 5
ecall
j continue_loop
divides_by_five_only:
li a7, 64
li a0, 1
lla a1, .buzz
li a2, 5
ecall
j continue_loop
divides_by_three_and_five:
li a7, 64
li a0, 1
lla a1, .fizzbuzz
li a2, 9
ecall
continue_loop:
addi s0, s0, 1
ble s0, s1, loop
/* Tell the operating system to exit with code 0 */
li a7, 93
li a0, 0
ecall
.data
/* 32 characters + \n; initial contents do not matter */
.buffer:
.ascii "0123456789abcdef0123456789abcdef\n"
.fizz:
.ascii "Fizz\n"
.buzz:
.ascii "Buzz\n"
.fizzbuzz:
.ascii "FizzBuzz\n"
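The branch structure of the loop is easier to follow in a higher-level form. This Python sketch mirrors it label by label; the function name fizzbuzz_line is made up for illustration, and it returns strings where the assembly makes write system calls.

```python
def fizzbuzz_line(n):
    """Return the line the assembly loop prints for the number n."""
    rem3 = n % 3                  # rem a3, s0, s3
    rem5 = n % 5                  # rem a5, s0, s5
    if rem3 == 0:                 # beqz a3, divides_by_three
        if rem5 == 0:             # beqz a5, divides_by_three_and_five
            return "FizzBuzz\n"
        return "Fizz\n"           # divides_by_three_only
    if rem5 == 0:                 # beqz a5, divides_by_five_only
        return "Buzz\n"
    return f"{n}\n"               # divides_neither: call print_number

output = "".join(fizzbuzz_line(n) for n in range(1, 101))
print(output, end="")
```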
Should you use RISC-V?
As for the hardware, only the future will tell.
As for RISC-V assembly, it's not something you're likely to ever need, even less so than x86-64 assembly, but it's good fun to play with it if you like esoteric languages, and now thanks to Docker and QEMU it's quite easy.
There are rumors that a RISC-V alternative to the ARM-based Raspberry Pi is coming real soon now, so maybe you'll even be able to run your code on real hardware.
Code
All code examples for the series will be in this repository.