100 Languages Speedrun: Episode 46: ARM64 Assembly
There aren't that many relevant CPU architectures these days. x86-64 is dominant on the high-performance devices, ARM is dominant on the low-power devices, and RISC-V is the only serious upcoming challenger. It's been like that for about 20 years now, more or less since the time Intel's Itanium architecture launched and instantly failed.
I did x86-64 (taw.hashnode.dev/100-languages-speedrun-epi..) and RISC-V (taw.hashnode.dev/100-languages-speedrun-epi..), so it's time to complete the set with ARM.
Just like x86 and RISC-V, ARM has both 32-bit and 64-bit versions, and we'll be doing the 64-bit one.
How to run ARM64 in Docker
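Docker Desktop comes with QEMU preinstalled, so it can emulate other architectures, and I'll be using that. On a plain Linux Docker install you may need to set up QEMU user-mode emulation yourself first.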
You might actually have an ARM machine around, like a Raspberry Pi or Apple M1. If you follow along on one of those, you'd need to adjust things slightly. Most Raspberries are 32-bit and only some 64-bit versions started showing up recently. Apple M1 is 64-bit, but OSX uses slightly different system calls than Linux. The code from this episode could work with minor adaptations on either.
So let's start a new container, install compiling tools on it, and we can begin:
$ docker run -it --platform linux/arm64/v8 -v $(pwd):/source arm64v8/ubuntu
$ apt update
$ apt install -y build-essential
Simplest ARM64 program
The simplest program just exits with a numeric exit code. These are normally used to indicate errors, with 0 being success, and various non-zero values indicating failure. Some programs have a complicated mapping of which non-zero value means which kind of issue, others just use the same value for every problem.
To exit, or to do any interaction with the outside world, our program needs to call the operating system. This is how it's done:
.global _start
.text
_start:
/* _exit(7) */
mov x8, 93
mov x0, 7
svc 0
$ as exit.s -o exit.o
$ ld exit.o -o exit
$ ./exit
$ echo $?
7
This program is really similar to the x86-64 and RISC-V versions; it's mostly the opcode and register names that are different.
- .text is the code section
- _start is where the code execution will begin when the program is loaded, we need to mark this symbol as exported with .global _start too
- /* ... */ is for comments, I'm so baffled that assemblers for different architectures use different comment characters
- mov x8, 93 means x8 = 93 - the x8 register is where we pass the function number to the operating system, 93 is the Linux system call number for exit. It's the same on RISC-V Linux, but x86-64 Linux uses 60. On OSX it would be some different number.
- mov x0, 7 means x0 = 7 - the x0 register is where we pass the first argument, in this case the exit code
- svc 0 performs the system call on Linux
Simplest OSX ARM64 program
I don't have one lying around, so I never tried running it, but from the documentation, this is how the same program would look on OSX:
.global _start
.text
_start:
/* _exit(7) */
mov x16, 1
mov x0, 7
svc 0x80
A few things are different:
- the system call number is different, exit is 93 on ARM64 Linux and 1 on ARM64 OSX. Linux numbers vary by architecture too (93 on ARM64 and RISC-V, 60 on x86-64), and x86-64 OSX uses different numbers from ARM64 OSX.
- we pass the operation number in x16 instead of x8
- we use svc 0x80 not svc 0 to call the operating system
Hello, World!
The Hello, World! program also looks similar to the x86-64 and RISC-V versions, but there are some real differences below the surface that we'll get to:
.global _start
.text
_start:
/* write(1, "Hello, World!\n", 14) */
mov x8, 64
mov x0, 1
ldr x1, =hello
mov x2, 14
svc 0
/* _exit(7) */
mov x8, 93
mov x0, 7
svc 0
.data
hello:
.ascii "Hello, World!\n"
Let's run it:
$ as hello.s -o hello.o
$ ld hello.o -o hello
$ ./hello
Hello, World!
- mov x8, 64 means x8 = 64, that's the Linux system call number for the write function
- mov x0, 1 means x0 = 1, that means standard output
- ldr x1, =hello means x1 = address of hello, but there's more here
- mov x2, 14 means x2 = 14, that's the length of the string
- .data is the data section
- hello is a label for where we have the string
- .ascii "Hello, World!\n" is the string itself
Constant pools
Assembly programs don't pass strings and other such objects around, they pass their addresses in memory. On a 64-bit machine, addresses are 64 bits, or 8 bytes. So how do we load an address into a register?
On x86-64 it's super easy - instructions have variable length, so if you need to load a 64-bit address or any other 64-bit number into a register, you can use the mov instruction, and it will be 10 bytes (2 to select instruction, 8 for data), but that's fine.
ARM64 instructions are all 32 bits long. Part of that is needed to select which instruction we use, so an instruction can't contain a 32-bit number, let alone a 64-bit one. So how does that work?
Enough talk, let's take a peek inside with objdump. -d means to disassemble, -s to show contents of each section:
$ objdump -ds ./hello
./hello: file format elf64-littleaarch64
Contents of section .text:
4000b0 080880d2 200080d2 c1000058 c20180d2 .... ......X....
4000c0 010000d4 a80b80d2 e00080d2 010000d4 ................
4000d0 d8004100 00000000 ..A.....
Contents of section .data:
4100d8 48656c6c 6f2c2057 6f726c64 210a Hello, World!.
Disassembly of section .text:
00000000004000b0 <_start>:
4000b0: d2800808 mov x8, #0x40 // #64
4000b4: d2800020 mov x0, #0x1 // #1
4000b8: 580000c1 ldr x1, 4000d0 <_start+0x20>
4000bc: d28001c2 mov x2, #0xe // #14
4000c0: d4000001 svc #0x0
4000c4: d2800ba8 mov x8, #0x5d // #93
4000c8: d28000e0 mov x0, #0x7 // #7
4000cc: d4000001 svc #0x0
4000d0: 004100d8 .word 0x004100d8
4000d4: 00000000 .word 0x00000000
So:
- data ended up at address 0x00000000004100d8 - the only data entry is our string there
- code ended up at address 0x00000000004000b0 - the only function is our _start function
- but there's something weird following the _start function, there's a big number 0x00000000004100d8 (objdump splits it between two lines, but it's a single 64-bit number)
- ldr x1, =hello got translated to ldr x1, 4000d0 - but that decoding is not completely accurate, what's actually in the instruction is the address of what we're loading, relative to the address of the current instruction - and the whole memory might be many GBs, but the constant pool is very close to the function itself, so the small offset that fits in the instruction is generally enough
- so the address of hello isn't anywhere in the code, it's in the constant pool just after the function, and the code loads the address of hello from the constant pool
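To make the mechanism more visible, here's a sketch of the same Hello, World! with the constant pool entry written by hand instead of using ldr x1, =hello. The hello_address label and the .balign 8 line are my own additions for illustration, and the exit code is changed to 0. Treat it as a rough equivalent of what the assembler does for us, not its exact output.

```asm
.global _start
.text
_start:
  /* write(1, "Hello, World!\n", 14) */
  mov x8, 64
  mov x0, 1
  /* PC-relative load of the 64-bit value stored at hello_address */
  ldr x1, hello_address
  mov x2, 14
  svc 0
  /* _exit(0) */
  mov x8, 93
  mov x0, 0
  svc 0
/* hand-made constant pool: one 64-bit entry holding the address of hello */
/* hello_address is a made-up label, the generated pool has no name */
.balign 8
hello_address:
  .quad hello
.data
hello:
  .ascii "Hello, World!\n"
```

It should assemble and link with the same as and ld commands as before, and print the same message.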
Constant pools aren't the only way to load big numbers on ARM64, you can also use up to 4 instructions to load 4 16-bit chunks with movz and movk, but that would generally be slower and take even more space - 4 instructions and 16 bytes instead of 1 instruction and 12 bytes.
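As a sketch of the movk approach, this is how the number 12345678901234 (0x00000b3a73ce2ff2 in hex) could be built in a register without a constant pool. movz sets one 16-bit chunk and clears all the others, movk overwrites one chunk and keeps the others. The top chunk is zero for this number, so three instructions are enough. The comparison against an ldr-loaded copy and the exit code convention are just my assumptions to get a quick self-check.

```asm
.global _start
.text
_start:
  /* build x0 = 0x00000b3a73ce2ff2, 16 bits at a time */
  movz x0, #0x2ff2             /* bits 0-15, everything else cleared */
  movk x0, #0x73ce, lsl #16    /* bits 16-31, other bits kept */
  movk x0, #0x0b3a, lsl #32    /* bits 32-47, other bits kept */
  /* bits 48-63 are already zero, so no fourth instruction */

  /* load the same number the constant pool way, for comparison */
  ldr x1, =12345678901234
  cmp x0, x1
  /* x0 = 0 if both values are equal, 1 otherwise */
  cset x0, ne

  /* _exit(x0) */
  mov x8, 93
  svc 0
```

If the two values match, echo $? after running it should show 0.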
The whole situation is definitely more complicated than on x86-64. Normally the compiler deals with all that complexity for you, and in this case even assembler does some of it for you.
Loop
Loops are very straightforward. Just like x86-64 and unlike RISC-V, ARM64 has a separate flags register, so first we compare with one instruction that sets some flags, then we do a conditional jump with another instruction.
This code keeps the loop counter in x19, it starts at 5, goes down by 1 every iteration, and the loop ends when it reaches 0.
.global _start
.text
_start:
mov x19, 5
loop:
/* write(1, "Hello, World!\n", 14) */
mov x8, 64
mov x0, 1
ldr x1, =hello
mov x2, 14
svc 0
/* x19 = x19 - 1 */
sub x19, x19, 1
/* if x19 != 0 goto loop for another iteration */
cmp x19, 0
b.ne loop
/* _exit(7) */
mov x8, 93
mov x0, 7
svc 0
.data
hello:
.ascii "Hello, World!\n"
It indeed prints the message 5 times:
$ as loop.s -o loop.o
$ ld loop.o -o loop
$ ./loop
Hello, World!
Hello, World!
Hello, World!
Hello, World!
Hello, World!
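The separate cmp isn't strictly needed here. As a sketch of a slightly shorter variant: subs is the flag-setting version of sub, so b.ne can test its result directly. Another option would be cbnz x19, loop after a plain sub, which compares a register with zero and branches in a single instruction without touching the flags. The exit code of 0 is my change, everything else follows the program above.

```asm
.global _start
.text
_start:
  mov x19, 5
loop:
  /* write(1, "Hello, World!\n", 14) */
  mov x8, 64
  mov x0, 1
  ldr x1, =hello
  mov x2, 14
  svc 0
  /* x19 = x19 - 1, and set the flags based on the result */
  subs x19, x19, 1
  /* if the result was not zero, go for another iteration */
  b.ne loop
  /* _exit(0) */
  mov x8, 93
  mov x0, 0
  svc 0
.data
hello:
  .ascii "Hello, World!\n"
```

This should print the message 5 times, same as the version with cmp.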
Print numbers
As usual, the most challenging part is converting numbers to strings. It's the same algorithm - building the string digit by digit starting from the last one. I put some comments all over the code, hopefully it should be clear enough.
.global _start
.text
print_number:
/* start with x1 pointing at last character of the buffer */
/* that's where the digit will go */
/* x2 is total count of characters to print (including newline) */
ldr x1, =buffer
add x1, x1, 31
mov x2, 2
mov x3, 10
print_number_loop:
/* do one digit, shift x0 */
/* x4 = x0/10 */
/* x5 = x0%10 */
sdiv x4, x0, x3
/* ARM doesn't have a modulo instruction, but it has "multiply and subtract" instruction */
msub x5, x4, x3, x0
/* add 48 to convert number to ASCII code, then write to buffer */
add x5, x5, 48
/* strb = SToRe Byte */
/* w5 is bottom 32bits of x5 */
/* it doesn't really matter, as we're only writing the lowest byte */
strb w5, [x1]
/* check if x4 is 0 */
/* if yes, we're done and can print what we built */
/* if not, more digits are coming */
cmp x4, 0
b.eq print_number_loop_done
mov x0, x4
sub x1, x1, 1
add x2, x2, 1
b print_number_loop
print_number_loop_done:
/* output some part of the buffer */
/* write(1, x1, x2) */
mov x8, 64
mov x0, 1
svc 0
ret
_start:
/* load big_number from the constant pool to x0 */
ldr x0, =big_number
/* call print_number */
bl print_number
/* _exit(7) */
mov x8, 93
mov x0, 7
svc 0
.data
big_number = 12345678901234
/* just put some random stuff in the buffer */
/* we'll overwrite it before printing anyway (except final \n) */
buffer:
.ascii "0123456789abcdef0123456789abcdef\n"
That works just as expected:
$ as print_number.s -o print_number.o
$ ld print_number.o -o print_number
$ ./print_number
12345678901234
Let's look inside too:
$ objdump -ds print_number
print_number: file format elf64-littleaarch64
Contents of section .text:
4000b0 01030058 217c0091 420080d2 430180d2 ...X!|..B...C...
4000c0 040cc39a 8580039b a5c00091 25000039 ............%..9
4000d0 9f0000f1 a0000054 e00304aa 210400d1 .......T....!...
4000e0 42040091 f7ffff17 080880d2 200080d2 B........... ...
4000f0 010000d4 c0035fd6 00010058 edffff97 ......_....X....
400100 a80b80d2 e00080d2 010000d4 00000000 ................
400110 20014100 00000000 f22fce73 3a0b0000 .A....../.s:...
Contents of section .data:
410120 30313233 34353637 38396162 63646566 0123456789abcdef
410130 30313233 34353637 38396162 63646566 0123456789abcdef
410140 0a .
Disassembly of section .text:
00000000004000b0 <print_number>:
4000b0: 58000301 ldr x1, 400110 <_start+0x18>
4000b4: 91007c21 add x1, x1, #0x1f
4000b8: d2800042 mov x2, #0x2 // #2
4000bc: d2800143 mov x3, #0xa // #10
00000000004000c0 <print_number_loop>:
4000c0: 9ac30c04 sdiv x4, x0, x3
4000c4: 9b038085 msub x5, x4, x3, x0
4000c8: 9100c0a5 add x5, x5, #0x30
4000cc: 39000025 strb w5, [x1]
4000d0: f100009f cmp x4, #0x0
4000d4: 540000a0 b.eq 4000e8 <print_number_loop_done> // b.none
4000d8: aa0403e0 mov x0, x4
4000dc: d1000421 sub x1, x1, #0x1
4000e0: 91000442 add x2, x2, #0x1
4000e4: 17fffff7 b 4000c0 <print_number_loop>
00000000004000e8 <print_number_loop_done>:
4000e8: d2800808 mov x8, #0x40 // #64
4000ec: d2800020 mov x0, #0x1 // #1
4000f0: d4000001 svc #0x0
4000f4: d65f03c0 ret
00000000004000f8 <_start>:
4000f8: 58000100 ldr x0, 400118 <_start+0x20>
4000fc: 97ffffed bl 4000b0 <print_number>
400100: d2800ba8 mov x8, #0x5d // #93
400104: d28000e0 mov x0, #0x7 // #7
400108: d4000001 svc #0x0
40010c: 00000000 .inst 0x00000000 ; undefined
400110: 00410120 .word 0x00410120
400114: 00000000 .word 0x00000000
400118: 73ce2ff2 .word 0x73ce2ff2
40011c: 00000b3a .word 0x00000b3a
The .data section contains our buffer and nothing else, that makes sense. The instructions generally correspond to what we wrote, but there's also one mystery entry at 40010c, a bunch of extra zeroes. It's there to pad the code to a multiple of 64 bits, so the constant pool starts at an even 64-bit boundary.
Architectures differ in their alignment requirements. Usually "misaligned" data access still works, it's just slower. On some architectures, unaligned data access used to not be supported at all.
I don't think any modern architecture strictly requires alignment for normal data access, but compilers still care about it as it's bad for performance, and occasionally some extra features like atomic memory access or SIMD might require aligned addresses. It's easiest to just align all the things. A few extra zeroes are usually no big deal, and it's done for us automatically.
Print loop
We can print one number, so how about we print a bunch of them? We already know how to loop, so it should be easy. Other than b.le (branch if less or equal), there's nothing new here:
.global _start
.text
print_number:
ldr x1, =buffer
add x1, x1, 31
mov x2, 2
mov x3, 10
print_number_loop:
sdiv x4, x0, x3
msub x5, x4, x3, x0
add x5, x5, 48
strb w5, [x1]
cmp x4, 0
b.eq print_number_loop_done
mov x0, x4
sub x1, x1, 1
add x2, x2, 1
b print_number_loop
print_number_loop_done:
mov x8, 64
mov x0, 1
svc 0
ret
_start:
mov x19, 1
loop:
mov x0, x19
bl print_number
add x19, x19, 1
cmp x19, 20
b.le loop
/* _exit(7) */
mov x8, 93
mov x0, 7
svc 0
.data
/* just put some random stuff in the buffer */
/* we'll overwrite it before printing anyway (except final \n) */
buffer:
.ascii "0123456789abcdef0123456789abcdef\n"
It prints:
$ as print_loop.s -o print_loop.o
$ ld print_loop.o -o print_loop
$ ./print_loop
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
FizzBuzz
And now we're ready to write FizzBuzz! We've seen all the pieces before, now it's time to assemble them.
.global _start
.text
print_number:
ldr x1, =buffer
add x1, x1, 31
mov x2, 2
mov x3, 10
print_number_loop:
sdiv x4, x0, x3
msub x5, x4, x3, x0
add x5, x5, 48
strb w5, [x1]
cmp x4, 0
b.eq print_number_loop_done
mov x0, x4
sub x1, x1, 1
add x2, x2, 1
b print_number_loop
print_number_loop_done:
mov x8, 64
mov x0, 1
svc 0
ret
_start:
mov x19, 1
mov x20, 3
mov x21, 5
loop:
/* x5 = x19 % 3 */
sdiv x4, x19, x20
msub x5, x4, x20, x19
/* is the remainder zero? */
cmp x5, 0
b.eq divides_by_three
does_not_divide_by_three:
/* x5 = x19 % 5 */
sdiv x4, x19, x21
msub x5, x4, x21, x19
/* is the remainder zero? */
cmp x5, 0
b.eq divides_by_five_only
divides_by_neither:
/* print_number(x19) */
mov x0, x19
bl print_number
b continue_loop
divides_by_three:
/* x5 = x19 % 5 */
sdiv x4, x19, x21
msub x5, x4, x21, x19
/* is the remainder zero? */
cmp x5, 0
b.eq divides_by_three_and_five
divides_by_three_only:
/* write(1, "Fizz\n", 5) */
mov x8, 64
mov x0, 1
ldr x1, =fizz
mov x2, 5
svc 0
b continue_loop
divides_by_five_only:
/* write(1, "Buzz\n", 5) */
mov x8, 64
mov x0, 1
ldr x1, =buzz
mov x2, 5
svc 0
b continue_loop
divides_by_three_and_five:
/* write(1, "FizzBuzz\n", 9) */
mov x8, 64
mov x0, 1
ldr x1, =fizzbuzz
mov x2, 9
svc 0
continue_loop:
add x19, x19, 1
cmp x19, 100
b.le loop
/* _exit(7) */
mov x8, 93
mov x0, 7
svc 0
.data
/* just put some random stuff in the buffer */
/* we'll overwrite it before printing anyway (except final \n) */
buffer:
.ascii "0123456789abcdef0123456789abcdef\n"
fizz:
.ascii "Fizz\n"
buzz:
.ascii "Buzz\n"
fizzbuzz:
.ascii "FizzBuzz\n"
Which does the expected thing:
$ as fizzbuzz.s -o fizzbuzz.o
$ ld fizzbuzz.o -o fizzbuzz
$ ./fizzbuzz
1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
16
17
Fizz
19
Buzz
...
Buzz
Fizz
97
98
Fizz
Buzz
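The three write blocks in FizzBuzz differ only in the address and the length, so one possible cleanup is to move the shared part into a small subroutine. This is only a sketch: print_string is a hypothetical helper name, and to keep it short it just prints each of the three strings once instead of running the full loop.

```asm
.global _start
.text
/* print_string: hypothetical helper, not part of the program above */
/* expects x1 = address of the string, x2 = its length */
/* does write(1, x1, x2) */
print_string:
  mov x8, 64
  mov x0, 1
  svc 0
  ret
_start:
  ldr x1, =fizz
  mov x2, 5
  bl print_string
  ldr x1, =buzz
  mov x2, 5
  bl print_string
  ldr x1, =fizzbuzz
  mov x2, 9
  bl print_string
  /* _exit(0) */
  mov x8, 93
  mov x0, 0
  svc 0
.data
fizz:
  .ascii "Fizz\n"
buzz:
  .ascii "Buzz\n"
fizzbuzz:
  .ascii "FizzBuzz\n"
```

In the real FizzBuzz, each of the three branches would then shrink to loading x1 and x2, a bl print_string, and the jump to continue_loop.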
Should you use ARM64 Assembly?
Definitely not for writing any real programs.
Assembly is still a lot of fun to play with. If you want to learn some assembly for fun, and are trying to decide which one to start with, I recommend x86-64.
You probably have an x86-64 computer already. There's orders of magnitude more x86-64 code you might want to decompile than ARM64 code - a lot of software for ARM devices like phones doesn't even come as machine code, it comes in some bytecode format that only gets Just In Time compiled on the device. For x86-64 you have old games you might want to hack, CTF hacking challenges, real software with real security issues, and so on. x86-64 is also significantly more approachable, and it tried its best to keep things as similar as it could over the decades. ARM is a lot more fragmented, with multiple significantly different variants, so what you learn will need a lot more adjustment for a different ARM device.
But once you've played with x86-64 enough, and want another one to try, ARM is the second most likely one you'd have access to. Get a Raspberry Pi, preferably a 64-bit kind, and have a go.
Code
All code examples for the series will be in this repository.