The amazing book *Asynchronous Programming in Rust* by Carl Fredrik Samson has a chapter on implementing stackful coroutines (fibers) in Rust for the `x86_64` architecture, on both Linux and MacOS. But unlike the other chapters, no AArch64 implementation was included, for the sake of simplicity. I tried to port it to AArch64, and this post describes my approach.
This text is just additional material for the book. I presume you are familiar either with it or with the original code at `ch05/c-fibers`.
Please note that inline asm usage in Rust has changed since the book was written. The examples in this article use a modern version of Rust with the `#[unsafe(naked)]` and `naked_asm!` naked-function syntax.
The source code is available in the pull request at https://github.com/PacktPublishing/Asynchronous-Programming-in-Rust/pull/34.
1. AArch64 ABI – how AArch64 does calls
Unlike `x86_64`, ARM’s register bank is fairly uniform: it has 31 general-purpose registers (though some of them have usage-specific aliases). Instructions can also refer to a 32nd register: depending on the particular instruction[^1], it is either `SP` (the stack pointer) or `ZR` (a virtual zero register).
The AArch64 ABI is described in the “Procedure Call Standard for the Arm 64-bit Architecture” document found here: https://github.com/ARM-software/abi-aa/releases.
The chapter “6. The Base Procedure Call Standard” declares:
- The registers `r19`..`r28` are callee-saved. `r29`, aka `FP`, is the frame pointer register, also callee-saved.
- A subroutine call doesn’t push the return address to the stack, but uses the `r30` register, aka `LR`; its value is caller-saved.
- `SP mod 16 == 0`: the stack is always 16-byte aligned.
- `r16` and `r17` are the linker’s scratch registers.
- `r18` is the “Platform Register, if needed; otherwise a temporary”.
- The rest of the registers are caller-saved.
It’s worth checking the MacOS-specific ABI details at https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms. However, there is nothing there specific to our implementation, except confirming that `r18` should be ignored.
The Linux ABI simply refers to the generic AArch64 ABI.
Please note that we ignore floating-point and vector registers here for simplicity, just as the original implementation does.
1.1. Bits of AArch64 assembler
The instruction set used in the fiber engine is quite small.
| x86_64 syntax | AArch64 syntax | Comment |
| --- | --- | --- |
| `mov Target, Source` | `mov Target, Source` | Moving value from register to register |
| `mov Target, [Source + offset]` | `ldr Target, [Source, offset]` | Loading register from memory with offset |
| `mov [Target + offset], Source` | `str Source, [Target, offset]` | Storing register to memory with offset |
| `call Target` | `blr Target` | Subroutine call (but semantics is different) |
| `ret` | `ret` | Return from a subroutine (but semantics is different) |
The instructions for moving, loading and storing values are more or less the same, though the naming is different. Subroutine calls and returns, however, are really different in AArch64.
1.2. Link register LR
When an `x86_64` processor executes a `call` instruction, it pushes a return address (i.e. the address of the next instruction) to the stack. Unlike `x86_64`, AArch64 has a dedicated register for that, `LR`, and the `blr` instruction name means “branch with link to register”.
A `ret` instruction also uses[^2] `LR` as the return address instead of popping it from the stack. It is almost equivalent to the `br LR` instruction (branch to register `LR`), but `ret` gives the processor a hint that this is a return from a `blr` subroutine call, enabling better branch prediction.
But how can we call a nested subroutine? The nested `blr` will overwrite the `LR` value. Well, the `LR` register is caller-saved, so the parent subroutine should store `LR` somewhere before calling any nested one. Usually, it is stored on the `SP` stack. Using a method of careful inspection, we can immediately see that an `x86_64` program with N nested calls will push a return address to the stack N times, and an AArch64 program will store a return address N-1 times[^3]. Not a big deal.
However, while we used the “keep the return address at the stack top and `ret`” trick to spawn a fiber, we cannot use it on AArch64. While memory is still the only place to keep `LR` between the switches, we have to save and restore the `LR` register manually at spawn, and it requires some more assembler code.
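For reference, here is a minimal sketch of how a regular non-leaf AArch64 function preserves `LR` around a nested call. The `outer`/`inner` functions are hypothetical (AArch64-only, like the rest of the asm in this article) and use the same naked-function machinery; a compiler emits essentially the same prologue and epilogue for any non-leaf function.

use std::arch::naked_asm;

#[unsafe(naked)]
unsafe extern "C" fn outer() {
    naked_asm! {
        "stp fp, lr, [sp, -16]!", // push FP and LR as a pair (pre-indexed store)
        "mov fp, sp",             // establish the frame pointer
        "bl {inner}",             // the nested call overwrites LR...
        "ldp fp, lr, [sp], 16",   // ...so pop the saved FP and LR back
        "ret",                    // return to whatever called `outer`
        inner = sym inner,
    };
}

extern "C" fn inner() {}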
2. Running a fiber
The `ThreadContext` is still a sequence of `u64` fields, one field for each register to be saved and restored. Saving and restoring is also straightforward, except that `SP`, being a special register, cannot be stored to or loaded from memory directly and has to be moved to an intermediate register first (and restored from an intermediate register).
#[derive(Debug, Default)]
#[repr(C)]
struct ThreadContext {
    r19: u64, // 00
    r20: u64, // 08
    r21: u64, // 10
    r22: u64, // 18
    r23: u64, // 20
    r24: u64, // 28
    r25: u64, // 30
    r26: u64, // 38
    r27: u64, // 40
    r28: u64, // 48
    fp: u64,  // 50
    lr: u64,  // 58
    sp: u64,  // 60
}
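The hexadecimal comments are the byte offsets that the assembler code below hard-codes. As an optional sanity check (not part of the original code), the layout can be pinned down at compile time with `std::mem::offset_of!`:

// Fails to compile if the struct layout ever drifts away from the offsets
// used in the assembly.
const _: () = {
    assert!(std::mem::offset_of!(ThreadContext, r19) == 0x00);
    assert!(std::mem::offset_of!(ThreadContext, fp) == 0x50);
    assert!(std::mem::offset_of!(ThreadContext, lr) == 0x58);
    assert!(std::mem::offset_of!(ThreadContext, sp) == 0x60);
};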
The `x86_64` version imitated nested calls: `f`, the fiber body, a technical `skip` function and then the `guard` function. Here, we don’t need the `skip` function to be run after `f`; we use the `trampoline` function to be run before `f` instead.
2.1 Spawning a fiber
Launching a fiber is a little more involved. We cannot just arrange the stack to have the return addresses of functions at the proper positions; we have to set `LR` to be equal to `guard` and then jump to the fiber function. The `ThreadContext` only allows us to set the `LR` register; let’s save the needed values to the stack and set the context’s `LR` to a `trampoline` function that does the heavy lifting.
const F_TRAMPOLINE_OFFSET: usize = 0;
const GUARD_TRAMPOLINE_OFFSET: usize = F_TRAMPOLINE_OFFSET + std::mem::size_of::<u64>();

impl Runtime {
    // ...
    pub fn spawn(&mut self, f: fn()) {
        let available = self
            .threads
            .iter_mut()
            .find(|t| t.state == State::Available)
            .expect("no available thread.");
        let size = available.stack.len();
        // prepare stack for the `trampoline`
        let stack_top;
        unsafe {
            let s_end = available.stack.as_mut_ptr().add(size);
            let s_end_aligned = (s_end as usize & !15) as *mut u8;
            stack_top = s_end_aligned.offset(-32);
            std::ptr::write(stack_top.add(F_TRAMPOLINE_OFFSET).cast::<u64>(), f as u64);
            std::ptr::write(stack_top.add(GUARD_TRAMPOLINE_OFFSET).cast::<u64>(), guard as u64);
        }
        available.ctx.lr = trampoline as u64;
        available.ctx.fp = 0;
        available.ctx.sp = stack_top as u64;
        available.state = State::Ready;
    }
}
#[unsafe(naked)]
#[no_mangle]
unsafe extern "C" fn trampoline() {
    // The stack is prepared by the `Runtime::spawn`:
    // sp + 00 function_address
    // sp + 08 guard address
    // sp + 10 .. unused
    naked_asm! {
        "ldr x1, [sp, {f_trampoline_offset}]",
        "ldr lr, [sp, {guard_trampoline_offset}]",
"sub sp, sp, 0x10", // current stack frame is not needed anymore
"br x1",
f_trampoline_offset = const F_TRAMPOLINE_OFFSET,
guard_trampoline_offset = const GUARD_TRAMPOLINE_OFFSET,
};
}
Pass the values to the trampoline with the `r0` and `r1` registers instead of the stack. Does it make any performance difference? Does it make the code cleaner?
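Here is one possible sketch of such a variant (hypothetical code, not taken from the pull request). Passing through `r0`/`r1` literally would require adding them to `ThreadContext` and to `switch`, so this version cheats slightly and reuses two callee-saved registers that `switch` already restores, `x19` and `x20`:

use std::arch::naked_asm;

// In `Runtime::spawn`, instead of writing `f` and `guard` onto the stack:
//     available.ctx.r19 = f as u64;     // the fiber body
//     available.ctx.r20 = guard as u64; // where the body should "return" to
//     available.ctx.lr = trampoline as u64;
//     available.ctx.sp = stack_top as u64;
#[unsafe(naked)]
#[no_mangle]
unsafe extern "C" fn trampoline() {
    // x19 and x20 were restored by `switch` right before it returned here.
    naked_asm! {
        "mov lr, x20", // make `f` return into `guard`
        "br x19",      // tail-jump into the fiber body
    };
}

The stack version keeps `ThreadContext` minimal and the hand-off explicit; the register version avoids touching the fiber’s stack in `spawn` at all.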
2.2 Switching contexts
As explained before, switching contexts is straightforward:
#[unsafe(naked)]
#[no_mangle]
unsafe extern "C" fn switch() {
    naked_asm! {
        // saving the old fiber.
        // TODO we might use `stp` instruction to store a pair of registers at once, but we don't.
        "str x19, [x0, 0x00]",
        "str x20, [x0, 0x08]",
        "str x21, [x0, 0x10]",
        "str x22, [x0, 0x18]",
        "str x23, [x0, 0x20]",
        "str x24, [x0, 0x28]",
        "str x25, [x0, 0x30]",
        "str x26, [x0, 0x38]",
        "str x27, [x0, 0x40]",
        "str x28, [x0, 0x48]",
        "str fp, [x0, 0x50]",
        "str lr, [x0, 0x58]",
        // sp cannot be stored/loaded directly -- use an intermediate register (x2, a caller-saved scratch that is not part of the saved context).
"mov x2, sp",
"str x2, [x0, 0x60]",
// loading the new fiber
// TODO we might use `ldp` instruction to load a pair of registers at once, but we don't.
"ldr x19, [x1, 0x00]",
"ldr x20, [x1, 0x08]",
"ldr x21, [x1, 0x10]",
"ldr x22, [x1, 0x18]",
"ldr x23, [x1, 0x20]",
"ldr x24, [x1, 0x28]",
"ldr x25, [x1, 0x30]",
"ldr x26, [x1, 0x38]",
"ldr x27, [x1, 0x40]",
"ldr x28, [x1, 0x48]",
"ldr fp, [x1, 0x50]",
"ldr lr, [x1, 0x58]",
"ldr x2, [x1, 0x60]",
"mov sp, x2",
"ret",
};
}
Reimplement `switch` using `stp`/`ldp`. Use post-increment mode rather than fixed offsets. If you don’t know AArch64’s post-increment syntax, ask your mom.
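…or peek at the following sketch (hypothetical, not the pull request’s code). The `ThreadContext` layout stays the same, and the post-indexed addressing mode `[x0], 0x10` stores a pair and then advances the pointer by 16 bytes, so `x0` and `x1` simply walk through the struct:

use std::arch::naked_asm;

#[unsafe(naked)]
#[no_mangle]
unsafe extern "C" fn switch() {
    naked_asm! {
        // save the old fiber; x0 walks through its ThreadContext
        "stp x19, x20, [x0], 0x10",
        "stp x21, x22, [x0], 0x10",
        "stp x23, x24, [x0], 0x10",
        "stp x25, x26, [x0], 0x10",
        "stp x27, x28, [x0], 0x10",
        "stp fp, lr, [x0], 0x10",
        "mov x2, sp",
        "str x2, [x0]",             // x0 now points at the `sp` field
        // load the new fiber; x1 walks through its ThreadContext
        "ldp x19, x20, [x1], 0x10",
        "ldp x21, x22, [x1], 0x10",
        "ldp x23, x24, [x1], 0x10",
        "ldp x25, x26, [x1], 0x10",
        "ldp x27, x28, [x1], 0x10",
        "ldp fp, lr, [x1], 0x10",
        "ldr x2, [x1]",             // x1 now points at the `sp` field
        "mov sp, x2",
        "ret",
    };
}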
Porting the example was not that difficult: the only obstacle was setting the `LR` register instead of pushing addresses to the stack. You may find it instructive to port the other examples of Chapter 5!
3. Thoughts on a portable version
A trampoline simplifies creating a version that works on both architectures (with proper `#[cfg]`s). On `x86_64`, a trampoline might look like this:
// Set up in the `Runtime::spawn`:
unsafe {
    // leave room for the return address so that, after `switch`'s `ret`,
    // `f` and `guard` are entered with the standard x86_64 stack alignment
    stack_top = s_end_aligned.sub(2 * std::mem::size_of::<u64>());
    std::ptr::write(stack_top.cast::<u64>(), trampoline as u64);
}
available.ctx.rax = f as u64;
available.ctx.rbx = guard as u64;
available.ctx.sp = stack_top as u64;
// ..
#[unsafe(naked)]
#[no_mangle]
unsafe extern "C" fn trampoline() {
    // The stack and registers are set by the `Runtime::spawn`:
    // 1. rsp % 16 == 8, just like at a normal function entry
    // 2. rax contains f's address
    // 3. rbx contains guard's address
    naked_asm! {
        "push rbx", // rsp is now 16-byte aligned, as required at a call site
        "call rax", // the call pushes the return address, so `f` starts with rsp % 16 == 8
        // after `f` returns, rsp points at the saved rbx, i.e. at guard's address
        "ret", // "returns" into `guard`, which also starts with rsp % 16 == 8
    };
}
Write a version that works on both architectures.
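A minimal sketch of the gating itself (hypothetical structure; the bodies are just the trampolines shown above, with the AArch64 offsets written as literals). `ThreadContext`, `switch` and the stack preparation in `Runtime::spawn` would get the same per-architecture treatment behind a shared interface:

use std::arch::naked_asm;

#[cfg(target_arch = "aarch64")]
#[unsafe(naked)]
unsafe extern "C" fn trampoline() {
    naked_asm! {
        "ldr x1, [sp, 0x00]", // f's address, written by `spawn`
        "ldr lr, [sp, 0x08]", // guard's address, written by `spawn`
        "add sp, sp, 0x20",   // release the area prepared by `spawn`
        "br x1",
    };
}

#[cfg(target_arch = "x86_64")]
#[unsafe(naked)]
unsafe extern "C" fn trampoline() {
    naked_asm! {
        "push rbx", // guard's address, placed in rbx by `spawn`
        "call rax", // f's address, placed in rax by `spawn`
        "ret",
    };
}

#[cfg(not(any(target_arch = "aarch64", target_arch = "x86_64")))]
compile_error!("this fiber example only covers aarch64 and x86_64");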
Disclaimer
This text was written by Ivan Boldyrev. I used AI only for proofreading, but its help was invaluable.
[^1]: Instruction variant, actually! “Add (extended register)” can use `SP` as a destination or first argument, but “Add (shifted register)” allows `ZR`.

[^2]: Unlike `x86_64`, `AArch64`’s `ret` takes an optional register argument which defaults to `LR`.

[^3]: An attentive reader may say that N nested calls require N-1 pushes on `x86_64` and N-2 pushes on `AArch64`; it may be true for normal threads like a `main` function, but for our fiber we use the push-and-ret trick to spawn it, so there is an extra push at the very top of the stack.