benchmark/: three-way RE-tool comparison + first real C-lift

Three small functions plus the first real poll-site function, extracted
from the v1.19 conservative blob with ground-truth C and per-tool
(Ghidra / retdec / decomp.me) docs:
  01_memset        — byte memset, 28 B
  02_memcpy32      — word-aligned memcpy, 36 B
  03_magic_memset  — magic check + tail-call to memset, 40 B
  04_train_phy_block — first real poll-site function (104 B, 26 insts),
                       contains poll sites 12-15

Results in RESULTS.md:
  - Ghidra: A on all four. Auto-decompile is close to final.
  - retdec: A on #3, F on #1 and #2 (no register-arg inference on raw),
    C on #4 (renders the `& 0xF0000000` mask tests as `< 0x10000000`
    comparisons).

GRIND_LOG.md (in 04_train_phy_block/) records the matching-decomp
iteration: 116-byte candidate.c at -Os vs vendor 104 bytes = 89.7%
size match on first real iteration. Remaining gap is GCC's choice of
`cmp w, w_const; b.ls` over vendor's `tst w, #imm; b.eq` for the
mask tests.

gdb_debug/ holds a native-aarch64 GDB single-stepper for the three
benchmark functions — boltzmann smoke test passed (memset:
buf[10] 0x00→0xab).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 07:26:23 +02:00
parent 694be88964
commit 00d655187a
32 changed files with 1113 additions and 0 deletions
@@ -0,0 +1,80 @@
# GRIND_LOG — first real-blob C-lift
Function: **FUN_0000d328** @ blob offset 0xd328 (104 bytes / 26 insts).
Contains 4 of our 16 timeout-less polls (sites 12, 13, 14, 15).
Semantics: **PHY block training step** — poke CTL, wait for two STAT
bits, apply two CFG values with HANDSHAKE acks, ack via CTL.
## Tools tried (single-pass, no iteration yet)
| tool | output file | grade |
|---|---|---|
| Ghidra 11.3 (auto-decompile) | `ghidra.c` | **A.** All 4 polls correctly modeled as `do {} while`. Collapsed the `(base + 0x8000) + offset` arithmetic into a single offset (`lVar1 + 0x8110` etc.) — actually MORE useful than a hand-written reference because it surfaces the absolute register addresses. Type cleanup needed (`undefined4`/`uint`/`long`). |
| retdec v5.0 (zero-touch raw mode) | `retdec.c` | **C.** Recognised the function and the polls but rendered the bitmask tests as arithmetic (`*v6 % 4 == 0` for `& 3`, `< 0x10000000` for `& 0xF0000000`). These forms are equivalent for unsigned values but hide the register-field intent. Fabricated a return value for a void function. Loop bodies marked as `continue ->` comments. Usable as a sanity-check second opinion, not as a basis for rewriting. |
| ground truth (hand-written) | `reference.c` | n/a — this is the canonical interpretation we judge against. |
## Matching-decomp candidate iterations (the actual grind)
Goal: a `.c` file that compiles to bytes close to the original 104-byte
slice. Score = `min(candidate_size, vendor_size) / max(candidate_size, vendor_size)`
after instruction-by-instruction diff (manual until objdiff is installed).
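The score formula above is simple enough to pin down in a few lines of C. This is a minimal sketch, not part of the repo; `size_match_score` is a hypothetical helper name, and it compares sizes only, so it says nothing about which instructions differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not in the repo): size-only match score,
 * min(candidate, vendor) / max(candidate, vendor), in the range (0, 1]. */
static double size_match_score(size_t candidate_size, size_t vendor_size)
{
    assert(candidate_size > 0 && vendor_size > 0);
    size_t lo = candidate_size < vendor_size ? candidate_size : vendor_size;
    size_t hi = candidate_size < vendor_size ? vendor_size : candidate_size;
    return (double)lo / (double)hi;
}
```

For a 116-byte candidate against the 104-byte vendor slice this evaluates to 104/116, roughly 0.897.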
### Iteration 1: cast-on-each-access, `-O2`
- Pattern: `*(volatile u32 *)(base + offset)` per access.
- GCC behavior: materialised each `0x8XXX` offset into its own register
(`mov x2, #0x8120; add x2, x3, x2; ldr w0, [x2]`), exploding code size.
- Result: ~160 bytes. **~65% size match (104/160). Bad.**
### Iteration 2 (current best): pre-adjust base outside volatile chain, `-Os`
- Pattern: `unsigned char *phy = base + 0x8000` once, then `*(u32v *)(phy + small)`.
- `-Os` instead of `-O2` — drops loop-alignment NOPs.
- Result: **116 bytes (29 insts)**. **89.7% size match (104/116).** See `candidate.c`.
### Remaining gap to vendor (12 bytes = 3 instructions)
1. GCC turns `(x & 0xF0000000) == 0` into `cmp w, w_loaded_const; b.ls`
instead of vendor's `tst w, #imm; b.eq`. Costs 4 bytes per loop, twice
= 8 bytes.
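2. GCC's `[base+0x184]` accesses inside the handshake loop are
`add x1, x0, #0x200; ldur x2, [x1, #-124]`. The cause is the 64-bit
(`u64v`) read in `candidate.c`: 0x184 is not a multiple of 8, so it
cannot be encoded as a scaled `ldr x2, [x0, #imm]`, and GCC falls back
to an adjusted base plus `ldur`. Vendor does a 32-bit
`ldr w1, [x0, #0x184]`. Costs ~4 bytes.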
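Both gap items can be sanity-checked with plain C arithmetic. The sketch below is an illustration under stated assumptions, not code from the repo: the helper names are made up, and the second function models only the scaled unsigned-offset form of LDR (12-bit offset field, scaled by the access size).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Item 1: for an unsigned 32-bit x, "top nibble is clear" can be written
 * as a mask test or as an unsigned compare. A compiler is free to pick
 * the compare form (cmp + b.ls) over the mask form (tst + b.eq). */
static bool top_nibble_clear_mask(uint32_t x) { return (x & 0xF0000000u) == 0u; }
static bool top_nibble_clear_cmp(uint32_t x)  { return x <= 0x0FFFFFFFu; }

/* Item 2 (hypothetical helper): can `ldr Rt, [Rn, #offset]` use the
 * scaled unsigned-immediate form for a 4- or 8-byte access? The offset
 * must be a multiple of the access size and fit in 12 bits after scaling. */
static bool ldr_uimm_encodable(uint32_t offset, unsigned access_bytes)
{
    assert(access_bytes == 4 || access_bytes == 8);
    return offset % access_bytes == 0 && offset / access_bytes <= 0xFFFu;
}
```

With these definitions, `ldr_uimm_encodable(0x184, 4)` holds and `ldr_uimm_encodable(0x184, 8)` does not, which is consistent with an extra `add` + `ldur` appearing whenever the handshake register is read through a 64-bit pointer.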
### Next iteration ideas
- **Inline-asm** for the mask-tests to force TST encoding directly. Cheap
win, gets us to ~108 bytes.
- **Clang** (different scheduler, sometimes nicer with TST-style
comparisons). Try `clang -Oz -ffreestanding -target aarch64-none-elf`.
- **ARMCC** — the most likely vendor compiler. Sourcing armclang for
AArch64 requires an Arm Developer account; backlog item.
- **objdiff** — once installed, automate the byte-diff scoring instead
of eyeballing.
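The inline-asm idea could look roughly like the following. This is an untested sketch: `wait_top_nibble_set` is a hypothetical name, the AArch64 path relies on the GCC/Clang `asm goto` extension, and nothing here has been measured against the 104-byte target. On other hosts it falls back to the plain C loop so the file still builds.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: spin until bits[31:28] of *reg are non-zero.
 * On AArch64 the test and branch are written by hand so the compiler
 * cannot rewrite the mask test as a compare. */
static void wait_top_nibble_set(volatile uint32_t *reg)
{
    assert(reg != 0);
#if defined(__aarch64__) && defined(__GNUC__)
    uint32_t v;
again:
    v = *reg;                               /* volatile 32-bit load */
    __asm__ goto("tst %w0, #0xf0000000\n\t" /* set flags from v & mask */
                 "b.eq %l[again]"           /* loop while the field is 0 */
                 : /* no outputs */
                 : "r"(v)
                 : "cc"
                 : again);
#else
    while ((*reg & 0xF0000000u) == 0u)
        ;
#endif
}
```

Register allocation and instruction placement around the asm statement remain the compiler's choice, so this forces the TST encoding but does not by itself guarantee a byte match.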
## Workflow validation
- ✓ Function extracted from blob as standalone .bin slice.
- ✓ Three decompiler views captured (Ghidra, retdec, hand-written reference).
- ✓ Candidate compiles + runs (matches reference semantics).
- ✓ Single-pass byte-comparison done by hand; got 89.7% on iteration 2.
- ✗ objdiff not installed — would automate the scoring.
- ✗ decomp.me self-host not yet running on pve4 — would crowdsource the
grind via the standard interface.
- ✗ ARMCC not installed — perfect-match unattainable without it.
**The pipeline works.** Each future poll-site function follows the
same 4-step recipe: extract → Ghidra-clean → write candidate → iterate
until ≥90 % match. Estimated ~2-3 h per function for the small ones.
## How this connects to the v3fb work
This function contains 4 of the 16 poll sites. Once we have a
byte-matching (or functionally-equivalent) C version, we can:
1. Add bounded-retry counters in the C source — much cleaner than the
asm trampoline patcher.
2. Compile + link as a freestanding `.o` at the original blob offset.
3. Splice into the blob, replacing `FUN_0000d328` entirely.
That's the path to a maintainable replacement for the trampoline-based
v3fb approach, **for at least these 4 sites**. The other 12 sites live
in different functions and would each need their own lift.
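A bounded-retry poll in C might look like the sketch below. It is an assumption about what step 1 could become, not the planned implementation: `poll_bounded`, its parameters, and the return convention are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: poll a 32-bit register until (value & mask) is
 * non-zero (want_set != 0) or zero (want_set == 0), giving up after
 * max_polls reads. Returns 0 on success, -1 if the budget ran out. */
static int poll_bounded(volatile uint32_t *reg, uint32_t mask,
                        int want_set, uint32_t max_polls)
{
    assert(reg != 0);
    for (uint32_t i = 0; i < max_polls; i++) {
        int is_set = (*reg & mask) != 0u;
        if (is_set == (want_set != 0))
            return 0;
    }
    return -1;
}
```

Site 12 would then read something like `if (poll_bounded(stat_a, 0xF0000000u, 1, budget) != 0) return -1;`. The right budget per site and what the caller does on timeout are not determined by anything in this log.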
@@ -0,0 +1,36 @@
/* Best matching candidate so far for FUN_0000d328.
* Compile: gcc -Os -ffreestanding -nostdlib -c candidate.c -o candidate.o
* Score: 116 bytes vs vendor 104 bytes (89.7% size match, 12 bytes / 3 insts over).
*
* Remaining gap vs vendor:
* - GCC emits `cmp w, w_loaded_const ; b.ls` for `(x & 0xF0000000) == 0`
* instead of vendor's `tst w, #0xF0000000 ; b.eq` (both 12 bytes, but
* vendor avoids materializing the mask in a register, saving 4 bytes
* per loop, twice = 8 bytes).
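* - GCC emits `add x1, x0, #0x200 ; ldur x2, [x1, #-124]` for the
* `[base+0x184]` accesses inside the handshake loop, vs vendor's
* direct `ldr w1, [x0, #0x184]`. Costs us ~4 bytes. Cause: the u64v
* (64-bit) reads below. 0x184 is not 8-byte aligned, so no scaled
* `ldr x` encoding exists for it; vendor loads 32 bits instead.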
*
* Next iterations to try:
* 1. Inline-asm for the mask-tests to force TST encoding.
* 2. `__builtin_expect((x & 0xF0000000) != 0, 0)` to hint loop direction.
* 3. Alternative compilers: clang, ARMCC (the latter is the most likely
* vendor compiler, though unconfirmed; need to source it).
*/
typedef volatile unsigned int u32v;
typedef volatile unsigned long u64v;
void train_phy_block(unsigned long ctx)
{
unsigned char *phy = (unsigned char *)(*(unsigned long *)(ctx + 0xb8) + 0x8000);
*(u32v *)(phy + 0x110) = 0xf000f000u;
while ((*(u32v *)(phy + 0x118) & 0xf0000000u) == 0u) ;
while ((*(u32v *)(phy + 0x120) & 0xf0000000u) == 0u) ;
*(u32v *)(phy + 0x160) = 0x30003u;
*(u32v *)(phy + 0x154) = 0x30003u;
while ((*(u64v *)(phy + 0x184) & 3ul) == 0ul) ;
*(u32v *)(phy + 0x154) = 0x30000u;
while ((*(u64v *)(phy + 0x184) & 3ul) != 0ul) ;
*(u32v *)(phy + 0x160) = 0x30000u;
*(u32v *)(phy + 0x110) = 0xf0000000u;
}
@@ -0,0 +1,71 @@
# decomp.me recipe — 04_train_phy_block
This is the **first real-blob function we're lifting to byte-matching C.**
Score target: ≥95% match. Perfect match unlikely (compiler unknown).
## Target asm (paste into "Target asm" field)
```asm
train_phy_block:
ldr x0, [x0, #0xb8]
mov w1, #0xf000f000
add x0, x0, #0x8000
str w1, [x0, #0x110]
.Lwait_a:
ldr w1, [x0, #0x118]
tst w1, #0xf0000000
b.eq .Lwait_a
.Lwait_b:
ldr w1, [x0, #0x120]
tst w1, #0xf0000000
b.eq .Lwait_b
mov w1, #0x30003
str w1, [x0, #0x160]
str w1, [x0, #0x154]
.Lwait_hs1:
ldr w1, [x0, #0x184]
tst x1, #0x3
b.eq .Lwait_hs1
mov w1, #0x30000
str w1, [x0, #0x154]
.Lwait_hs2:
ldr w1, [x0, #0x184]
tst x1, #0x3
b.ne .Lwait_hs2
mov w1, #0x30000
str w1, [x0, #0x160]
mov w1, #0xf0000000
str w1, [x0, #0x110]
ret
```
## Compiler
`aarch64-linux-gnu gcc 12 -O2 -ffreestanding -nostdlib`
(Try also `-Os`. Vendor blob's compiler unknown — could be ARMCC or older
GCC. Optimal C may differ between targets; perfect byte-match probably
unattainable.)
## Context
Use `reference.c` as the starting C. The register-width mismatch in the
two handshake polls (`tst x1, #0x3` tests the 64-bit register even though
the value was loaded with a 32-bit `ldr w1`, a vendor quirk) suggests a
particular intrinsic / pattern. May need to write the load as
`(uint64_t)mmio_r(...)` and the test as a 64-bit AND to coax GCC into
emitting `tst x1` instead of `tst w1`.
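The widened-load idea can be sketched as follows. `mmio_r` is copied from `reference.c`; `handshake_busy` is a hypothetical wrapper, and whether this actually changes the register width GCC picks for the TST is untested here.

```c
#include <assert.h>
#include <stdint.h>

static inline uint32_t mmio_r(volatile uint8_t *base, unsigned off) {
    return *(volatile uint32_t *)(base + off);
}

/* Hypothetical wrapper: 32-bit load of HANDSHAKE (+0x184), zero-extended
 * to 64 bits, then tested with a 64-bit AND against bits[1:0]. */
static int handshake_busy(volatile uint8_t *phy)
{
    assert(phy != 0);
    uint64_t v = (uint64_t)mmio_r(phy, 0x184);
    return (v & UINT64_C(0x3)) != 0;
}
```

Unlike a 64-bit pointer dereference at +0x184, this keeps the memory access itself 32 bits wide, so only the flag-setting instruction is affected.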
## Things to iterate on
- Order of writes to CFG_A vs CFG_B: vendor wrote CFG_B first
(`str w1, [x0, #0x160]` then `str w1, [x0, #0x154]`). C order matters.
- The two `mov w1, #0x30000` near the end could be hoisted by GCC; vendor
emitted them inline. May need separate variables to prevent hoist.
- `add x0, x0, #0x8000` vs `add x0, x0, #0x8, lsl #12` — same
instruction, GAS picks one. Either should round-trip.
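The claim that both `add` spellings are one instruction can be checked by decoding the word from the disassembly (`91402000` at 0xd330). The helper below is a hypothetical sketch that handles only the add/sub (immediate) class: `imm12` sits in bits 21:10 and the shift flag in bit 22.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: effective immediate of an AArch64 add/sub
 * (immediate) instruction word. Bits 28:23 must be 0b100010. */
static uint64_t addsub_imm_value(uint32_t insn)
{
    assert(((insn >> 23) & 0x3Fu) == 0x22u); /* add/sub immediate class */
    uint32_t imm12 = (insn >> 10) & 0xFFFu;  /* unshifted immediate */
    uint32_t sh = (insn >> 22) & 1u;         /* 1 = shift left by 12 */
    return (uint64_t)imm12 << (sh ? 12 : 0);
}
```

For `0x91402000` the fields are `imm12 = 0x8` and `sh = 1`, so the effective immediate is 0x8000 whichever way the assembler prints it.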
## Score expectations
- 80%: rough loop structure + register usage matches.
- 95%: instruction order + immediate forms match.
- 100%: would require exact compiler/version match. Unlikely without
ARMCC.
Binary file not shown.
@@ -0,0 +1,33 @@
func.bin: file format binary
Disassembly of section .data:
000000000000d328 <.data>:
d328: f9405c00 ldr x0, [x0, #184]
d32c: 32048fe1 mov w1, #0xf000f000 // #-268374016
d330: 91402000 add x0, x0, #0x8, lsl #12
d334: b9011001 str w1, [x0, #272]
d338: b9411801 ldr w1, [x0, #280]
d33c: 72040c3f tst w1, #0xf0000000
d340: 54ffffc0 b.eq 0xd338 // b.none
d344: b9412001 ldr w1, [x0, #288]
d348: 72040c3f tst w1, #0xf0000000
d34c: 54ffffc0 b.eq 0xd344 // b.none
d350: 320087e1 mov w1, #0x30003 // #196611
d354: b9016001 str w1, [x0, #352]
d358: b9015401 str w1, [x0, #340]
d35c: b9418401 ldr w1, [x0, #388]
d360: f240043f tst x1, #0x3
d364: 54ffffc0 b.eq 0xd35c // b.none
d368: 52a00061 mov w1, #0x30000 // #196608
d36c: b9015401 str w1, [x0, #340]
d370: b9418401 ldr w1, [x0, #388]
d374: f240043f tst x1, #0x3
d378: 54ffffc1 b.ne 0xd370 // b.any
d37c: 52a00061 mov w1, #0x30000 // #196608
d380: b9016001 str w1, [x0, #352]
d384: 52be0001 mov w1, #0xf0000000 // #-268435456
d388: b9011001 str w1, [x0, #272]
d38c: d65f03c0 ret
@@ -0,0 +1,18 @@
/* Ghidra 11.3 default decompiler output for FUN_0000d328 — unmodified. */
void FUN_0000d328(long param_1)
{
long lVar1;
lVar1 = *(long *)(param_1 + 0xb8);
*(undefined4 *)(lVar1 + 0x8110) = 0xf000f000;
do { } while ((*(uint *)(lVar1 + 0x8118) & 0xf0000000) == 0);
do { } while ((*(uint *)(lVar1 + 0x8120) & 0xf0000000) == 0);
*(undefined4 *)(lVar1 + 0x8160) = 0x30003;
*(undefined4 *)(lVar1 + 0x8154) = 0x30003;
do { } while ((*(uint *)(lVar1 + 0x8184) & 3) == 0);
*(undefined4 *)(lVar1 + 0x8154) = 0x30000;
do { } while ((*(uint *)(lVar1 + 0x8184) & 3) != 0);
*(undefined4 *)(lVar1 + 0x8160) = 0x30000;
*(undefined4 *)(lVar1 + 0x8110) = 0xf0000000;
return;
}
@@ -0,0 +1,89 @@
/* Ground-truth C for FUN_0000d328 @ blob offset 0xd328 (104 bytes / 26 insts).
*
* **The first real poll-site function we lift to C.**
* Contains 4 of our 16 timeout-less polls (sites 12, 13, 14, 15).
*
* Pattern: PHY-block training step — poke a control register, wait for
* two status bits, apply two intermediate values with a
* handshake on a state register, ack the event.
*
* Signature: void train_phy_block(struct phy_ctx *ctx);
* (X0 = ctx, returns void)
*
* Layout:
* ctx (X0) — opaque per-rank/per-channel context
* ctx + 0xb8 — 64-bit pointer to a PHY block base (ctx->block below)
* block + 0x8000 — addressed sub-block (likely "Master" bank in DWC PUB)
*
* The sub-block at +0x8000 has these registers (offsets within +0x8000):
* +0x110 CTL — write 0xF000F000 to start, 0xF0000000 to clear
* +0x118 STAT_A — bits[31:28] non-zero = step A done
* +0x120 STAT_B — bits[31:28] non-zero = step B done
* +0x154 CFG_A — write training value
* +0x160 CFG_B — write training value
* +0x184 HANDSHAKE — bits[1:0] toggle between 0 and !=0 to ack writes
*
* The 4 polls (in order):
* site 12 (B.EQ): STAT_A bits[31:28] non-zero?
* site 13 (B.EQ): STAT_B bits[31:28] non-zero?
* site 14 (B.EQ): HANDSHAKE bits[1:0] non-zero? (ack of step-1 writes)
* site 15 (B.NE): HANDSHAKE bits[1:0] zero? (ack of step-2 write)
*/
#include <stdint.h>
struct phy_ctx {
uint8_t pad[0xB8];
uint8_t *block; /* base pointer used at +0xB8 in struct */
/* ... rest of struct unknown */
};
#define PHY_CTL 0x110
#define PHY_STAT_A 0x118
#define PHY_STAT_B 0x120
#define PHY_CFG_A 0x154
#define PHY_CFG_B 0x160
#define PHY_HANDSHAKE 0x184
#define PHY_CTL_GO 0xF000F000U
#define PHY_CTL_CLR 0xF0000000U
#define PHY_STAT_DONE 0xF0000000U
#define PHY_CFG_VAL_RUN 0x00030003U
#define PHY_CFG_VAL_END 0x00030000U
#define PHY_HS_BUSY 0x3U
static inline uint32_t mmio_r(volatile uint8_t *base, unsigned off) {
return *(volatile uint32_t *)(base + off);
}
static inline void mmio_w(volatile uint8_t *base, unsigned off, uint32_t v) {
*(volatile uint32_t *)(base + off) = v;
}
void train_phy_block(struct phy_ctx *ctx) {
volatile uint8_t *phy = (volatile uint8_t *)(ctx->block + 0x8000);
mmio_w(phy, PHY_CTL, PHY_CTL_GO);
/* site 12 — wait for step A complete */
while ((mmio_r(phy, PHY_STAT_A) & PHY_STAT_DONE) == 0)
;
/* site 13 — wait for step B complete */
while ((mmio_r(phy, PHY_STAT_B) & PHY_STAT_DONE) == 0)
;
mmio_w(phy, PHY_CFG_B, PHY_CFG_VAL_RUN);
mmio_w(phy, PHY_CFG_A, PHY_CFG_VAL_RUN);
/* site 14 — wait for handshake to assert */
while ((mmio_r(phy, PHY_HANDSHAKE) & PHY_HS_BUSY) == 0)
;
mmio_w(phy, PHY_CFG_A, PHY_CFG_VAL_END);
/* site 15 — wait for handshake to deassert */
while ((mmio_r(phy, PHY_HANDSHAKE) & PHY_HS_BUSY) != 0)
;
mmio_w(phy, PHY_CFG_B, PHY_CFG_VAL_END);
mmio_w(phy, PHY_CTL, PHY_CTL_CLR);
}