benchmark/: three-way RE-tool comparison + first real C-lift

Three small functions extracted from the v1.19 conservative blob, plus the
first real poll-site function, each with ground-truth C and per-tool
(Ghidra / retdec / decomp.me) docs:

  01_memset — byte memset, 28 B
  02_memcpy32 — word-aligned memcpy, 36 B
  03_magic_memset — magic check + tail-call to memset, 40 B
  04_train_phy_block — first real poll-site function (104 B, 26 insts),
    contains poll sites 12-15

Results in RESULTS.md:
- Ghidra: A on all four. Auto-decompile is close to final.
- retdec: A on #3, F on #1 and #2 (no register-arg inference on raw),
  C on #4 (emits `< 0x10000000` where the code tests `& 0xF0000000`).

GRIND_LOG.md (in 04_train_phy_block/) records the matching-decomp
iteration: 116-byte candidate.c at -Os vs vendor 104 bytes = 89.7% size
match on first real iteration. Remaining gap is GCC's choice of
`cmp w, w_const; b.ls` over vendor's `tst w, #imm; b.eq` for the mask tests.

gdb_debug/ holds a native-aarch64 GDB single-stepper for the three
benchmark functions — boltzmann smoke test passed (memset: buf[10] 0x00→0xab).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 07:26:23 +02:00
parent 694be88964
commit 00d655187a
32 changed files with 1113 additions and 0 deletions
@@ -0,0 +1,80 @@
+# GRIND_LOG — first real-blob C-lift
+
+Function: **FUN_0000d328** @ blob offset 0xd328 (104 bytes / 26 insts).
+Contains 4 of our 16 timeout-less polls (sites 12, 13, 14, 15).
+Semantics: **PHY block training step** — poke CTL, wait for two STAT
+bits, apply two CFG values with HANDSHAKE acks, ack via CTL.
+
+## Tools tried (single-pass, no iteration yet)
+
+| tool | output file | grade |
+|---|---|---|
+| Ghidra 11.3 (auto-decompile) | `ghidra.c` | **A.** All 4 polls correctly modeled as `do {} while`. Collapsed the `(base + 0x8000) + offset` arithmetic into a single offset (`lVar1 + 0x8110` etc.) — actually MORE useful than a hand-written reference because it surfaces the absolute register addresses. Type cleanup needed (`undefined4`/`uint`/`long`). |
+| retdec v5.0 (zero-touch raw mode) | `retdec.c` | **C.** Recognised the function and the polls, but rendered the bitmask tests as arithmetic comparisons (`*v6 % 4 == 0` for `(x & 3) == 0`, `< 0x10000000` for `(x & 0xF0000000) == 0`). These match the mask tests only for unsigned operands and hide the bit-field intent. Fabricated a return value for a void function. Loop bodies marked as `continue ->` comments. Usable as a sanity-check second opinion, not as a basis for rewriting. |
+| ground truth (hand-written) | `reference.c` | n/a — this is the canonical interpretation we judge against. |
+
+## Matching-decomp candidate iterations (the actual grind)
+
+Goal: a `.c` file that compiles to bytes close to the original 104-byte
+slice. Score = `min(candidate_size, vendor_size) / max(candidate_size, vendor_size)`
+after instruction-by-instruction diff (manual until objdiff is installed).
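
The size part of the score can be sketched as a small helper. This is only an illustration: `size_match` is a hypothetical name, not something in the repo, and it covers the byte-count ratio only, not the instruction-by-instruction diff.

```c
#include <assert.h>

/* Size-match score: smaller byte count over larger, so 1.0 means equal size.
 * Compares sizes only; it does not look at the instructions themselves.
 * (Hypothetical helper, named for illustration.) */
static double size_match(unsigned candidate_size, unsigned vendor_size)
{
    assert(candidate_size > 0 && vendor_size > 0);
    unsigned lo = candidate_size < vendor_size ? candidate_size : vendor_size;
    unsigned hi = candidate_size < vendor_size ? vendor_size : candidate_size;
    return (double)lo / (double)hi;
}
```

For a 116-byte candidate against the 104-byte vendor slice, `size_match(116, 104)` evaluates to 104/116, roughly 0.897.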
+
+### Iteration 1: cast-on-each-access, `-O2`
+- Pattern: `*(volatile u32 *)(base + offset)` per access.
+- GCC behavior: materialised each `0x8XXX` offset into its own register
+  (`mov x2, #0x8120; add x2, x3, x2; ldr w0, [x2]`), exploding code size.
+- Result: ~160 bytes, about 65% by the size formula above (104/160). **Bad.**
+
+### Iteration 2 (current best): pre-adjust base outside volatile chain, `-Os`
+- Pattern: `unsigned char *phy = base + 0x8000` once, then `*(u32v *)(phy + small)`.
+- `-Os` instead of `-O2` — drops loop-alignment NOPs.
+- Result: **116 bytes (29 insts)**. **89.7% size match** (104/116). See `candidate.c`.
+
+### Remaining gap to vendor (12 bytes = 3 instructions)
+
+1. GCC turns `(x & 0xF0000000) == 0` into `cmp w, w_loaded_const; b.ls`
+   instead of vendor's `tst w, #imm; b.eq`. Costs 4 bytes per loop, twice
+   = 8 bytes.
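+2. GCC's `[base+0x184]` accesses inside the handshake loop are
+   `add x1, x0, #0x200; ldur x2, [x1, #-124]`. `candidate.c` reads
+   HANDSHAKE through a 64-bit pointer, and 0x184 is neither a multiple
+   of 8 (needed by the scaled `ldr x` form) nor within LDUR's -256..255
+   range, so GCC needs the extra `add`. The vendor loads 32 bits
+   (`ldr w1, [x0, #0x184]`). Costs ~4 bytes.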
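
Why GCC is free to pick `cmp`/`b.ls` for item 1: on an unsigned 32-bit value, "top nibble is zero" and "value is at most 0x0FFFFFFF" are the same predicate. The sketch below checks that on a few boundary values. The function names are made up for illustration, and the constant 0x0FFFFFFF is an assumption about what GCC loads; the log does not show the GCC listing.

```c
#include <assert.h>
#include <stdint.h>

/* Form written in candidate.c; the vendor encodes it as `tst w, #imm; b.eq`. */
static int top_clear_mask(uint32_t x) { return (x & 0xF0000000u) == 0u; }

/* Same predicate as an unsigned range check, the shape behind `cmp; b.ls`.
 * The constant is an assumption, see the note above. */
static int top_clear_cmp(uint32_t x) { return x <= 0x0FFFFFFFu; }

/* Assert that both forms agree on boundary values around the top nibble. */
static void check_mask_cmp_equivalence(void)
{
    static const uint32_t probes[] = {
        0x00000000u, 0x00000001u, 0x0FFFFFFFu, 0x10000000u,
        0x7FFFFFFFu, 0x80000000u, 0xF0000000u, 0xFFFFFFFFu
    };
    for (unsigned i = 0; i < sizeof probes / sizeof probes[0]; i++)
        assert(top_clear_mask(probes[i]) == top_clear_cmp(probes[i]));
}
```

Both forms are correct C for the poll; they differ only in which encoding the compiler reaches for.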
+
+### Next iteration ideas
+
+- **Inline-asm** for the mask-tests to force TST encoding directly. Cheap
+  win, gets us to ~108 bytes.
+- **Clang** (different scheduler, sometimes nicer with TST-style
+  comparisons). Try `clang -Oz -ffreestanding -target aarch64-none-elf`.
+- **ARMCC** — the most likely vendor compiler. For AArch64 that means
+  armclang (Arm Compiler 6), since legacy armcc does not target AArch64;
+  sourcing it requires an Arm Developer account. Backlog item.
+- **objdiff** — once installed, automate the byte-diff scoring instead
+  of eyeballing.
+
+## Workflow validation
+
+- ✓ Function extracted from blob as standalone .bin slice.
+- ✓ Three views captured (Ghidra, retdec, hand-written reference).
+- ✓ Candidate compiles + runs (matches reference semantics).
+- ✓ Single-pass byte-comparison done by hand; got 89.7% on iteration 2.
+- ✗ objdiff not installed — would automate the scoring.
+- ✗ decomp.me self-host not yet running on pve4 — would crowdsource the
+  grind via the standard interface.
+- ✗ ARMCC not installed — perfect-match unattainable without it.
+
+**The pipeline works.** Each future poll-site function follows the
+same 4-step recipe: extract → Ghidra-clean → write candidate → iterate
+until ≥90 % match. Estimated ~2-3 h per function for the small ones.
+
+## How this connects to the v3fb work
+
+This function contains 4 of the 16 poll sites. Once we have a
+byte-matching (or functionally-equivalent) C version, we can:
+
+1. Add bounded-retry counters in the C source — much cleaner than the
+   asm trampoline patcher.
+2. Compile + link as a freestanding `.o` at the original blob offset.
+3. Splice into the blob, replacing `FUN_0000d328` entirely.
+
+That's the path to a maintainable replacement for the trampoline-based
+v3fb approach, **for at least these 4 sites**. The other 12 sites live
+in different functions and would each need their own lift.
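
A minimal sketch of what step 1 could look like once the function is in C. `poll_bounded`, its parameters and its return convention are all hypothetical; the spin limit and the timeout handling would have to come from the v3fb requirements, which this log does not state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bounded poll: read *reg up to max_spins times and stop as soon
 * as (*reg & mask) is nonzero (want_set = 1) or zero (want_set = 0).
 * Returns 0 when the condition was seen, -1 when the spin budget ran out. */
static int poll_bounded(volatile uint32_t *reg, uint32_t mask,
                        int want_set, uint32_t max_spins)
{
    assert(reg != 0);
    while (max_spins--) {
        uint32_t bits = *reg & mask;
        if ((bits != 0u) == (want_set != 0))
            return 0;
    }
    return -1;
}
```

Site 12 would then read something like `poll_bounded(stat_a, 0xF0000000u, 1, limit)`, with the caller deciding what a timeout means. Adding the counter changes the generated code, so a version like this trades byte-matching for functional equivalence.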
@@ -0,0 +1,36 @@
+/* Best matching candidate so far for FUN_0000d328.
+ * Compile:  gcc -Os -ffreestanding -nostdlib -c candidate.c -o candidate.o
+ * Score:    116 bytes vs vendor 104 bytes (89.7% size match, 12 bytes / 3 insts over).
+ *
+ * Remaining gap vs vendor:
+ *   - GCC emits `cmp w, w_loaded_const ; b.ls` for `(x & 0xF0000000) == 0`
+ *     instead of vendor's `tst w, #0xF0000000 ; b.eq`. The loop itself is
+ *     12 bytes in both, but the vendor avoids materializing a constant in
+ *     a register, saving 4 bytes per loop, twice = 8 bytes.
+ *   - GCC emits `add x1, x0, #0x200 ; ldur x2, [x1, #-124]` for the
+ *     `[base+0x184]` accesses inside the handshake loop, vs vendor's
+ *     direct `ldr w1, [x0, #0x184]`. The u64v reads below request a
+ *     64-bit load, and 0x184 fits neither the scaled `ldr x` offset
+ *     (multiple of 8) nor LDUR's -256..255 range. Costs us ~4 bytes.
+ *
+ * Next iterations to try:
+ *   1. Inline-asm for the mask-tests to force TST encoding.
+ *   2. `__builtin_expect((x & 0xF0000000) != 0, 0)` to hint loop direction.
+ *   3. Alternative compilers: clang, ARMCC (the latter is what Rockchip
+ *      almost certainly used; need to source it).
+ */
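+typedef volatile unsigned int  u32v;
+typedef volatile unsigned long u64v;   /* 64-bit on LP64 AArch64 */
+
+void train_phy_block(unsigned long ctx)
+{
+    /* Sub-block base, adjusted once: *(ctx + 0xb8) + 0x8000. */
+    unsigned char *phy = (unsigned char *)(*(unsigned long *)(ctx + 0xb8) + 0x8000);
+    *(u32v *)(phy + 0x110) = 0xf000f000u;                    /* CTL: start */
+    while ((*(u32v *)(phy + 0x118) & 0xf0000000u) == 0u) ;   /* site 12: STAT_A */
+    while ((*(u32v *)(phy + 0x120) & 0xf0000000u) == 0u) ;   /* site 13: STAT_B */
+    *(u32v *)(phy + 0x160) = 0x30003u;                       /* CFG_B */
+    *(u32v *)(phy + 0x154) = 0x30003u;                       /* CFG_A */
+    /* NOTE: the u64v reads below are 64-bit loads. The vendor loads 32 bits
+     * (`ldr w1`) and only tests in a 64-bit register (`tst x1, #0x3`). */
+    while ((*(u64v *)(phy + 0x184) & 3ul) == 0ul) ;          /* site 14: HANDSHAKE set */
+    *(u32v *)(phy + 0x154) = 0x30000u;                       /* CFG_A */
+    while ((*(u64v *)(phy + 0x184) & 3ul) != 0ul) ;          /* site 15: HANDSHAKE clear */
+    *(u32v *)(phy + 0x160) = 0x30000u;                       /* CFG_B */
+    *(u32v *)(phy + 0x110) = 0xf0000000u;                    /* CTL: clear */
+}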
@@ -0,0 +1,71 @@
+# decomp.me recipe — 04_train_phy_block
+
+This is the **first real-blob function we're lifting to byte-matching C.**
+Score target: ≥95% match. Perfect match unlikely (compiler unknown).
+
+## Target asm (paste into "Target asm" field)
+
+```asm
+train_phy_block:
+    ldr     x0, [x0, #0xb8]
+    mov     w1, #0xf000f000
+    add     x0, x0, #0x8000
+    str     w1, [x0, #0x110]
+.Lwait_a:
+    ldr     w1, [x0, #0x118]
+    tst     w1, #0xf0000000
+    b.eq    .Lwait_a
+.Lwait_b:
+    ldr     w1, [x0, #0x120]
+    tst     w1, #0xf0000000
+    b.eq    .Lwait_b
+    mov     w1, #0x30003
+    str     w1, [x0, #0x160]
+    str     w1, [x0, #0x154]
+.Lwait_hs1:
+    ldr     w1, [x0, #0x184]
+    tst     x1, #0x3
+    b.eq    .Lwait_hs1
+    mov     w1, #0x30000
+    str     w1, [x0, #0x154]
+.Lwait_hs2:
+    ldr     w1, [x0, #0x184]
+    tst     x1, #0x3
+    b.ne    .Lwait_hs2
+    mov     w1, #0x30000
+    str     w1, [x0, #0x160]
+    mov     w1, #0xf0000000
+    str     w1, [x0, #0x110]
+    ret
+```
+
+## Compiler
+
+`aarch64-linux-gnu gcc 12 -O2 -ffreestanding -nostdlib`
+(Try also `-Os`. Vendor blob's compiler unknown — could be ARMCC or older
+GCC. Optimal C may differ between targets; perfect byte-match probably
+unattainable.)
+
+## Context
+
+Use `reference.c` as the starting C. The register-width quirk in the
+handshake polls (`tst x1, #0x3` tests the 64-bit register even though
+the load wrote `w1`) suggests the source tested a 64-bit value. May need
+to write the load as `(uint64_t)mmio_r(...)` and the test as a 64-bit
+AND to coax GCC into emitting `tst x1` instead of `tst w1`.
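
One way to spell that idea, as a sketch only. `mmio_r` mirrors the helper in `reference.c`; `handshake_busy` is a hypothetical wrapper added here for illustration. Nothing guarantees that a given GCC version keeps the test in an X register, since it may narrow the AND back to 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit MMIO read, same shape as mmio_r() in reference.c. */
static inline uint32_t mmio_r(volatile uint8_t *base, unsigned off)
{
    return *(volatile uint32_t *)(base + off);
}

/* Hypothetical wrapper: 32-bit load, widened, then a 64-bit AND.
 * Intended shape: `ldr w1, [x0, #0x184]` followed by `tst x1, #0x3`. */
static int handshake_busy(volatile uint8_t *phy)
{
    assert(phy != 0);
    uint64_t v = (uint64_t)mmio_r(phy, 0x184);
    return (v & UINT64_C(3)) != 0;
}
```

Unlike a `volatile unsigned long` access at +0x184, this keeps the bus access at 32 bits, which is what the vendor code does.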
+
+## Things to iterate on
+
+- Order of writes to CFG_A vs CFG_B: vendor wrote CFG_B first
+  (`str w1, [x0, #0x160]` then `str w1, [x0, #0x154]`). C order matters.
+- The two `mov w1, #0x30000` near the end could be hoisted by GCC; vendor
+  emitted them inline. May need separate variables to prevent hoist.
+- `add x0, x0, #0x8000` vs `add x0, x0, #0x8, lsl #12` — same
+  instruction, GAS picks one. Either should round-trip.
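
The last bullet can be checked against the raw word from the disassembly (`91402000` at 0xd330). The sketch below decodes the immediate of an A64 ADD (immediate) word, where `sh` is bit 22 and `imm12` is bits 21:10. `add_imm_value` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Decode the immediate of an A64 ADD/SUB (immediate) instruction word.
 * sh (bit 22) selects an optional left shift of imm12 (bits 21:10) by 12. */
static uint32_t add_imm_value(uint32_t insn)
{
    /* Bits 28:23 are 100010 for the add/sub-immediate class. */
    assert(((insn >> 23) & 0x3Fu) == 0x22u);
    uint32_t imm12 = (insn >> 10) & 0xFFFu;
    uint32_t sh    = (insn >> 22) & 1u;
    return sh ? imm12 << 12 : imm12;
}
```

`add_imm_value(0x91402000u)` gives 0x8000: `imm12` is 8 and `sh` is set, so `#0x8000` and `#0x8, lsl #12` are two spellings of the same encoding.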
+
+## Score expectations
+
+- 80%: rough loop structure + register usage matches.
+- 95%: instruction order + immediate forms match.
+- 100%: would require exact compiler/version match. Unlikely without
+  ARMCC.
@@ -0,0 +1,33 @@
+
+func.bin:     file format binary
+
+
+Disassembly of section .data:
+
+000000000000d328 <.data>:
+    d328:	f9405c00 	ldr	x0, [x0, #184]
+    d32c:	32048fe1 	mov	w1, #0xf000f000            	// #-268374016
+    d330:	91402000 	add	x0, x0, #0x8, lsl #12
+    d334:	b9011001 	str	w1, [x0, #272]
+    d338:	b9411801 	ldr	w1, [x0, #280]
+    d33c:	72040c3f 	tst	w1, #0xf0000000
+    d340:	54ffffc0 	b.eq	0xd338  // b.none
+    d344:	b9412001 	ldr	w1, [x0, #288]
+    d348:	72040c3f 	tst	w1, #0xf0000000
+    d34c:	54ffffc0 	b.eq	0xd344  // b.none
+    d350:	320087e1 	mov	w1, #0x30003               	// #196611
+    d354:	b9016001 	str	w1, [x0, #352]
+    d358:	b9015401 	str	w1, [x0, #340]
+    d35c:	b9418401 	ldr	w1, [x0, #388]
+    d360:	f240043f 	tst	x1, #0x3
+    d364:	54ffffc0 	b.eq	0xd35c  // b.none
+    d368:	52a00061 	mov	w1, #0x30000               	// #196608
+    d36c:	b9015401 	str	w1, [x0, #340]
+    d370:	b9418401 	ldr	w1, [x0, #388]
+    d374:	f240043f 	tst	x1, #0x3
+    d378:	54ffffc1 	b.ne	0xd370  // b.any
+    d37c:	52a00061 	mov	w1, #0x30000               	// #196608
+    d380:	b9016001 	str	w1, [x0, #352]
+    d384:	52be0001 	mov	w1, #0xf0000000            	// #-268435456
+    d388:	b9011001 	str	w1, [x0, #272]
+    d38c:	d65f03c0 	ret
@@ -0,0 +1,18 @@
+/* Ghidra 11.3 default decompiler output for FUN_0000d328 — unmodified. */
+void FUN_0000d328(long param_1)
+{
+  long lVar1;
+
+  lVar1 = *(long *)(param_1 + 0xb8);
+  *(undefined4 *)(lVar1 + 0x8110) = 0xf000f000;
+  do { } while ((*(uint *)(lVar1 + 0x8118) & 0xf0000000) == 0);
+  do { } while ((*(uint *)(lVar1 + 0x8120) & 0xf0000000) == 0);
+  *(undefined4 *)(lVar1 + 0x8160) = 0x30003;
+  *(undefined4 *)(lVar1 + 0x8154) = 0x30003;
+  do { } while ((*(uint *)(lVar1 + 0x8184) & 3) == 0);
+  *(undefined4 *)(lVar1 + 0x8154) = 0x30000;
+  do { } while ((*(uint *)(lVar1 + 0x8184) & 3) != 0);
+  *(undefined4 *)(lVar1 + 0x8160) = 0x30000;
+  *(undefined4 *)(lVar1 + 0x8110) = 0xf0000000;
+  return;
+}
@@ -0,0 +1,89 @@
+/* Ground-truth C for FUN_0000d328 @ blob offset 0xd328 (104 bytes / 26 insts).
+ *
+ * **The first real poll-site function we lift to C.**
+ * Contains 4 of our 16 timeout-less polls (sites 12, 13, 14, 15).
+ *
+ * Pattern:  PHY-block training step — poke a control register, wait for
+ *           two status bits, apply two intermediate values with a
+ *           handshake on a state register, ack the event.
+ *
+ * Signature:  void train_phy_block(struct phy_ctx *ctx);
+ *             (X0 = ctx, returns void)
+ *
+ * Layout:
+ *   ctx (X0)       — opaque per-rank/per-channel context
+ *   ctx->block     — 64-bit pointer to a PHY block base, stored at ctx + 0xb8
+ *   block + 0x8000 — addressed sub-block (likely "Master" bank in DWC PUB)
+ *
+ * The sub-block at +0x8000 has these registers (offsets within +0x8000):
+ *   +0x110  CTL       — write 0xF000F000 to start, 0xF0000000 to clear
+ *   +0x118  STAT_A    — bits[31:28] non-zero = step A done
+ *   +0x120  STAT_B    — bits[31:28] non-zero = step B done
+ *   +0x154  CFG_A     — write training value
+ *   +0x160  CFG_B     — write training value
+ *   +0x184  HANDSHAKE — bits[1:0] toggle between 0 and !=0 to ack writes
+ *
+ * The 4 polls (in order):
+ *   site 12 (B.EQ): STAT_A bits[31:28] non-zero?
+ *   site 13 (B.EQ): STAT_B bits[31:28] non-zero?
+ *   site 14 (B.EQ): HANDSHAKE bits[1:0] non-zero?  (ack of step-1 writes)
+ *   site 15 (B.NE): HANDSHAKE bits[1:0] zero?       (ack of step-2 write)
+ */
+#include <stdint.h>
+
+struct phy_ctx {
+    uint8_t pad[0xB8];
+    uint8_t *block;          /* base pointer used at +0xB8 in struct */
+    /* ... rest of struct unknown */
+};
+
+#define PHY_CTL          0x110
+#define PHY_STAT_A       0x118
+#define PHY_STAT_B       0x120
+#define PHY_CFG_A        0x154
+#define PHY_CFG_B        0x160
+#define PHY_HANDSHAKE    0x184
+
+#define PHY_CTL_GO       0xF000F000U
+#define PHY_CTL_CLR      0xF0000000U
+#define PHY_STAT_DONE    0xF0000000U
+#define PHY_CFG_VAL_RUN  0x00030003U
+#define PHY_CFG_VAL_END  0x00030000U
+#define PHY_HS_BUSY      0x3U
+
+static inline uint32_t mmio_r(volatile uint8_t *base, unsigned off) {
+    return *(volatile uint32_t *)(base + off);
+}
+static inline void mmio_w(volatile uint8_t *base, unsigned off, uint32_t v) {
+    *(volatile uint32_t *)(base + off) = v;
+}
+
+void train_phy_block(struct phy_ctx *ctx) {
+    volatile uint8_t *phy = (volatile uint8_t *)(ctx->block + 0x8000);
+
+    mmio_w(phy, PHY_CTL, PHY_CTL_GO);
+
+    /* site 12 — wait for step A complete */
+    while ((mmio_r(phy, PHY_STAT_A) & PHY_STAT_DONE) == 0)
+        ;
+
+    /* site 13 — wait for step B complete */
+    while ((mmio_r(phy, PHY_STAT_B) & PHY_STAT_DONE) == 0)
+        ;
+
+    mmio_w(phy, PHY_CFG_B, PHY_CFG_VAL_RUN);
+    mmio_w(phy, PHY_CFG_A, PHY_CFG_VAL_RUN);
+
+    /* site 14 — wait for handshake to assert */
+    while ((mmio_r(phy, PHY_HANDSHAKE) & PHY_HS_BUSY) == 0)
+        ;
+
+    mmio_w(phy, PHY_CFG_A, PHY_CFG_VAL_END);
+
+    /* site 15 — wait for handshake to deassert */
+    while ((mmio_r(phy, PHY_HANDSHAKE) & PHY_HS_BUSY) != 0)
+        ;
+
+    mmio_w(phy, PHY_CFG_B, PHY_CFG_VAL_END);
+    mmio_w(phy, PHY_CTL, PHY_CTL_CLR);
+}