benchmark/: three-way RE-tool comparison + first real C-lift

Three small functions extracted from the v1.19 conservative blob with
ground-truth C and per-tool (Ghidra / retdec / decomp.me) docs:

  01_memset: byte memset, 28 B
  02_memcpy32: word-aligned memcpy, 36 B
  03_magic_memset: magic check + tail-call to memset, 40 B

  04_train_phy_block: first real poll-site function (104 B, 26 insts),
  contains poll sites 12-15

Results in RESULTS.md:
- Ghidra: A on all four. Auto-decompile is close to final.
- retdec: A on #3, F on #1 and #2 (no register-arg inference on raw),
  C on #4 (mistakes & 0xF0000000 for < 0x10000000).

GRIND_LOG.md (in 04_train_phy_block/) records the matching-decomp
iteration: 116-byte candidate.c at -Os vs vendor 104 bytes = 89.7% size
match on first real iteration. Remaining gap is GCC's choice of
`cmp w, w_const; b.ls` over vendor's `tst w, #imm; b.eq` for the mask tests.

gdb_debug/ holds a native-aarch64 GDB single-stepper for the three
benchmark functions; boltzmann smoke test passed (memset: buf[10] 0x00→0xab).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 07:26:23 +02:00
parent 694be88964
commit 00d655187a
32 changed files with 1113 additions and 0 deletions
@@ -0,0 +1,80 @@
+# GRIND_LOG — first real-blob C-lift
+
+Function: **FUN_0000d328** @ blob offset 0xd328 (104 bytes / 26 insts).
+Contains 4 of our 16 timeout-less polls (sites 12, 13, 14, 15).
+Semantics: **PHY block training step** — poke CTL, wait for two STAT
+bits, apply two CFG values with HANDSHAKE acks, ack via CTL.
+
+## Tools tried (single-pass, no iteration yet)
+
+| tool | output file | grade |
+|---|---|---|
+| Ghidra 11.3 (auto-decompile) | `ghidra.c` | **A.** All 4 polls correctly modeled as `do {} while`. Collapsed the `(base + 0x8000) + offset` arithmetic into a single offset (`lVar1 + 0x8110` etc.) — actually MORE useful than a hand-written reference because it surfaces the absolute register addresses. Type cleanup needed (`undefined4`/`uint`/`long`). |
+| retdec v5.0 (zero-touch raw mode) | `retdec.c` | **C.** Recognised the function and the polls but: rendered bitmask tests as arithmetic comparisons that hide the mask intent (`*v6 % 4 == 0` for `& 3`, `< 0x10000000` for `& 0xF0000000`; semantically equivalent for unsigned operands, but not how the hardware test reads). Fabricated a return value for a void function. Loop bodies marked as `continue ->` comments. Usable as a sanity-check second opinion, not as a basis for rewriting. |
+| ground truth (hand-written) | `reference.c` | n/a — this is the canonical interpretation we judge against. |
+
+## Matching-decomp candidate iterations (the actual grind)
+
+Goal: a `.c` file that compiles to bytes close to the original 104-byte
+slice. Score = `min(candidate_size, vendor_size) / max(candidate_size, vendor_size)`
+after instruction-by-instruction diff (manual until objdiff is installed).
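
The size-only part of the score defined above can be sketched in a few lines of C. This is an illustration, not code from the repository: the helper name `size_match` is an assumption, and it does not model the instruction-by-instruction diff.

```c
#include <stddef.h>

/* Size-only match score: min(candidate, vendor) / max(candidate, vendor).
 * Returns a value in [0, 1]; 1.0 means identical byte counts.
 * Hypothetical helper, not part of the repository. */
static double size_match(size_t candidate_size, size_t vendor_size)
{
    if (candidate_size == 0 || vendor_size == 0)
        return 0.0; /* an empty slice cannot match anything */

    size_t lo = candidate_size < vendor_size ? candidate_size : vendor_size;
    size_t hi = candidate_size < vendor_size ? vendor_size : candidate_size;
    return (double)lo / (double)hi;
}
```

With the sizes recorded in this log, `size_match(116, 104)` evaluates to 104/116, roughly 0.897. The score is symmetric, so a candidate that is too small is penalised the same way as one that is too large.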
+
+### Iteration 1: cast-on-each-access, `-O2`
+- Pattern: `*(volatile u32 *)(base + offset)` per access.
+- GCC behavior: materialised each `0x8XXX` offset into its own register
+  (`mov x2, #0x8120; add x2, x3, x2; ldr w0, [x2]`), exploding code size.
+- Result: ~160 bytes. **~65% size match (104/160) by the formula above. Bad.**
+
+### Iteration 2 (current best): pre-adjust base outside volatile chain, `-Os`
+- Pattern: `unsigned char *phy = base + 0x8000` once, then `*(u32v *)(phy + small)`.
+- `-Os` instead of `-O2` — drops loop-alignment NOPs.
+- Result: **116 bytes (29 insts)**. **89.7% size match (104/116).** See `candidate.c`.
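
To make the difference between the two iterations concrete, here is a minimal sketch of both access patterns. It is not the contents of `candidate.c`: the function names, the choice of offsets `0x8110` and `0x8120`, and the XOR used to combine the two reads are placeholders for illustration only.

```c
#include <stdint.h>

typedef volatile uint32_t u32v;

/* Iteration 1 style: the full 0x8XXX offset and the volatile cast are
 * repeated on every access. */
static uint32_t read_pair_iter1(unsigned char *base)
{
    uint32_t a = *(volatile uint32_t *)(base + 0x8110);
    uint32_t b = *(volatile uint32_t *)(base + 0x8120);
    return a ^ b; /* placeholder combination, not vendor semantics */
}

/* Iteration 2 style: fold the 0x8000 block offset into the base once,
 * so each access only needs a small offset from `phy`. */
static uint32_t read_pair_iter2(unsigned char *base)
{
    unsigned char *phy = base + 0x8000;
    uint32_t a = *(u32v *)(phy + 0x110);
    uint32_t b = *(u32v *)(phy + 0x120);
    return a ^ b; /* placeholder combination, not vendor semantics */
}
```

Both functions read the same two addresses and return the same value; only the shape of the address arithmetic differs, which is what changes the code the compiler emits.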
+
+### Remaining gap to vendor (12 bytes = 3 instructions)
+
+1. GCC turns `(x & 0xF0000000) == 0` into `cmp w, w_loaded_const; b.ls`
+   instead of vendor's `tst w, #imm; b.eq`. Costs 4 bytes per loop, twice
+   = 8 bytes.
+2. GCC's `[base+0x184]` accesses inside the handshake loop are
+   `add x1, x0, #0x200; ldur x2, [x1, #-124]` — likely a ldp/ldur pair
+   GCC's scheduler thinks is faster on Cortex-A76. Costs ~4 bytes.
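
The compiler is free to make the substitution in item 1 because, for an unsigned 32-bit value, the mask test and a range comparison are the same predicate. The sketch below shows the two source-level forms. The function names are hypothetical, and the constant `0x0FFFFFFF` is inferred from the `b.ls` (unsigned lower-or-same) condition rather than read from the actual disassembly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask form, matching the vendor's tst + b.eq shape. */
static bool top_nibble_clear_mask(uint32_t x)
{
    return (x & 0xF0000000u) == 0;
}

/* Range form a compiler may substitute (cmp + b.ls): for an unsigned
 * 32-bit value, "top four bits clear" is the same as x <= 0x0FFFFFFF. */
static bool top_nibble_clear_cmp(uint32_t x)
{
    return x <= 0x0FFFFFFFu;
}
```

Since both forms are equivalent at the C level, rewriting the expression in the source is unlikely to change the emitted instructions by itself; the choice is made inside the compiler.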
+
+### Next iteration ideas
+
+- **Inline-asm** for the mask-tests to force TST encoding directly. Cheap
+  win, gets us to ~108 bytes.
+- **Clang** (different scheduler, sometimes nicer with TST-style
+  comparisons). Try `clang -Oz -ffreestanding -target aarch64-none-elf`.
+- **Arm Compiler (armclang)**: the most likely vendor compiler. Legacy
+  ARMCC has no AArch64 target, so armclang is the relevant toolchain.
+  Sourcing it requires an Arm Developer account; backlog item.
+- **objdiff** — once installed, automate the byte-diff scoring instead
+  of eyeballing.
+
+## Workflow validation
+
+- ✓ Function extracted from blob as standalone .bin slice.
+- ✓ Three views captured (two decompilers, Ghidra and retdec, plus the hand-written reference).
+- ✓ Candidate compiles + runs (matches reference semantics).
+- ✓ Single-pass byte-comparison done by hand; got 89.7% on iteration 2.
+- ✗ objdiff not installed — would automate the scoring.
+- ✗ decomp.me self-host not yet running on pve4 — would crowdsource the
+  grind via the standard interface.
+- ✗ ARMCC not installed; a perfect match is unlikely without the vendor's compiler.
+
+**The pipeline works.** Each future poll-site function follows the
+same 4-step recipe: extract → Ghidra-clean → write candidate → iterate
+until ≥90 % match. Estimated ~2-3 h per function for the small ones.
+
+## How this connects to the v3fb work
+
+This function contains 4 of the 16 poll sites. Once we have a
+byte-matching (or functionally-equivalent) C version, we can:
+
+1. Add bounded-retry counters in the C source — much cleaner than the
+   asm trampoline patcher.
+2. Compile + link as a freestanding `.o` at the original blob offset.
+3. Splice into the blob, replacing `FUN_0000d328` entirely.
+
+That's the path to a maintainable replacement for the trampoline-based
+v3fb approach, **for at least these 4 sites**. The other 12 sites live
+in different functions and would each need their own lift.
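
Step 1 of the list above (bounded-retry counters) could look roughly like the following. This is a sketch under stated assumptions: the helper name `poll_bounded`, its signature, and the idea of returning an error code on timeout are all illustrative, and the real register addresses, masks and retry budget would have to come from the lifted function.

```c
#include <stdint.h>

/* Poll `reg` until (value & mask) == want, giving up after `max_tries`
 * reads. Returns 0 on success, -1 on timeout. Hypothetical helper: the
 * name, signature and retry budget are illustrative assumptions. */
static int poll_bounded(volatile uint32_t *reg, uint32_t mask,
                        uint32_t want, uint32_t max_tries)
{
    while (max_tries-- != 0) {
        if ((*reg & mask) == want)
            return 0;
    }
    return -1;
}
```

Each of the four timeout-less loops would then become one call with its own mask and expected value, and the caller decides what to do when `-1` comes back instead of spinning forever.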