Three small benchmark functions plus the first real poll-site function,
extracted from the v1.19 conservative blob, each with ground-truth C and
per-tool (Ghidra / retdec / decomp.me) docs:
01_memset — byte memset, 28 B
02_memcpy32 — word-aligned memcpy, 36 B
03_magic_memset — magic check + tail-call to memset, 40 B
04_train_phy_block — first real poll-site function (104 B, 26 insts),
contains poll sites 12-15
Results in RESULTS.md:
- Ghidra: A on all four. Auto-decompile is close to final.
- retdec: A on #3, F on #1 and #2 (no register-arg inference on raw),
C on #4 (renders the `& 0xF0000000` mask test as `< 0x10000000`).
GRIND_LOG.md (in 04_train_phy_block/) records the matching-decomp
iteration: 116-byte candidate.c at -Os vs vendor 104 bytes = 89.7%
size match on first real iteration. Remaining gap is GCC's choice of
`cmp w, w_const; b.ls` over vendor's `tst w, #imm; b.eq` for the
mask tests.
gdb_debug/ holds a native-aarch64 GDB single-stepper for the three
benchmark functions — boltzmann smoke test passed (memset:
buf[10] 0x00→0xab).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GRIND_LOG — first real-blob C-lift
Function: FUN_0000d328 @ blob offset 0xd328 (104 bytes / 26 insts). Contains 4 of our 16 timeout-less polls (sites 12, 13, 14, 15). Semantics: PHY block training step — poke CTL, wait for two STAT bits, apply two CFG values with HANDSHAKE acks, ack via CTL.
Tools tried (single-pass, no iteration yet)
| tool | output file | grade |
|---|---|---|
| Ghidra 11.3 (auto-decompile) | ghidra.c | A. All 4 polls correctly modeled as `do {} while`. Collapsed the `(base + 0x8000) + offset` arithmetic into a single offset (`lVar1 + 0x8110` etc.), which is actually MORE useful than a hand-written reference because it surfaces the absolute register addresses. Type cleanup needed (`undefined4`/`uint`/`long`). |
| retdec v5.0 (zero-touch raw mode) | retdec.c | C. Recognised the function and the polls but: misread bitmask tests as comparisons (`*v6 % 4 == 0` for `& 3`, `< 0x10000000` for `& 0xF0000000`). Fabricated a return value for a void function. Loop bodies marked as continue -> comments. Usable as a sanity-check second opinion, not as a basis for rewriting. |
| ground truth (hand-written) | reference.c | n/a: this is the canonical interpretation we judge against. |
Matching-decomp candidate iterations (the actual grind)
Goal: a .c file that compiles to bytes close to the original 104-byte
slice. Score = min(candidate_size, vendor_size) / max(candidate_size, vendor_size)
after instruction-by-instruction diff (manual until objdiff is installed).
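The size part of the score can be sketched as a small helper. This is only an illustration of the min/max formula above; the function name `size_match_pct` is hypothetical and is not part of the repo, and it does not perform the instruction-by-instruction diff.

```c
#include <stddef.h>

/* Size-match score in percent: 100 * min(a, b) / max(a, b).
 * Hypothetical helper; returns 0 when either size is 0. */
static double size_match_pct(size_t candidate_size, size_t vendor_size)
{
    if (candidate_size == 0 || vendor_size == 0)
        return 0.0;

    size_t lo = candidate_size < vendor_size ? candidate_size : vendor_size;
    size_t hi = candidate_size < vendor_size ? vendor_size : candidate_size;

    return 100.0 * (double)lo / (double)hi;
}
```

For a 116-byte candidate against the 104-byte vendor slice this evaluates to roughly 89.7.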
Iteration 1: cast-on-each-access, -O2
- Pattern: `*(volatile u32 *)(base + offset)` per access.
- GCC behavior: materialised each `0x8XXX` offset into its own register (`mov x2, #0x8120; add x2, x3, x2; ldr w0, [x2]`), exploding code size.
- Result: ~160 bytes. 53% size match. Bad.
Iteration 2 (current best): pre-adjust base outside volatile chain, -Os
- Pattern: `unsigned char *phy = base + 0x8000` once, then `*(u32v *)(phy + small)`.
- `-Os` instead of `-O2`: drops loop-alignment NOPs.
- Result: 116 bytes (29 insts). 89.7% size match (104 / 116). See `candidate.c`.
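The difference between the two access patterns can be sketched side by side. This is a minimal illustration, not the contents of `candidate.c`: the function names and the choice of registers are assumptions, and the offsets 0x110 and 0x120 are picked only because 0x8110 and 0x8120 appear in the tool output above.

```c
#include <stdint.h>

typedef uint32_t u32;
typedef volatile uint32_t u32v;

/* Hypothetical offsets, for illustration only. */
#define PHY_BLOCK 0x8000u
#define REG_A_OFF 0x110u /* base + 0x8110 */
#define REG_B_OFF 0x120u /* base + 0x8120 */

/* Iteration-1 style: the full offset is cast on every access. */
static u32 read_pair_cast_each(unsigned char *base)
{
    u32 a = *(volatile u32 *)(base + PHY_BLOCK + REG_A_OFF);
    u32 b = *(volatile u32 *)(base + PHY_BLOCK + REG_B_OFF);
    return a ^ b;
}

/* Iteration-2 style: adjust the base once, then use small offsets
 * that fit the immediate field of a load instruction. */
static u32 read_pair_preadjusted(unsigned char *base)
{
    unsigned char *phy = base + PHY_BLOCK;
    u32 a = *(u32v *)(phy + REG_A_OFF);
    u32 b = *(u32v *)(phy + REG_B_OFF);
    return a ^ b;
}
```

Both functions read the same two words; only the shape of the address arithmetic handed to the compiler differs.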
Remaining gap to vendor (12 bytes = 3 instructions)
- GCC turns `(x & 0xF0000000) == 0` into `cmp w, w_loaded_const; b.ls` instead of vendor's `tst w, #imm; b.eq`. Costs 4 bytes per loop, twice = 8 bytes.
- GCC's `[base+0x184]` accesses inside the handshake loop are `add x1, x0, #0x200; ldur x2, [x1, #-124]`, likely a ldp/ldur pair GCC's scheduler thinks is faster on Cortex-A76. Costs ~4 bytes.
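GCC is free to make the first substitution because, for an unsigned 32-bit value, the mask test and the unsigned comparison are the same predicate. A small sketch showing the three equivalent spellings (the function names are hypothetical):

```c
#include <stdint.h>
#include <stdbool.h>

/* Mask form, as written in the C source (maps naturally to tst / b.eq). */
static bool top_nibble_clear_mask(uint32_t x)
{
    return (x & 0xF0000000u) == 0;
}

/* Unsigned "lower or same" form (maps naturally to cmp / b.ls). */
static bool top_nibble_clear_cmp(uint32_t x)
{
    return x <= 0x0FFFFFFFu;
}

/* Strict "less than" form, the spelling retdec emitted. */
static bool top_nibble_clear_lt(uint32_t x)
{
    return x < 0x10000000u;
}
```

The equivalence holds only for unsigned operands; with a signed 32-bit type, values with bit 31 set would compare as negative and pass the `<` test.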
Next iteration ideas
- Inline-asm for the mask-tests to force TST encoding directly. Cheap win, gets us to ~108 bytes.
- Clang (different scheduler, sometimes nicer with TST-style comparisons). Try `clang -Oz -ffreestanding -target aarch64-none-elf`.
- ARMCC: the most likely vendor compiler. Sourcing armclang for AArch64 requires an Arm Developer account; backlog item.
- objdiff: once installed, automate the byte-diff scoring instead of eyeballing.
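The inline-asm idea might look something like the sketch below, which uses GCC's `asm goto` so that the branch sits inside the asm and cannot be rewritten as `cmp`/`b.ls`. This is untested against the blob: the helper name is hypothetical, the plain-C fallback exists only so the snippet builds on non-AArch64 hosts, and whether it actually reaches ~108 bytes depends on how GCC lays out the surrounding loop.

```c
#include <stdint.h>
#include <stdbool.h>

/* Returns true when (x & 0xF0000000) == 0. Hypothetical helper. */
static inline bool top_nibble_clear(uint32_t x)
{
#if defined(__aarch64__) && defined(__GNUC__)
    /* 0xF0000000 is encodable as an A64 logical immediate,
     * so TST can take it directly without loading a constant. */
    __asm__ goto ("tst %w[v], #0xf0000000\n\t"
                  "b.eq %l[is_clear]"
                  : /* no outputs */
                  : [v] "r" (x)
                  : "cc"
                  : is_clear);
    return false;
is_clear:
    return true;
#else
    /* Portable fallback with the same semantics. */
    return (x & 0xF0000000u) == 0;
#endif
}
```

A poll loop would then read `while (top_nibble_clear(*reg)) { }` or its negation, depending on the polarity in `reference.c`.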
Workflow validation
- ✓ Function extracted from blob as standalone .bin slice.
- ✓ Three decompiler views captured (Ghidra, retdec, hand-written reference).
- ✓ Candidate compiles + runs (matches reference semantics).
- ✓ Single-pass byte-comparison done by hand; got 89.7% (104 / 116 bytes) on iteration 2.
- ✗ objdiff not installed — would automate the scoring.
- ✗ decomp.me self-host not yet running on pve4 — would crowdsource the grind via the standard interface.
- ✗ ARMCC not installed — perfect-match unattainable without it.
The pipeline works. Each future poll-site function follows the same 4-step recipe: extract → Ghidra-clean → write candidate → iterate until ≥90 % match. Estimated ~2-3 h per function for the small ones.
How this connects to the v3fb work
This function contains 4 of the 16 poll sites. Once we have a byte-matching (or functionally-equivalent) C version, we can:
- Add bounded-retry counters in the C source, which is much cleaner than the asm trampoline patcher.
- Compile + link as a freestanding `.o` at the original blob offset.
- Splice into the blob, replacing `FUN_0000d328` entirely.
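A bounded-retry poll in C could be as small as the sketch below. It is an assumption-laden illustration: the helper name, the retry budget, and the "wait until any masked bit is set" polarity are all hypothetical, and the real sites should follow whatever `reference.c` does.

```c
#include <stdint.h>
#include <stdbool.h>

/* Poll *reg until any bit in mask is set, giving up after max_tries reads.
 * Returns true on success, false on timeout. Hypothetical helper. */
static bool poll_bits_set(volatile uint32_t *reg, uint32_t mask,
                          uint32_t max_tries)
{
    while (max_tries--) {
        if (*reg & mask)
            return true;
    }
    return false;
}
```

The caller decides what a timeout means (retry the training step, report an error, or fall through), which the original timeout-less loops cannot express.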
That's the path to a maintainable replacement for the trampoline-based v3fb approach, for at least these 4 sites. The other 12 sites live in different functions and would each need their own lift.