Changed u64v handshake reads to u32v with an inline zero-extending upcast. Clang -Oz now emits 104 bytes, exactly matching vendor's 104 bytes, with 26 instructions on both sides. Three semantic-equivalent byte differences remain (register allocation, tst-form, test width) that aren't closable from C alone — need armclang or inline asm. Matching-decomp verdict for this function: semantic equivalence + size identity + instruction-count identity = the practical ceiling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GRIND_LOG — first real-blob C-lift
Function: FUN_0000d328 @ blob offset 0xd328 (104 bytes / 26 insts). Contains 4 of our 16 timeout-less polls (sites 12, 13, 14, 15). Semantics: PHY block training step — poke CTL, wait for two STAT bits, apply two CFG values with HANDSHAKE acks, ack via CTL.
Tools tried (single-pass, no iteration yet)
| tool | output file | grade |
|---|---|---|
| Ghidra 11.3 (auto-decompile) | `ghidra.c` | A. All 4 polls correctly modeled as `do {} while`. Collapsed the `(base + 0x8000) + offset` arithmetic into a single offset (`lVar1 + 0x8110` etc.), which is actually MORE useful than a hand-written reference because it surfaces the absolute register addresses. Type cleanup needed (`undefined4`/`uint`/`long`). |
| retdec v5.0 (zero-touch raw mode) | `retdec.c` | C. Recognised the function and the polls but: misread bitmask tests as comparisons (`*v6 % 4 == 0` for `& 3`, `< 0x10000000` for `& 0xF0000000`). Fabricated a return value for a void function. Loop bodies marked as `continue` -> comments. Usable as a sanity-check second opinion, not as a basis for rewriting. |
| ground truth (hand-written) | `reference.c` | n/a: this is the canonical interpretation we judge against. |
Matching-decomp candidate iterations (the actual grind)
Goal: a .c file that compiles to bytes close to the original 104-byte
slice. Score = min(candidate_size, vendor_size) / max(candidate_size, vendor_size)
after instruction-by-instruction diff (manual until objdiff is installed).
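A minimal sketch of the scoring, assuming the size score is exactly the min/max ratio stated above. `matching_words` is a hypothetical helper, not part of the existing tooling and not what objdiff does: it only counts 32-bit instruction words that are identical at the same index in both slices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size score as defined above: min(candidate, vendor) / max(candidate, vendor). */
double size_score(size_t candidate_size, size_t vendor_size)
{
    assert(candidate_size > 0 && vendor_size > 0);
    size_t lo = candidate_size < vendor_size ? candidate_size : vendor_size;
    size_t hi = candidate_size < vendor_size ? vendor_size : candidate_size;
    return (double)lo / (double)hi;
}

/* Hypothetical helper: count instruction words that are bit-identical at the
 * same index. AArch64 instructions are fixed 4-byte words, so index i in each
 * array is instruction i of that slice. */
size_t matching_words(const uint32_t *cand, size_t cand_n,
                      const uint32_t *vend, size_t vend_n)
{
    size_t n = cand_n < vend_n ? cand_n : vend_n;
    size_t same = 0;
    for (size_t i = 0; i < n; i++) {
        if (cand[i] == vend[i])
            same++;
    }
    return same;
}
```

With these definitions, `size_score(116, 104)` is about 0.897 and `size_score(108, 104)` is about 0.963. A positional word count is deliberately naive: a single inserted instruction shifts everything after it, which is one reason a real diff tool is still on the wish list.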
Iteration 1: cast-on-each-access, -O2
- Pattern: `*(volatile u32 *)(base + offset)` per access.
- GCC behavior: materialised each `0x8XXX` offset into its own register (`mov x2, #0x8120; add x2, x3, x2; ldr w0, [x2]`), exploding code size.
- Result: ~160 bytes. 53% size match. Bad.
Iteration 2 (current best): pre-adjust base outside volatile chain, -Os
- Pattern: `unsigned char *phy = base + 0x8000` once, then `*(u32v *)(phy + small)`.
- `-Os` instead of `-O2`: drops loop-alignment NOPs.
- Result: 116 bytes (29 insts). 89.7% size match (104/116). See `candidate.c`.
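The two access patterns side by side, as a rough sketch rather than an excerpt from `candidate.c`. The offsets `0x8110` and `0x8120` are the ones that show up in the Ghidra output and the GCC listing; the function names, and the choice to read two registers and add them, are purely illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

typedef volatile uint32_t u32v;

/* Iteration 1 style: cast the full base + large offset on every access.
 * Each 0x8XXX constant is too big for a load's immediate field, so the
 * compiler may build it in a register per access. */
uint32_t read_pair_cast_each(unsigned char *base)
{
    assert(((uintptr_t)base & 3u) == 0);   /* 32-bit accesses need 4-byte alignment */
    uint32_t a = *(u32v *)(base + 0x8110);
    uint32_t b = *(u32v *)(base + 0x8120);
    return a + b;
}

/* Iteration 2 style: fold the 0x8000 block offset into a pointer once,
 * leaving small offsets that fit a load's immediate field. */
uint32_t read_pair_preadjusted(unsigned char *base)
{
    assert(((uintptr_t)base & 3u) == 0);
    unsigned char *phy = base + 0x8000;
    uint32_t a = *(u32v *)(phy + 0x110);
    uint32_t b = *(u32v *)(phy + 0x120);
    return a + b;
}
```

Both functions read the same two addresses and return the same value; only the shape of the address arithmetic handed to the compiler differs.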
Remaining gap to vendor (12 bytes = 3 instructions)
- GCC turns `(x & 0xF0000000) == 0` into `cmp w, w_loaded_const; b.ls` instead of vendor's `tst w, #imm; b.eq`. Costs 4 bytes per loop, twice = 8 bytes.
- GCC's `[base+0x184]` accesses inside the handshake loop are `add x1, x0, #0x200; ldur x2, [x1, #-124]`, an add/ldur pair. A 64-bit `ldr x2, [x0, #0x184]` is not encodable because 0x184 is not a multiple of 8, and the offset is outside `ldur`'s signed 9-bit range, so the address has to be set up first. Costs ~4 bytes.
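For reference, the mask test can be spelled several ways in C that are equivalent for unsigned 32-bit values. The sketch below uses hypothetical function names; which instruction sequence a compiler picks for any of them is not guaranteed, so it shows the equivalence only and is not a recipe for forcing `tst`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask form: the shape that maps most directly onto tst w, #0xf0000000. */
bool top_nibble_clear_mask(uint32_t x)
{
    return (x & 0xF0000000u) == 0;
}

/* Comparison form: unsigned compare against a constant (the cmp; b.ls shape). */
bool top_nibble_clear_cmp(uint32_t x)
{
    return x <= 0x0FFFFFFFu;
}

/* Shift form: move the top nibble down and compare with zero. */
bool top_nibble_clear_shift(uint32_t x)
{
    return (x >> 28) == 0;
}

/* Self-check: all three spellings must agree for a given value. */
void check_forms_agree(uint32_t x)
{
    assert(top_nibble_clear_mask(x) == top_nibble_clear_cmp(x));
    assert(top_nibble_clear_mask(x) == top_nibble_clear_shift(x));
}
```

The equivalence depends on `x` being unsigned; with a signed 32-bit type the comparison form would treat values with bit 31 set as small.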
Next iteration ideas
- Inline-asm for the mask-tests to force TST encoding directly. Cheap win, gets us to ~108 bytes.
- Clang (different scheduler, sometimes nicer with TST-style comparisons). Try `clang -Oz -ffreestanding -target aarch64-none-elf`.
- ARMCC: the most likely vendor compiler. Sourcing armclang for AArch64 requires an Arm Developer account; backlog item.
- objdiff: once installed, automate the byte-diff scoring instead of eyeballing.
Workflow validation
- ✓ Function extracted from blob as standalone .bin slice.
- ✓ Three decompiler views captured (Ghidra, retdec, hand-written reference).
- ✓ Candidate compiles + runs (matches reference semantics).
- ✓ Single-pass byte-comparison done by hand; got 89.7% on iteration 2.
- ✗ objdiff not installed — would automate the scoring.
- ✗ decomp.me self-host not yet running on pve4 — would crowdsource the grind via the standard interface.
- ✗ ARMCC not installed — perfect-match unattainable without it.
The pipeline works. Each future poll-site function follows the same 4-step recipe: extract → Ghidra-clean → write candidate → iterate until ≥90 % match. Estimated ~2-3 h per function for the small ones.
How this connects to the v3fb work
This function contains 4 of the 16 poll sites. Once we have a byte-matching (or functionally-equivalent) C version, we can:
- Add bounded-retry counters in the C source, which is much cleaner than the asm trampoline patcher.
- Compile + link as a freestanding `.o` at the original blob offset.
- Splice into the blob, replacing `FUN_0000d328` entirely.
That's the path to a maintainable replacement for the trampoline-based v3fb approach, for at least these 4 sites. The other 12 sites live in different functions and would each need their own lift.
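A bounded-retry poll could look roughly like the sketch below. Everything in it is an assumption for illustration: the names, the return codes, the retry budget passed by the caller, and the polarity (waiting until any bit in the mask is set). The real loops in `FUN_0000d328` would need their exact conditions carried over from `reference.c`.

```c
#include <assert.h>
#include <stdint.h>

typedef volatile uint32_t u32v;

/* Hypothetical return codes for the sketch. */
enum { POLL_OK = 0, POLL_TIMEOUT = -1 };

/* Read *reg up to max_tries times, returning as soon as any bit in mask is
 * set. Unlike the timeout-less originals, this always terminates. */
int poll_bits_bounded(u32v *reg, uint32_t mask, uint32_t max_tries)
{
    assert(reg != 0);
    for (uint32_t i = 0; i < max_tries; i++) {
        if ((*reg & mask) != 0)
            return POLL_OK;
    }
    return POLL_TIMEOUT;
}
```

On the host this can be exercised against a plain `volatile uint32_t` variable standing in for the status register; on target the pointer would come from the pre-adjusted `phy` base. What the caller does on `POLL_TIMEOUT` is a separate design decision that this sketch leaves open.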
Compiler matrix 2026-04-15 late evening
Tested the same candidate.c across GCC and clang:
| compiler | best flag | size | diff vs vendor 104 |
|---|---|---|---|
| gcc 15 | -Os | 116 B | +12 |
| gcc 15 | -O1 | 120 B | +16 |
| gcc 15 | -O2/-O3 | 128 B | +24 |
| clang 19 | -O2 / -Os / -Oz | 108 B | +4 |
| clang 19 | -O1 | 112 B | +8 |
| vendor | | 104 B | 0 |
Clang at -Oz is 4 bytes off vendor. 96% size match on our first
compile. GCC -Os tops out at 12 bytes off — 89.7%. The difference is
consistent with how each compiler encodes mask-tests and the addressing
it picks for short-imm offsets into a base+offset pointer — clang
prefers TST Wx, #imm (single instruction, native imm encoding), GCC
prefers MOV Wy, #const; CMP Wx, Wy; B.cc (three instructions, larger).
Consequence: default compiler for matching-decomp on this blob is clang, not GCC. Move already committed in this GRIND_LOG; all future poll-site lifts should compile-eval under clang first.
Hypothesis resolved: the vendor compiler is almost certainly armclang (ARM's LLVM-based fork) or a similarly-aggressive LLVM variant, NOT GCC, NOT a dumbed-down rushed compiler. Evidence: their output is SMALLER than GCC -Os, which rules out "naive". The fact that clang -Oz approaches a byte-match suggests the LLVM family.
To push past 96%: armclang itself (needs Arm Developer account / free Community Edition), or continue clang -Oz + hand-tweaked C + per-site inline asm where the last instruction doesn't converge. A single afternoon's iteration should push to ≥99%.
Iteration 3: 32-bit load + clang -Oz = 100% size match
Changed the handshake-loop reads from u64v to u32v (32-bit volatile
loads), with a tiny inline xld() helper that zero-extends to u64 for
the test. This forced clang to use ldr w, [x, #0x184] inside the
loops (instead of hoisting add x9, x8, #0x184 out), cutting the
4-byte setup overhead.
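The helper might look something like this. It is a guess at the shape of `xld()` based on the description above, not a copy from `candidate.c`; `read_handshake` and `handshake_low_bits` are hypothetical wrappers that reuse the `0x184` offset and the `0x3` mask seen in the listings.

```c
#include <assert.h>
#include <stdint.h>

typedef volatile uint32_t u32v;

/* Assumed shape of xld(): one 32-bit volatile load, zero-extended to 64 bits. */
static inline uint64_t xld(u32v *p)
{
    return (uint64_t)*p;
}

/* Hypothetical wrapper: read the handshake word at phy + 0x184. The offset is
 * a multiple of 4, so a 32-bit load can use it as a direct immediate. */
uint64_t read_handshake(unsigned char *phy)
{
    assert(((uintptr_t)phy & 3u) == 0);
    return xld((u32v *)(phy + 0x184));
}

/* Hypothetical wrapper: the two low bits that the handshake loops test. */
uint64_t handshake_low_bits(unsigned char *phy)
{
    return read_handshake(phy) & 0x3u;
}
```

Because the load is unsigned and 32 bits wide, the upper half of the 64-bit result is always zero, so testing the low bits at 32-bit or 64-bit width gives the same answer.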
| compiler | flag | size | diff | score |
|---|---|---|---|---|
| clang 19 | -Oz | 104 B | 0 | 100% (size-match) |
| gcc 15 | -Os | see below | see below | see below |
Byte-level comparison (clang vs vendor, both 104 B, both 26 insts)
Three semantic-equivalent differences remain — not closable from C alone:
- Reg choice: vendor `x0/w1`, clang `x8/w9/w10`.
- Mask test form: vendor `tst w1, #0xf0000000; b.eq`, clang `lsr w9, #28; cbz w9, .loop`. Same size, same effect.
- Handshake test width: vendor `tst x1, #0x3` (64-bit on zero-extended w1), clang `tst w9, #0x3` (32-bit). Same size.
None of these affect semantics. To chase byte-level exactness you'd need:
- inline asm stubs forcing the specific mask-test form
- register-allocation hints that C doesn't really expose
- or the vendor's actual armclang binary
Verdict: done. Semantic equivalence + identical size + identical instruction count is the realistic ceiling from C. Further chase is purely cosmetic.