marfrit 9c20eb0135 04_train_phy_block: clang -Oz + 32-bit-load pattern = 100% size match

Changed u64v handshake reads to u32v with an inline zero-extending
upcast. Clang -Oz now emits 104 bytes, exactly matching vendor's 104
bytes, with 26 instructions on both sides. Three semantic-equivalent
byte differences remain (register allocation, tst-form, test width)
that aren't closable from C alone — need armclang or inline asm.

Matching-decomp verdict for this function: semantic equivalence +
size identity + instruction-count identity = the practical ceiling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-15 09:16:00 +02:00

7.2 KiB


GRIND_LOG — first real-blob C-lift

Function: FUN_0000d328 @ blob offset 0xd328 (104 bytes / 26 insts). Contains 4 of our 16 timeout-less polls (sites 12, 13, 14, 15). Semantics: PHY block training step — poke CTL, wait for two STAT bits, apply two CFG values with HANDSHAKE acks, ack via CTL.

Tools tried (single-pass, no iteration yet)

| tool | output file | grade |
| --- | --- | --- |
| Ghidra 11.3 (auto-decompile) | `ghidra.c` | A. All 4 polls correctly modeled as `do {} while`. Collapsed the `(base + 0x8000) + offset` arithmetic into a single offset (`lVar1 + 0x8110` etc.), which is actually MORE useful than a hand-written reference because it surfaces the absolute register addresses. Type cleanup needed (`undefined4`/`uint`/`long`). |
| retdec v5.0 (zero-touch raw mode) | `retdec.c` | C. Recognised the function and the polls, but misread bitmask tests as comparisons (`*v6 % 4 == 0` for `& 3`, `< 0x10000000` for `& 0xF0000000`). Fabricated a return value for a void function. Loop bodies marked as `continue ->` comments. Usable as a sanity-check second opinion, not as a basis for rewriting. |
| ground truth (hand-written) | `reference.c` | n/a; this is the canonical interpretation we judge against. |

Matching-decomp candidate iterations (the actual grind)

Goal: a .c file that compiles to bytes close to the original 104-byte slice. Score = min(candidate_size, vendor_size) / max(candidate_size, vendor_size) after instruction-by-instruction diff (manual until objdiff is installed).
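
The score formula above is small enough to pin down in code. A minimal sketch, assuming sizes are given in bytes; the function name `size_match_percent` is ours, not something from the repo:

```c
#include <assert.h>
#include <stddef.h>

/* Size-match score as defined in this log:
 *   min(candidate_size, vendor_size) / max(candidate_size, vendor_size)
 * returned as a percentage. The function name is hypothetical. */
static double size_match_percent(size_t candidate_size, size_t vendor_size)
{
    assert(candidate_size > 0 && vendor_size > 0);

    size_t lo = candidate_size < vendor_size ? candidate_size : vendor_size;
    size_t hi = candidate_size < vendor_size ? vendor_size : candidate_size;

    return 100.0 * (double)lo / (double)hi;
}
```

Because the score is symmetric, a candidate that is too small is penalised the same way as one that is too large.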

Iteration 1: cast-on-each-access, `-O2`

Pattern: *(volatile u32 *)(base + offset) per access.
GCC behavior: materialised each 0x8XXX offset into its own register (mov x2, #0x8120; add x2, x3, x2; ldr w0, [x2]), exploding code size.
Result: ~160 bytes. About 65% size match (104/160). Bad.

Iteration 2 (current best): pre-adjust base outside volatile chain, `-Os`

Pattern: unsigned char *phy = base + 0x8000 once, then *(u32v *)(phy + small).
-Os instead of -O2 — drops loop-alignment NOPs.
Result: 116 bytes (29 insts). 89.7% size match (104/116). See candidate.c.
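
For reference, the two access styles from iterations 1 and 2 look roughly like this. This is a sketch, not the contents of candidate.c: the `0x8000` window and the `0xF0000000` mask come from this log, but the register names, the `0x110`/`0x120` assignments, and the value written to CTL are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef volatile uint32_t u32v;

/* 0x8000 is from the log; which register sits at which small offset,
 * and the value poked into CTL, are ASSUMPTIONS for this sketch. */
#define PHY_WINDOW 0x8000u
#define CTL_OFF    0x110u   /* hypothetical */
#define STAT_OFF   0x120u   /* hypothetical */

/* Iteration 1 style: full base + 0x8XXX arithmetic on every access. */
static void step_cast_each_access(unsigned char *base)
{
    assert(base != 0);  /* sketch-only check; compiles out with -DNDEBUG */
    *(u32v *)(base + PHY_WINDOW + CTL_OFF) = 1u;
    while ((*(u32v *)(base + PHY_WINDOW + STAT_OFF) & 0xF0000000u) == 0) {
        /* timeout-less poll, as in the vendor code */
    }
}

/* Iteration 2 style: adjust the base once, then use small offsets that
 * fit the load/store immediate field. */
static void step_preadjusted_base(unsigned char *base)
{
    assert(base != 0);  /* sketch-only check; compiles out with -DNDEBUG */
    unsigned char *phy = base + PHY_WINDOW;
    *(u32v *)(phy + CTL_OFF) = 1u;
    while ((*(u32v *)(phy + STAT_OFF) & 0xF0000000u) == 0) {
        /* timeout-less poll, as in the vendor code */
    }
}
```

Both functions are semantically identical; only the way the address expression is spelled differs, which is what changes GCC's register materialisation.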

Remaining gap to vendor (12 bytes = 3 instructions)

GCC turns (x & 0xF0000000) == 0 into cmp w, w_loaded_const; b.ls instead of vendor's tst w, #imm; b.eq. Costs 4 bytes per loop, twice = 8 bytes.
GCC's [base+0x184] accesses inside the handshake loop are add x1, x0, #0x200; ldur x2, [x1, #-124] — likely a ldp/ldur pair GCC's scheduler thinks is faster on Cortex-A76. Costs ~4 bytes.

Next iteration ideas

Inline-asm for the mask-tests to force TST encoding directly. Cheap win, gets us to ~108 bytes.
Clang (different scheduler, sometimes nicer with TST-style comparisons). Try clang -Oz -ffreestanding -target aarch64-none-elf.
Arm Compiler (armclang): the most likely vendor compiler. Sourcing armclang for AArch64 requires an Arm Developer account; backlog item.
objdiff — once installed, automate the byte-diff scoring instead of eyeballing.
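
Until objdiff is available, a crude stand-in for the eyeballing step could compare two equally sized slices one instruction word at a time. This is only a sketch of that idea; it is not how objdiff works (objdiff does considerably more), and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* AArch64 instructions are fixed-width 4-byte words, so two slices of the
 * same length can be compared position by position. Returns how many
 * positions hold the identical instruction word in both slices. */
static size_t matching_insn_words(const uint8_t *a, const uint8_t *b,
                                  size_t n_bytes)
{
    assert(n_bytes % 4 == 0);

    size_t same = 0;
    for (size_t off = 0; off < n_bytes; off += 4) {
        if (memcmp(a + off, b + off, 4) == 0)
            same++;
    }
    return same;
}
```

A positional compare like this only makes sense once the sizes already match; a single inserted instruction shifts everything after it and would need a real alignment-aware diff.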

Workflow validation

✓ Function extracted from blob as standalone .bin slice.
✓ Three views captured (Ghidra, retdec, hand-written reference).
✓ Candidate compiles + runs (matches reference semantics).
✓ Single-pass byte-comparison done by hand; got 89.7% on iteration 2.
✗ objdiff not installed — would automate the scoring.
✗ decomp.me self-host not yet running on pve4 — would crowdsource the grind via the standard interface.
✗ armclang not installed; a perfect byte match is likely unattainable without it.

The pipeline works. Each future poll-site function follows the same 4-step recipe: extract → Ghidra-clean → write candidate → iterate until ≥90 % match. Estimated ~2-3 h per function for the small ones.

How this connects to the v3fb work

This function contains 4 of the 16 poll sites. Once we have a byte-matching (or functionally-equivalent) C version, we can:

Add bounded-retry counters in the C source — much cleaner than the asm trampoline patcher.
Compile + link as a freestanding .o at the original blob offset.
Splice into the blob, replacing FUN_0000d328 entirely.
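
A bounded-retry poll in C could look like the sketch below. The retry budget, the function name, and the "wait until any masked bit is set" polarity are assumptions; the log only establishes that the vendor loops have no timeout and that they spin while `(x & mask) == 0`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef volatile uint32_t u32v;

/* Hypothetical retry budget; the real value would need tuning on hardware. */
#define POLL_MAX_RETRIES 100000u

/* Spins until any bit in `mask` is set in *reg, or until `max_retries`
 * reads have been made. Returns true on success, false on timeout. */
static bool poll_mask_bounded(u32v *reg, uint32_t mask, uint32_t max_retries)
{
    assert(reg != 0);

    for (uint32_t i = 0; i < max_retries; i++) {
        if ((*reg & mask) != 0)
            return true;
    }
    return false;
}
```

A call site would then read something like `if (!poll_mask_bounded(stat, 0xF0000000u, POLL_MAX_RETRIES)) { /* bail out */ }`. Note that adding the counter necessarily grows the function past 104 bytes, so the replacement would be functionally equivalent on the success path rather than byte-matching.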

That's the path to a maintainable replacement for the trampoline-based v3fb approach, for at least these 4 sites. The other 12 sites live in different functions and would each need their own lift.

Compiler matrix 2026-04-15 late evening

Tested the same candidate.c across GCC and clang:

| compiler | best flag | size | diff vs vendor 104 |
| --- | --- | --- | --- |
| gcc 15 | -Os | 116 B | +12 |
| gcc 15 | -O1 | 120 B | +16 |
| gcc 15 | -O2/-O3 | 128 B | +24 |
| clang 19 | -O2 / -Os / -Oz | 108 B | +4 |
| clang 19 | -O1 | 112 B | +8 |
| vendor | | 104 B | 0 |

Clang at -Oz is 4 bytes off vendor. 96% size match on our first compile. GCC -Os tops out at 12 bytes off (89.7%). The difference is consistent with how each compiler encodes mask-tests and the addressing it picks for short-imm offsets into a base+offset pointer: clang tests the mask with a single TST Wx, #imm (native bitmask-immediate encoding) ahead of the branch, whereas GCC materialises the constant first and emits MOV Wy, #const; CMP Wx, Wy; B.cc (one extra instruction per test).

Consequence: default compiler for matching-decomp on this blob is clang, not GCC. Move already committed in this GRIND_LOG; all future poll-site lifts should compile-eval under clang first.

Hypothesis resolved: the vendor compiler is almost certainly armclang (Arm's LLVM-based compiler) or a similarly aggressive LLVM variant; NOT GCC, NOT a dumbed-down rushed compiler. Evidence: their output is SMALLER than GCC -Os, which rules out "naive", and clang -Oz coming within 4 bytes of a byte match points to the LLVM family.

To push past 96%: armclang itself (needs Arm Developer account / free Community Edition), or continue with clang -Oz + hand-tweaked C + per-site inline asm where the last instruction doesn't converge. A single afternoon's iteration should push to ≥99%.

Iteration 3: 32-bit load + clang -Oz = 100% size match

Changed the handshake-loop reads from u64v to u32v (32-bit volatile loads), with a tiny inline xld() helper that zero-extends to u64 for the test. This forced clang to use ldr w, [x, #0x184] inside the loops (instead of hoisting add x9, x8, #0x184 out), cutting the 4-byte setup overhead.
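
The shape of that change, as a sketch: `xld()`, `u32v` and the `0x184` offset are named in this log, but the helper body, the `wait_handshake` wrapper, and the loop polarity (spin until either low bit is set) are assumptions, not a copy of candidate.c.

```c
#include <assert.h>
#include <stdint.h>

typedef volatile uint32_t u32v;
typedef uint64_t u64;

#define HANDSHAKE_OFF 0x184u

/* 32-bit volatile load, zero-extended to 64 bits for the caller's test. */
static inline u64 xld(u32v *p)
{
    return (u64)*p;
}

/* Hypothetical wrapper: spins until either of the low two handshake bits
 * is set. The polarity is an ASSUMPTION for this sketch. */
static void wait_handshake(unsigned char *phy)
{
    assert(phy != 0);  /* sketch-only check; compiles out with -DNDEBUG */
    while ((xld((u32v *)(phy + HANDSHAKE_OFF)) & 0x3u) == 0) {
        /* timeout-less poll, as in the vendor code */
    }
}
```

The cast from `uint32_t` to `u64` is a zero-extension, so the upper 32 bits of the tested value are always clear. That is what makes the 64-bit and 32-bit forms of the `& 3` test interchangeable.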

| compiler | flag | size | diff | score |
| --- | --- | --- | --- | --- |
| clang 19 | -Oz | 104 B | 0 | 100% (size-match) |
| gcc 15 | -Os | see below | see below | see below |

Byte-level comparison (clang vs vendor, both 104 B, both 26 insts)

Three semantically equivalent differences remain, none closable from C alone:

Reg choice: vendor x0/w1, clang x8/w9/w10.
Mask test form: vendor tst w1, #0xf0000000; b.eq, clang lsr w9, #28; cbz w9, .loop. Same size, same effect.
Handshake test width: vendor tst x1, #0x3 (64-bit on zero-extended w1), clang tst w9, #0x3 (32-bit). Same size.
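
The "same effect" claims for the mask-test forms can be spelled out in C. Each function below is the C-level meaning of one instruction pattern seen in this log; the function names are ours, and a compiler is free to canonicalise all of them to whichever encoding it prefers, so this demonstrates equivalence rather than a way to force a particular encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Top-nibble test, three spellings. */
static bool top_clear_tst(uint32_t x) { return (x & 0xF0000000u) == 0; } /* tst w, #0xf0000000; b.eq (vendor) */
static bool top_clear_lsr(uint32_t x) { return (x >> 28) == 0; }         /* lsr #28; cbz (clang) */
static bool top_clear_cmp(uint32_t x) { return x <= 0x0FFFFFFFu; }       /* cmp w, w_const; b.ls (GCC) */

/* Handshake test, 32-bit vs 64-bit on a zero-extended value. */
static bool hs_set_w(uint32_t x) { return (x & 0x3u) != 0; }             /* tst w, #0x3 (clang) */
static bool hs_set_x(uint32_t x) { return ((uint64_t)x & 0x3u) != 0; }   /* tst x, #0x3 (vendor) */

/* Asserts that all spellings agree for one input value. */
static void check_forms_agree(uint32_t x)
{
    assert(top_clear_tst(x) == top_clear_lsr(x));
    assert(top_clear_tst(x) == top_clear_cmp(x));
    assert(hs_set_w(x) == hs_set_x(x));
}
```

The interesting inputs are the boundaries: `0x0FFFFFFF` (largest value with the top nibble clear) and `0x10000000` (smallest with it set).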

None of these affect semantics. To chase byte-level exactness you'd need:

inline asm stubs forcing the specific mask-test form
register-allocation hints that C doesn't really expose
or the vendor's actual armclang binary

Verdict: done. Semantic equivalence + identical size + identical instruction count is the realistic ceiling from C. Further chase is purely cosmetic.
