9c20eb0135
Changed u64v handshake reads to u32v with an inline zero-extending upcast. Clang -Oz now emits 104 bytes, exactly matching vendor's 104 bytes, with 26 instructions on both sides. Three semantic-equivalent byte differences remain (register allocation, tst-form, test width) that aren't closable from C alone — need armclang or inline asm. Matching-decomp verdict for this function: semantic equivalence + size identity + instruction-count identity = the practical ceiling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
148 lines
7.2 KiB
Markdown
# GRIND_LOG — first real-blob C-lift

Function: **FUN_0000d328** @ blob offset 0xd328 (104 bytes / 26 insts).
Contains 4 of our 16 timeout-less polls (sites 12, 13, 14, 15).
Semantics: **PHY block training step** (poke CTL, wait for two STAT
bits, apply two CFG values with HANDSHAKE acks, ack via CTL).

## Tools tried (single-pass, no iteration yet)

| tool | output file | grade |
|---|---|---|
| Ghidra 11.3 (auto-decompile) | `ghidra.c` | **A.** All 4 polls correctly modeled as `do {} while`. Collapsed the `(base + 0x8000) + offset` arithmetic into a single offset (`lVar1 + 0x8110` etc.), which is actually MORE useful than a hand-written reference because it surfaces the absolute register addresses. Type cleanup needed (`undefined4`/`uint`/`long`). |
| retdec v5.0 (zero-touch raw mode) | `retdec.c` | **C.** Recognised the function and the polls but: rendered bitmask tests as arithmetic comparisons (`*v6 % 4 == 0` for `& 3`, `< 0x10000000` for `& 0xF0000000`), which are equivalent for unsigned values but hide the intent. Fabricated a return value for a void function. Loop bodies marked as `continue ->` comments. Usable as a sanity-check second opinion, not as a basis for rewriting. |
| ground truth (hand-written) | `reference.c` | n/a: this is the canonical interpretation we judge against. |

## Matching-decomp candidate iterations (the actual grind)

Goal: a `.c` file that compiles to bytes close to the original 104-byte
slice. Score = `min(candidate_size, vendor_size) / max(candidate_size, vendor_size)`
after instruction-by-instruction diff (manual until objdiff is installed).

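
The score above is just a size ratio, so it is easy to compute mechanically while objdiff is not yet installed. A minimal sketch in C follows; the function name `size_match_score` and the convention that an empty slice scores zero are assumptions for illustration, not part of the existing tooling.

```c
#include <stddef.h>

/* Size-match score as defined in this log: min(a, b) / max(a, b).
 * Returns a value in [0, 1]; 1.0 means the two sizes are identical.
 * Assumption: a zero-length slice is treated as "no match" (0.0). */
static double size_match_score(size_t candidate_size, size_t vendor_size)
{
    if (candidate_size == 0 || vendor_size == 0)
        return 0.0;

    size_t lo = candidate_size < vendor_size ? candidate_size : vendor_size;
    size_t hi = candidate_size < vendor_size ? vendor_size : candidate_size;
    return (double)lo / (double)hi;
}
```

With the sizes recorded in this log, `size_match_score(116, 104)` is about 0.897 and `size_match_score(108, 104)` is about 0.963. The ratio says nothing about which bytes differ: two 104-byte functions score 1.0 even if no instruction matches, which is why the instruction-level diff is still needed.
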
### Iteration 1: cast-on-each-access, `-O2`

- Pattern: `*(volatile u32 *)(base + offset)` per access.
- GCC behavior: materialised each `0x8XXX` offset into its own register
  (`mov x2, #0x8120; add x2, x3, x2; ldr w0, [x2]`), exploding code size.
- Result: ~160 bytes. **~65% size match (104/160). Bad.**

### Iteration 2 (current best): pre-adjust base outside volatile chain, `-Os`

- Pattern: `unsigned char *phy = base + 0x8000` once, then `*(u32v *)(phy + small)`.
- `-Os` instead of `-O2`: drops loop-alignment NOPs.
- Result: **116 bytes (29 insts)**. **89.7% size match (104/116).** See `candidate.c`.

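
The two access patterns can be shown side by side. This is a sketch, not the contents of `candidate.c`: the function names are hypothetical, the offsets `0x110` and `0x120` are taken from the `0x8110`/`0x8120` addresses that appear in the Ghidra output and the GCC disassembly, and the XOR of the two reads exists only to give the functions something to return.

```c
#include <stdint.h>

typedef volatile uint32_t u32v;

/* Iteration 1 style: the full 0x8XXX offset is spelled out at every access. */
static uint32_t read_pair_per_access(unsigned char *base)
{
    uint32_t a = *(u32v *)(base + 0x8110);
    uint32_t b = *(u32v *)(base + 0x8120);
    return a ^ b;
}

/* Iteration 2 style: adjust the base once, then use small offsets from it. */
static uint32_t read_pair_via_window(unsigned char *base)
{
    unsigned char *phy = base + 0x8000;
    uint32_t a = *(u32v *)(phy + 0x110);
    uint32_t b = *(u32v *)(phy + 0x120);
    return a ^ b;
}
```

Both functions read the same two addresses; only the way the address is spelled differs. Whether a given compiler then keeps a single base register and folds the small offsets into the load still has to be confirmed in the disassembly.
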
### Remaining gap to vendor (12 bytes = 3 instructions)

1. GCC turns `(x & 0xF0000000) == 0` into `cmp w, w_loaded_const; b.ls`
   instead of vendor's `tst w, #imm; b.eq`. Costs 4 bytes per loop, twice
   = 8 bytes.
2. GCC's `[base+0x184]` accesses inside the handshake loop are
   `add x1, x0, #0x200; ldur x2, [x1, #-124]`. This may be a scheduling
   choice for Cortex-A76, but an encoding limit is the likelier cause:
   0x184 is not a multiple of 8, so a 64-bit `ldr` cannot take it as a
   scaled immediate offset, and `ldur` only reaches -256..255. Costs ~4 bytes.

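
A quick way to sanity-check item 2 is to test the offsets against the A64 load/store immediate ranges. The sketch below encodes two rules: the unsigned-offset form of `ldr`/`str` takes a 12-bit immediate scaled by the access size, and `ldur`/`stur` take a signed 9-bit unscaled immediate. The helper names are hypothetical and the code only models the offset ranges, not the full instruction encodings.

```c
#include <stdbool.h>
#include <stdint.h>

/* LDR/STR (immediate, unsigned offset): offset must be a non-negative
 * multiple of the access size, and offset / size must fit in 12 bits. */
static bool fits_scaled_uimm12(int64_t offset, unsigned access_bytes)
{
    if (offset < 0 || access_bytes == 0)
        return false;
    if (offset % (int64_t)access_bytes != 0)
        return false;
    return offset / (int64_t)access_bytes <= 4095;
}

/* LDUR/STUR: signed 9-bit byte offset, no scaling. */
static bool fits_unscaled_simm9(int64_t offset)
{
    return offset >= -256 && offset <= 255;
}
```

Under these rules a 64-bit access at `0x184` fits neither form directly, while `0x200 - 124` reaches the same address through `ldur`, and a 32-bit access at `0x184` fits the scaled form.
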
### Next iteration ideas

- **Inline-asm** for the mask-tests to force TST encoding directly. Cheap
  win, gets us to ~108 bytes.
- **Clang** (different scheduler, sometimes nicer with TST-style
  comparisons). Try `clang -Oz -ffreestanding -target aarch64-none-elf`.
- **ARMCC** (armclang): the most likely vendor compiler. Sourcing armclang for
  AArch64 requires an Arm Developer account; backlog item.
- **objdiff**: once installed, automate the byte-diff scoring instead
  of eyeballing.

## Workflow validation

- ✓ Function extracted from blob as standalone .bin slice.
- ✓ Three views captured (Ghidra, retdec, hand-written reference).
- ✓ Candidate compiles + runs (matches reference semantics).
- ✓ Single-pass byte-comparison done by hand; got 89.7% on iteration 2.
- ✗ objdiff not installed; it would automate the scoring.
- ✗ decomp.me self-host not yet running on pve4; it would crowdsource the
  grind via the standard interface.
- ✗ ARMCC (armclang) not installed; a perfect match is unattainable without it.

**The pipeline works.** Each future poll-site function follows the
same 4-step recipe: extract → Ghidra-clean → write candidate → iterate
until ≥90% match. Estimated ~2-3 h per function for the small ones.

## How this connects to the v3fb work

This function contains 4 of the 16 poll sites. Once we have a
byte-matching (or functionally-equivalent) C version, we can:

1. Add bounded-retry counters in the C source, which is much cleaner than the
   asm trampoline patcher.
2. Compile + link as a freestanding `.o` at the original blob offset.
3. Splice into the blob, replacing `FUN_0000d328` entirely.

That's the path to a maintainable replacement for the trampoline-based
v3fb approach, **for at least these 4 sites**. The other 12 sites live
in different functions and would each need their own lift.

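
The bounded-retry counters from step 1 could look roughly like the following. This is a hedged sketch: `poll_mask_set`, its signature, and the retry budget are hypothetical, and the polarity (wait until any masked bit is set) is assumed from the `(x & 0xF0000000) == 0` loop condition noted earlier. The real sites may need the opposite sense or a delay between reads.

```c
#include <stdbool.h>
#include <stdint.h>

typedef volatile uint32_t u32v;

/* Poll *reg until at least one bit in mask is set.
 * Gives up after max_tries reads instead of spinning forever.
 * Returns true on success, false on timeout. */
static bool poll_mask_set(u32v *reg, uint32_t mask, uint32_t max_tries)
{
    for (uint32_t i = 0; i < max_tries; i++) {
        if ((*reg & mask) != 0)
            return true;
    }
    return false;
}
```

A caller would replace each unbounded `do {} while` with a call such as `poll_mask_set(reg, 0xF0000000u, budget)` and decide what to do on `false`. Choosing that failure policy is outside what this log covers.
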
## Compiler matrix 2026-04-15 late evening

Tested the same `candidate.c` across GCC and clang:

| compiler | flags | size | diff vs vendor (104 B) |
|---|---|---|---|
| gcc 15 | -Os | 116 B | +12 |
| gcc 15 | -O1 | 120 B | +16 |
| gcc 15 | -O2/-O3 | 128 B | +24 |
| **clang 19** | **-O2 / -Os / -Oz** | **108 B** | **+4** |
| clang 19 | -O1 | 112 B | +8 |
| vendor | | 104 B | 0 |

**Clang at -Oz is 4 bytes off vendor.** 96% size match on our first
compile. GCC -Os tops out at 12 bytes off (89.7%). The difference is
consistent with how each compiler encodes mask-tests and the addressing
it picks for short-imm offsets into a base+offset pointer: clang
prefers `TST Wx, #imm; B.cc` (two instructions, native imm encoding), GCC
prefers `MOV Wy, #const; CMP Wx, Wy; B.cc` (three instructions, larger).

**Consequence:** default compiler for matching-decomp on this blob is
clang, not GCC. That move is already committed in this GRIND_LOG; all future
poll-site lifts should compile-eval under clang first.

**Hypothesis narrowed:** the vendor compiler is almost certainly
**armclang** (ARM's LLVM-based fork) or a similarly-aggressive LLVM
variant, NOT GCC and NOT a dumbed-down rushed compiler. Evidence: their
output is SMALLER than GCC -Os, which rules out "naive". The fact
that clang -Oz approaches a byte match suggests the LLVM family.

**To push past 96%:** armclang itself (needs Arm Developer account /
free Community Edition), or continue clang -Oz + hand-tweaked C +
per-site inline asm where the last instruction doesn't converge. A single
afternoon's iteration should push to ≥99%.

## Iteration 3: 32-bit load + clang -Oz = 100% size match

Changed the handshake-loop reads from `u64v` to `u32v` (32-bit volatile
loads), with a tiny inline `xld()` helper that zero-extends to u64 for
the test. This forced clang to use `ldr w, [x, #0x184]` inside the
loops (instead of hoisting `add x9, x8, #0x184` out), cutting the
4-byte setup overhead.

| compiler | flag | size | diff | score |
|---|---|---|---|---|
| clang 19 | -Oz | **104 B** | **0** | **100% (size-match)** |
| gcc 15 | -Os | see below | see below | see below |

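
The log does not show `xld()` itself. One plausible shape, assuming it does nothing more than a 32-bit volatile load followed by a zero-extending cast, is sketched below. The parameter type and the use of `static inline` are guesses, not the committed code.

```c
#include <stdint.h>

typedef volatile uint32_t u32v;

/* Hypothetical reconstruction of xld(): perform exactly one 32-bit
 * volatile load, then widen to 64 bits. The upper 32 bits are always
 * zero because the source type is unsigned. */
static inline uint64_t xld(const void *addr)
{
    return (uint64_t)*(const u32v *)addr;
}
```

Because the load itself is 32 bits wide, an offset such as `0x184` only has to be 4-byte aligned to be usable as an immediate, which fits the `ldr w, [x, #0x184]` form reported above.
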
### Byte-level comparison (clang vs vendor, both 104 B, both 26 insts)

Three semantically equivalent differences remain, none of them closable from C alone:

1. **Reg choice**: vendor `x0/w1`, clang `x8/w9/w10`.
2. **Mask test form**: vendor `tst w1, #0xf0000000; b.eq`, clang
   `lsr w9, #28; cbz w9, .loop`. Same size, same effect.
3. **Handshake test width**: vendor `tst x1, #0x3` (64-bit on
   zero-extended w1), clang `tst w9, #0x3` (32-bit). Same size.

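
Differences 2 and 3 can be checked in plain C. The sketch below writes each instruction form as the C expression it corresponds to and compares them over a few boundary values. The function names are hypothetical, and the `cmp`-style form is included because it is the shape GCC produced earlier in this log.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* tst w, #0xf0000000 ; b.eq  (vendor form) */
static bool top_clear_tst(uint32_t x) { return (x & 0xF0000000u) == 0; }
/* lsr by 28 ; cbz            (clang form) */
static bool top_clear_lsr(uint32_t x) { return (x >> 28) == 0; }
/* cmp w, const ; b.ls        (GCC form) */
static bool top_clear_cmp(uint32_t x) { return x <= 0x0FFFFFFFu; }

/* Low-two-bits test done at 32-bit and at 64-bit width on the same
 * zero-extended value. */
static bool low2_w(uint32_t x) { return (x & 0x3u) != 0; }
static bool low2_x(uint32_t x) { return ((uint64_t)x & 0x3u) != 0; }

/* Returns true if every form gives the same answer on all samples. */
static bool forms_agree(void)
{
    static const uint32_t samples[] = {
        0x00000000u, 0x00000001u, 0x00000003u, 0x0FFFFFFFu,
        0x10000000u, 0x80000002u, 0xF0000000u, 0xFFFFFFFFu,
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        uint32_t x = samples[i];
        if (top_clear_tst(x) != top_clear_lsr(x)) return false;
        if (top_clear_tst(x) != top_clear_cmp(x)) return false;
        if (low2_w(x) != low2_x(x)) return false;
    }
    return true;
}
```

A sample check is not a proof, but the identities are simple: the top nibble is zero exactly when the value is at most `0x0FFFFFFF`, and zero-extension never touches the low two bits.
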

None of these affect semantics. To chase byte-level exactness you'd need:

- inline asm stubs forcing the specific mask-test form
- register-allocation hints that C doesn't really expose
- **or** the vendor's actual compiler binary (presumably armclang)

**Verdict: done.** Semantic equivalence + identical size + identical
instruction count is the realistic ceiling from C. Further chase is
purely cosmetic.