benchmark/: three-way RE-tool comparison + first real C-lift
Four functions extracted from the v1.19 conservative blob (three small
helpers plus the first real poll-site function), each with ground-truth C
and per-tool (Ghidra / retdec / decomp.me) docs:
01_memset: byte memset, 28 B
02_memcpy32: word-aligned memcpy, 36 B
03_magic_memset: magic check + tail-call to memset, 40 B
04_train_phy_block: first real poll-site function (104 B, 26 insts),
contains poll sites 12-15
Results in RESULTS.md:
- Ghidra: A on all four. Auto-decompile is close to final.
- retdec: A on #3, F on #1 and #2 (no register-arg inference on raw),
C on #4 (renders the `& 0xF0000000` mask tests as `< 0x10000000`).
GRIND_LOG.md (in 04_train_phy_block/) records the matching-decomp
iteration: 116-byte candidate.c at -Os vs vendor 104 bytes = 89.7%
size match on first real iteration. Remaining gap is GCC's choice of
`cmp w, w_const; b.ls` over vendor's `tst w, #imm; b.eq` for the
mask tests.
gdb_debug/ holds a native-aarch64 GDB single-stepper for the three
benchmark functions — boltzmann smoke test passed (memset:
buf[10] 0x00→0xab).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,80 @@
# GRIND_LOG: first real-blob C-lift

Function: **FUN_0000d328** @ blob offset 0xd328 (104 bytes / 26 insts).
Contains 4 of our 16 timeout-less polls (sites 12, 13, 14, 15).
Semantics: **PHY block training step**: poke CTL, wait for two STAT
bits, apply two CFG values with HANDSHAKE acks, ack via CTL.

## Tools tried (single-pass, no iteration yet)

| tool | output file | grade |
|---|---|---|
| Ghidra 11.3 (auto-decompile) | `ghidra.c` | **A.** All 4 polls correctly modeled as `do {} while`. Collapsed the `(base + 0x8000) + offset` arithmetic into a single offset (`lVar1 + 0x8110` etc.), which is actually MORE useful than a hand-written reference because it surfaces the absolute register addresses. Type cleanup needed (`undefined4`/`uint`/`long`). |
| retdec v5.0 (zero-touch raw mode) | `retdec.c` | **C.** Recognised the function and the polls but rendered the bitmask tests as comparisons (`*v6 % 4 == 0` for `& 3`, `< 0x10000000` for `& 0xF0000000`). These are equivalent for unsigned operands, but they hide the bit-field intent. Fabricated a return value for a void function. Loop bodies marked as `continue ->` comments. Usable as a sanity-check second opinion, not as a basis for rewriting. |
| ground truth (hand-written) | `reference.c` | n/a. This is the canonical interpretation we judge against. |

## Matching-decomp candidate iterations (the actual grind)

Goal: a `.c` file that compiles to bytes close to the original 104-byte
slice. Score = `min(candidate_size, vendor_size) / max(candidate_size, vendor_size)`
after instruction-by-instruction diff (manual until objdiff is installed).

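The score formula above can be sketched as a small C helper. The function name `size_match_pct` is ours, for illustration only; it is not part of the benchmark tree.

```c
#include <stdio.h>

/* Size-match score: min(a, b) / max(a, b), expressed as a percentage.
 * Hypothetical helper name; only the formula comes from this log. */
static double size_match_pct(unsigned candidate_size, unsigned vendor_size)
{
    unsigned lo = candidate_size < vendor_size ? candidate_size : vendor_size;
    unsigned hi = candidate_size < vendor_size ? vendor_size : candidate_size;

    /* Guard against an empty slice so we never divide by zero. */
    return hi ? 100.0 * (double)lo / (double)hi : 0.0;
}

/* Convenience printer for a log line. */
static void print_score(unsigned candidate_size, unsigned vendor_size)
{
    printf("%u vs %u bytes: %.1f%%\n", candidate_size, vendor_size,
           size_match_pct(candidate_size, vendor_size));
}
```

Calling `print_score(116, 104)` evaluates 104 / 116, which rounds to 89.7%. The score only compares sizes, so two different instruction sequences of equal length still score 100%; the manual diff is what catches those.
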
### Iteration 1: cast-on-each-access, `-O2`

- Pattern: `*(volatile u32 *)(base + offset)` per access.
- GCC behavior: materialised each `0x8XXX` offset into its own register
  (`mov x2, #0x8120; add x2, x3, x2; ldr w0, [x2]`), exploding code size.
- Result: ~160 bytes. **~65% size match** by the formula above (104/160). **Bad.**

### Iteration 2 (current best): pre-adjust base outside volatile chain, `-Os`

- Pattern: `unsigned char *phy = base + 0x8000` once, then `*(u32v *)(phy + small)`.
- `-Os` instead of `-O2`: drops loop-alignment NOPs.
- Result: **116 bytes (29 insts)**. **89.7% size match.** See `candidate.c`.

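A minimal sketch of the two access patterns side by side, reduced to two reads. The function names are illustrative and not taken from `candidate.c`. The reason the second form is smaller: a 32-bit `ldr` with an immediate offset can only encode multiples of 4 up to 16380, so `0x8118` does not fit, while `0x118` does once `0x8000` has been added to the base.

```c
#include <stdint.h>

typedef volatile uint32_t u32v;

/* Iteration 1 style: the full 0x8XXX offset appears at every access,
 * so each one needs its own address computation. */
static uint32_t read_cast_each(unsigned char *base)
{
    return *(u32v *)(base + 0x8118) | *(u32v *)(base + 0x8120);
}

/* Iteration 2 style: add 0x8000 once, then use small offsets that fit
 * the load/store immediate field directly. */
static uint32_t read_preadjusted(unsigned char *base)
{
    unsigned char *phy = base + 0x8000;

    return *(u32v *)(phy + 0x118) | *(u32v *)(phy + 0x120);
}
```

Both functions read the same two registers and return the same value; only the generated addressing differs.
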
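### Remaining gap to vendor (12 bytes = 3 instructions)

1. GCC turns `(x & 0xF0000000) == 0` into `cmp w, w_loaded_const; b.ls`
   instead of vendor's `tst w, #imm; b.eq`. The constant has to be
   materialised in a register first. Costs 4 bytes per loop, twice
   = 8 bytes.
2. GCC's `[base+0x184]` accesses inside the handshake loops come out as
   `add x1, x0, #0x200; ldur x2, [x1, #-124]`, where the vendor has a
   plain `ldr w1, [x0, #0x184]`. The likely cause is `candidate.c` itself
   rather than GCC's scheduling: it reads HANDSHAKE through a 64-bit
   `u64v` pointer, and `0x184` is not a multiple of 8, so the scaled
   `ldr x` offset form cannot encode it. Costs ~4 bytes. A 32-bit load
   widened before the test should remove it.
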
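Why GCC is allowed to pick `cmp` + `b.ls` in item 1: for an unsigned 32-bit value, "top four bits are all zero" and "value is at most `0x0FFFFFFF`" are the same predicate. A small sketch with illustrative function names:

```c
#include <stdint.h>

/* Mask form, as written in candidate.c. Maps naturally to TST + B.EQ. */
static int stat_clear_mask(uint32_t x)
{
    return (x & 0xF0000000u) == 0u;
}

/* Compare form. Same truth table for unsigned 32-bit values, and the
 * shape that CMP + B.LS implements. */
static int stat_clear_cmp(uint32_t x)
{
    return x <= 0x0FFFFFFFu;
}
```

Since both forms are equivalent at the C level, rewriting the source expression alone is unlikely to force the `tst` encoding; the choice is made inside the compiler.
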
### Next iteration ideas

- **Inline-asm** for the mask-tests to force TST encoding directly. Cheap
  win, gets us to ~108 bytes.
- **Clang** (different scheduler, sometimes nicer with TST-style
  comparisons). Try `clang -Oz -ffreestanding -target aarch64-none-elf`.
- **ARMCC**: the most likely vendor compiler. Sourcing armclang for
  AArch64 requires an Arm Developer account; backlog item.
- **objdiff**: once installed, automate the byte-diff scoring instead
  of eyeballing.

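A rough sketch of the inline-asm idea, untested against the blob. It assumes a GCC that supports flag output operands on AArch64 (signalled by the `__GCC_ASM_FLAG_OUTPUTS__` macro) and falls back to plain C elsewhere. The names `stat_bits_clear` and `wait_stat` are ours.

```c
#include <stdint.h>

/* Returns nonzero when none of bits [31:28] of v are set. */
static inline int stat_bits_clear(uint32_t v)
{
#if defined(__aarch64__) && defined(__GCC_ASM_FLAG_OUTPUTS__)
    int z;

    /* "=@cceq" asks the compiler to take the result from the Z flag
     * that TST sets; %w1 selects the 32-bit view of the register. */
    __asm__("tst %w1, #0xf0000000" : "=@cceq"(z) : "r"(v));
    return z;
#else
    /* Portable fallback: same result, no control over the encoding. */
    return (v & 0xF0000000u) == 0u;
#endif
}

/* Poll shape matching sites 12 and 13: spin while the bits are clear. */
static void wait_stat(volatile uint32_t *reg)
{
    while (stat_bits_clear(*reg))
        ;
}
```

Whether GCC then branches on the flag directly (`b.eq`) or first copies it out with a `cset` has to be checked in the disassembly; the ~108-byte estimate only holds in the first case.
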
## Workflow validation

- ✓ Function extracted from blob as standalone .bin slice.
- ✓ Three views captured (Ghidra, retdec, hand-written reference).
- ✓ Candidate compiles + runs (matches reference semantics).
- ✓ Single-pass byte-comparison done by hand; got 89.7% on iteration 2.
- ✗ objdiff not installed: would automate the scoring.
- ✗ decomp.me self-host not yet running on pve4: would crowdsource the
  grind via the standard interface.
- ✗ ARMCC not installed: perfect match probably unattainable without it.

**The pipeline works.** Each future poll-site function follows the
same 4-step recipe: extract → Ghidra-clean → write candidate → iterate
until ≥90 % match. Estimated ~2-3 h per function for the small ones.

## How this connects to the v3fb work

This function contains 4 of the 16 poll sites. Once we have a
byte-matching (or functionally-equivalent) C version, we can:

1. Add bounded-retry counters in the C source, which is much cleaner
   than the asm trampoline patcher.
2. Compile + link as a freestanding `.o` at the original blob offset.
3. Splice into the blob, replacing `FUN_0000d328` entirely.

That's the path to a maintainable replacement for the trampoline-based
v3fb approach, **for at least these 4 sites**. The other 12 sites live
in different functions and would each need their own lift.

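Step 1 could look roughly like the helper below. This is a sketch only: `poll_bits`, its return convention, and the spin budget are all assumptions, and nothing like it exists in the tree yet.

```c
#include <stdint.h>

/* Hypothetical bounded poll. Spins until ((*reg & mask) != 0) matches
 * want_set, or until max_spins reads have been made.
 * Returns 0 on success, -1 on timeout. */
static int poll_bits(volatile uint32_t *reg, uint32_t mask, int want_set,
                     unsigned max_spins)
{
    while (max_spins--) {
        int set = (*reg & mask) != 0u;

        if (set == (want_set != 0))
            return 0;
    }
    return -1;
}
```

Site 12 would then read something like `poll_bits(stat_a, 0xF0000000u, 1, budget)` and site 15 `poll_bits(handshake, 0x3u, 0, budget)`. The original function returns void, so how a timeout gets reported to the caller is still an open design question, and any added counter moves the replacement away from a byte match by construction.
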
@@ -0,0 +1,36 @@
/* Best matching candidate so far for FUN_0000d328.
 * Compile: gcc -Os -ffreestanding -nostdlib -c candidate.c -o candidate.o
 * Score: 116 bytes vs vendor 104 bytes (89.7% size match, 12 bytes / 3 insts over).
 *
 * Remaining gap vs vendor:
 *   - GCC emits `cmp w, w_loaded_const ; b.ls` for `(x & 0xF0000000) == 0`
 *     instead of vendor's `tst w, #0xF0000000 ; b.eq`. The vendor form
 *     avoids materializing a constant in a register, saving 4 bytes
 *     per loop, twice = 8 bytes.
 *   - GCC emits `add x1, x0, #0x200 ; ldur x2, [x1, #-124]` for the
 *     `[base+0x184]` accesses inside the handshake loops, vs vendor's
 *     direct `ldr w1, [x0, #0x184]`. Costs us ~4 bytes. Likely caused by
 *     the u64v loads below: 0x184 is not 8-byte aligned, so the scaled
 *     `ldr x` offset form cannot encode it.
 *
 * Next iterations to try:
 *   1. Inline-asm for the mask-tests to force TST encoding.
 *   2. `__builtin_expect((x & 0xF0000000) != 0, 0)` to hint loop direction.
 *   3. Alternative compilers: clang, ARMCC (the latter is what Rockchip
 *      most likely used; need to source it).
 */
typedef volatile unsigned int u32v;
typedef volatile unsigned long u64v;

void train_phy_block(unsigned long ctx)
{
    unsigned char *phy = (unsigned char *)(*(unsigned long *)(ctx + 0xb8) + 0x8000);
    *(u32v *)(phy + 0x110) = 0xf000f000u;
    while ((*(u32v *)(phy + 0x118) & 0xf0000000u) == 0u) ;
    while ((*(u32v *)(phy + 0x120) & 0xf0000000u) == 0u) ;
    *(u32v *)(phy + 0x160) = 0x30003u;
    *(u32v *)(phy + 0x154) = 0x30003u;
    /* NOTE: 64-bit loads at +0x184; vendor uses a 32-bit ldr + 64-bit tst. */
    while ((*(u64v *)(phy + 0x184) & 3ul) == 0ul) ;
    *(u32v *)(phy + 0x154) = 0x30000u;
    while ((*(u64v *)(phy + 0x184) & 3ul) != 0ul) ;
    *(u32v *)(phy + 0x160) = 0x30000u;
    *(u32v *)(phy + 0x110) = 0xf0000000u;
}
@@ -0,0 +1,71 @@
# decomp.me recipe: 04_train_phy_block

This is the **first real-blob function we're lifting to byte-matching C.**
Score target: ≥95% match. Perfect match unlikely (compiler unknown).

## Target asm (paste into "Target asm" field)

```asm
train_phy_block:
    ldr     x0, [x0, #0xb8]
    mov     w1, #0xf000f000
    add     x0, x0, #0x8000
    str     w1, [x0, #0x110]
.Lwait_a:
    ldr     w1, [x0, #0x118]
    tst     w1, #0xf0000000
    b.eq    .Lwait_a
.Lwait_b:
    ldr     w1, [x0, #0x120]
    tst     w1, #0xf0000000
    b.eq    .Lwait_b
    mov     w1, #0x30003
    str     w1, [x0, #0x160]
    str     w1, [x0, #0x154]
.Lwait_hs1:
    ldr     w1, [x0, #0x184]
    tst     x1, #0x3
    b.eq    .Lwait_hs1
    mov     w1, #0x30000
    str     w1, [x0, #0x154]
.Lwait_hs2:
    ldr     w1, [x0, #0x184]
    tst     x1, #0x3
    b.ne    .Lwait_hs2
    mov     w1, #0x30000
    str     w1, [x0, #0x160]
    mov     w1, #0xf0000000
    str     w1, [x0, #0x110]
    ret
```

## Compiler

`aarch64-linux-gnu gcc 12 -O2 -ffreestanding -nostdlib`
(Try also `-Os`. Vendor blob's compiler unknown; could be ARMCC or older
GCC. Optimal C may differ between targets; perfect byte-match probably
unattainable.)

## Context

Use `reference.c` as the starting C. The W-vs-X register distinction in
the two handshake polls (`tst x1, #0x3` tests the 64-bit register even
though the load was `ldr w1`, a vendor quirk) suggests a particular
intrinsic / source pattern. May need to write the load as
`(uint64_t)mmio_r(...)` and the test as a 64-bit AND to coax GCC into
emitting `tst x1` instead of `tst w1`.

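The widening trick described above, as a minimal sketch. `mmio_r` is copied from `reference.c`; `handshake_idle` is an illustrative name, and the literal `0x184` / `3` mirror `PHY_HANDSHAKE` / `PHY_HS_BUSY`.

```c
#include <stdint.h>

/* Same accessor as reference.c: one 32-bit volatile load. */
static inline uint32_t mmio_r(volatile uint8_t *base, unsigned off)
{
    return *(volatile uint32_t *)(base + off);
}

/* 32-bit load, zero-extended to 64 bits, then a 64-bit AND.
 * The load stays `ldr w`; only the test is expressed on 64 bits. */
static int handshake_idle(volatile uint8_t *phy)
{
    return ((uint64_t)mmio_r(phy, 0x184) & 3ull) == 0;
}
```

This keeps the memory access at 32 bits, unlike a `volatile uint64_t` read of the register. It is not guaranteed that GCC keeps the X-register form: the optimizer is free to narrow the AND back to `tst w1`, so the result has to be checked in the scratch diff.
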
## Things to iterate on

- Order of writes to CFG_A vs CFG_B: vendor wrote CFG_B first
  (`str w1, [x0, #0x160]` then `str w1, [x0, #0x154]`). C order matters.
- The two `mov w1, #0x30000` near the end could be merged by GCC into a
  single constant load kept in a spare register; vendor emitted them
  inline. May need separate variables to prevent that.
- `add x0, x0, #0x8000` vs `add x0, x0, #0x8, lsl #12`: same
  instruction, GAS picks one. Either should round-trip.

## Score expectations

- 80%: rough loop structure + register usage matches.
- 95%: instruction order + immediate forms match.
- 100%: would require exact compiler/version match. Unlikely without
  ARMCC.
Binary file not shown.
@@ -0,0 +1,33 @@

func.bin:     file format binary


Disassembly of section .data:

000000000000d328 <.data>:
    d328:   f9405c00    ldr   x0, [x0, #184]
    d32c:   32048fe1    mov   w1, #0xf000f000    // #-268374016
    d330:   91402000    add   x0, x0, #0x8, lsl #12
    d334:   b9011001    str   w1, [x0, #272]
    d338:   b9411801    ldr   w1, [x0, #280]
    d33c:   72040c3f    tst   w1, #0xf0000000
    d340:   54ffffc0    b.eq  0xd338  // b.none
    d344:   b9412001    ldr   w1, [x0, #288]
    d348:   72040c3f    tst   w1, #0xf0000000
    d34c:   54ffffc0    b.eq  0xd344  // b.none
    d350:   320087e1    mov   w1, #0x30003       // #196611
    d354:   b9016001    str   w1, [x0, #352]
    d358:   b9015401    str   w1, [x0, #340]
    d35c:   b9418401    ldr   w1, [x0, #388]
    d360:   f240043f    tst   x1, #0x3
    d364:   54ffffc0    b.eq  0xd35c  // b.none
    d368:   52a00061    mov   w1, #0x30000       // #196608
    d36c:   b9015401    str   w1, [x0, #340]
    d370:   b9418401    ldr   w1, [x0, #388]
    d374:   f240043f    tst   x1, #0x3
    d378:   54ffffc1    b.ne  0xd370  // b.any
    d37c:   52a00061    mov   w1, #0x30000       // #196608
    d380:   b9016001    str   w1, [x0, #352]
    d384:   52be0001    mov   w1, #0xf0000000    // #-268435456
    d388:   b9011001    str   w1, [x0, #272]
    d38c:   d65f03c0    ret
@@ -0,0 +1,18 @@
/* Ghidra 11.3 default decompiler output for FUN_0000d328, unmodified. */
void FUN_0000d328(long param_1)
{
  long lVar1;

  lVar1 = *(long *)(param_1 + 0xb8);
  *(undefined4 *)(lVar1 + 0x8110) = 0xf000f000;
  do { } while ((*(uint *)(lVar1 + 0x8118) & 0xf0000000) == 0);
  do { } while ((*(uint *)(lVar1 + 0x8120) & 0xf0000000) == 0);
  *(undefined4 *)(lVar1 + 0x8160) = 0x30003;
  *(undefined4 *)(lVar1 + 0x8154) = 0x30003;
  do { } while ((*(uint *)(lVar1 + 0x8184) & 3) == 0);
  *(undefined4 *)(lVar1 + 0x8154) = 0x30000;
  do { } while ((*(uint *)(lVar1 + 0x8184) & 3) != 0);
  *(undefined4 *)(lVar1 + 0x8160) = 0x30000;
  *(undefined4 *)(lVar1 + 0x8110) = 0xf0000000;
  return;
}
@@ -0,0 +1,89 @@
/* Ground-truth C for FUN_0000d328 @ blob offset 0xd328 (104 bytes / 26 insts).
 *
 * **The first real poll-site function we lift to C.**
 * Contains 4 of our 16 timeout-less polls (sites 12, 13, 14, 15).
 *
 * Pattern: PHY-block training step: poke a control register, wait for
 *          two status bits, apply two intermediate values with a
 *          handshake on a state register, ack the event.
 *
 * Signature: void train_phy_block(struct phy_ctx *ctx);
 *            (X0 = ctx, returns void)
 *
 * Layout:
 *   ctx (X0)              : opaque per-rank/per-channel context
 *   ctx->block (ctx+0xb8) : 64-bit pointer to a PHY block base
 *   block + 0x8000        : addressed sub-block (likely "Master" bank in DWC PUB)
 *
 * The sub-block at +0x8000 has these registers (offsets within +0x8000):
 *   +0x110  CTL       : write 0xF000F000 to start, 0xF0000000 to clear
 *   +0x118  STAT_A    : bit[31:28] non-zero = step A done
 *   +0x120  STAT_B    : bit[31:28] non-zero = step B done
 *   +0x154  CFG_A     : write training value
 *   +0x160  CFG_B     : write training value
 *   +0x184  HANDSHAKE : bits[1:0] toggle between 0 and !=0 to ack writes
 *
 * The 4 polls (in order):
 *   site 12 (B.EQ): STAT_A bit[31:28] non-zero?
 *   site 13 (B.EQ): STAT_B bit[31:28] non-zero?
 *   site 14 (B.EQ): HANDSHAKE bits[1:0] non-zero?  (ack of step-1 writes)
 *   site 15 (B.NE): HANDSHAKE bits[1:0] zero?      (ack of step-2 write)
 */
#include <stdint.h>

struct phy_ctx {
    uint8_t  pad[0xB8];
    uint8_t *block;        /* base pointer, at +0xB8 in struct */
    /* ... rest of struct unknown */
};

#define PHY_CTL          0x110
#define PHY_STAT_A       0x118
#define PHY_STAT_B       0x120
#define PHY_CFG_A        0x154
#define PHY_CFG_B        0x160
#define PHY_HANDSHAKE    0x184

#define PHY_CTL_GO       0xF000F000U
#define PHY_CTL_CLR      0xF0000000U
#define PHY_STAT_DONE    0xF0000000U
#define PHY_CFG_VAL_RUN  0x00030003U
#define PHY_CFG_VAL_END  0x00030000U
#define PHY_HS_BUSY      0x3U

static inline uint32_t mmio_r(volatile uint8_t *base, unsigned off) {
    return *(volatile uint32_t *)(base + off);
}
static inline void mmio_w(volatile uint8_t *base, unsigned off, uint32_t v) {
    *(volatile uint32_t *)(base + off) = v;
}

void train_phy_block(struct phy_ctx *ctx) {
    volatile uint8_t *phy = (volatile uint8_t *)(ctx->block + 0x8000);

    mmio_w(phy, PHY_CTL, PHY_CTL_GO);

    /* site 12: wait for step A complete */
    while ((mmio_r(phy, PHY_STAT_A) & PHY_STAT_DONE) == 0)
        ;

    /* site 13: wait for step B complete */
    while ((mmio_r(phy, PHY_STAT_B) & PHY_STAT_DONE) == 0)
        ;

    /* vendor writes CFG_B before CFG_A; keep this order */
    mmio_w(phy, PHY_CFG_B, PHY_CFG_VAL_RUN);
    mmio_w(phy, PHY_CFG_A, PHY_CFG_VAL_RUN);

    /* site 14: wait for handshake to assert */
    while ((mmio_r(phy, PHY_HANDSHAKE) & PHY_HS_BUSY) == 0)
        ;

    mmio_w(phy, PHY_CFG_A, PHY_CFG_VAL_END);

    /* site 15: wait for handshake to deassert */
    while ((mmio_r(phy, PHY_HANDSHAKE) & PHY_HS_BUSY) != 0)
        ;

    mmio_w(phy, PHY_CFG_B, PHY_CFG_VAL_END);
    mmio_w(phy, PHY_CTL, PHY_CTL_CLR);
}