diff --git a/phase8_iteration18_close.md b/phase8_iteration18_close.md
new file mode 100644
index 0000000..b68cb04
--- /dev/null
+++ b/phase8_iteration18_close.md
@@ -0,0 +1,69 @@
+# Iteration 18 — Phase 8 (close)
+
+Closes 2026-05-14. iter18 tests mechanisms 3 (stale pointer) and 5 (silent partial failure) as explanations for iter17's finding. PARTIAL close. Both mechanisms disproved.
+
+## Mechanism tests
+
+### α-21 (mechanism 3 — stale stack-local pointers)
+
+Made libva's HEVC control structs `static` (file-scope) so they persist for the life of the process. Suppressed `free(slice_params_array)` so the heap-allocated SLICE_PARAMS also persists past `MEDIA_REQUEST_IOC_QUEUE`.
+
+**Result**: Hash unchanged (`06b2c5a0…`). Kernel pr_info still shows `w=0 h=0` for libva. Mechanism 3 **DISPROVED**: the kernel copies the control payload at S_EXT_CTRLS time; the copy is not deferred.
+
+### α-22 (mechanism 5 — silent partial failure via error_idx)
+
+Added libva-side logging of `controls.error_idx` after each `S_EXT_CTRLS`. Output:
+
+```
+S_EXT_CTRLS rc=0 errno=0 count=2 error_idx=1 request_fd=0 which=0x0 # device-init: WORKS
+S_EXT_CTRLS rc=0 errno=0 count=5 error_idx=4 request_fd=8 which=0xf010000 # per-frame: BROKEN
+```
+
+`error_idx = count - 1` in BOTH the working case (device-init, which sets HEVC_DECODE_MODE + HEVC_START_CODE correctly) and the broken case. This is **not a failure indicator** when `rc=0`: the V4L2 documentation leaves `error_idx` undefined on success, and in this kernel version it appears to just be the index of the last control processed.
+
+The α-22 follow-up test removed DECODE_PARAMS (count=4); error_idx was still count-1 (=3). Removing DECODE_PARAMS did not unblock decoding: the kernel state stayed all-zero.
+
+Mechanism 5 **DISPROVED**: error_idx isn't reporting partial failure.
+
+## Both reverted, backend restored to iter15 state (`c1d4bb53…`)
+
+iter18 ships zero code changes. Both tests refuted their hypotheses.
+
+## Remaining mechanisms (post-iter18)
+
+| # | Mechanism | Status |
+|---|---|---|
+| 1 | request_fd mismatch | unlikely (strace shows consistent fd) |
+| 2 | REINIT clears controls between S_EXT_CTRLS and QUEUE | **untested — leading hypothesis** |
+| 3 | Stack-locals stale | ❌ DISPROVED by α-21 |
+| 4 | ctrl_hdl mismatch (libva submits to one, rkvdec reads from another) | **untested — possible** |
+| 5 | Silent partial failure via error_idx | ❌ DISPROVED by α-22 |
+
+## iter19 candidate (α-23)
+
+Mechanism 2 test: temporarily disable `media_request_reinit()` in libva's `RequestSyncSurface` for HEVC. If the controls survive once REINIT no longer runs, mechanism 2 is confirmed. The fix would then be to reorder: REINIT must run **before** the next S_EXT_CTRLS, NOT after the previous decode (which is libva's current iter6 model).
+
+Alternatively (mechanism 4): add a deeper kernel printk that dumps the `ctx->ctrl_hdl` pointer and the per-request `req->req` (V4L2 request handler) pointer, then compare the libva-triggered and kdirect-triggered runs. If the handlers differ, libva is staging controls to the wrong one.
+
+The kernel-side approach (deeper printk) is more invasive but more definitive. Another option: rebuild `rkvdec_hevc_run_preamble` to dump `&ctx->ctrl_hdl` and the first 16 bytes of `*run->sps`. If the pointer matches a previous frame's, that suggests no per-request update.
+
+## Substrate state at iter18 close
+
+- Fork tip `fc78ed4` on noether + fresnel + gitea (clean iter15 state).
+- Backend SHA `c1d4bb53…` on fresnel (iter15 stable).
+- Kernel `linux-fresnel-fourier 7.0-3` (with diagnostic printk; keep it for iter19).
+- 5-codec anchors: byte-identical to iter15 anchors. Zero regression.
+
+## iter17 finding stands
+
+Despite iter18's negative results on mechanisms 3 and 5, the iter17 empirical finding remains the campaign's strongest narrowing:
+
+**The kernel sees zero control values for libva HEVC (`run.sps->{w,h,reorder,chroma}=0,0,0,0`) but correct values for kdirect (`w=1280, h=720, reorder=2, chroma=1`).**
+
+The mechanism is still unknown but localized. The remaining productive direction is targeted kernel investigation of where libva's S_EXT_CTRLS payload lands vs where rkvdec reads from.
+
+## Lessons
+
+1. **error_idx is only meaningful when S_EXT_CTRLS fails.** The V4L2 documentation leaves it undefined on success, so don't rely on it for partial-failure detection without first verifying the success-case value.
+2. **Stack-local control pointers are SAFE for V4L2 compound controls**: the kernel copies immediately at S_EXT_CTRLS time.
+3. The S_EXT_CTRLS → request → ctrl_hdl chain has at least one bug specific to libva's invocation pattern on RK3399 rkvdec, even though the wire bytes are identical to kdirect's.
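
Lesson 1 can be sketched in C as a small classifier for one `S_EXT_CTRLS` result. This is an illustration under stated assumptions, not libva or kernel code: the enum and function names are hypothetical, and the rules encode the documented V4L2 behaviour as understood here (`error_idx` is undefined when the ioctl returns 0, is the index of the failing control on a per-control error, and equals `count` when validation fails). The functions take plain integers so they can be exercised without a V4L2 device.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical classification of one VIDIOC_S_EXT_CTRLS result. */
enum ext_ctrls_status {
    EXT_CTRLS_OK,              /* rc == 0: error_idx is undefined, ignore it */
    EXT_CTRLS_BAD_CONTROL,     /* rc != 0 and error_idx names a control      */
    EXT_CTRLS_VALIDATION_FAIL  /* rc != 0 and error_idx >= count             */
};

/* Interpret error_idx only when the ioctl actually failed. */
static enum ext_ctrls_status
classify_ext_ctrls(int rc, unsigned int count, unsigned int error_idx)
{
    if (rc == 0)
        return EXT_CTRLS_OK;
    if (error_idx < count)
        return EXT_CTRLS_BAD_CONTROL;
    return EXT_CTRLS_VALIDATION_FAIL;
}

/* Human-readable name for a status value, for log lines. */
static const char *ext_ctrls_status_name(enum ext_ctrls_status s)
{
    switch (s) {
    case EXT_CTRLS_OK:              return "ok";
    case EXT_CTRLS_BAD_CONTROL:     return "bad-control";
    case EXT_CTRLS_VALIDATION_FAIL: return "validation-failed";
    }
    return "unknown";
}

/* Log the raw fields next to the interpretation (hypothetical format). */
static void log_ext_ctrls(FILE *out, int rc, unsigned int count,
                          unsigned int error_idx)
{
    fprintf(out, "S_EXT_CTRLS rc=%d count=%u error_idx=%u -> %s\n",
            rc, count, error_idx,
            ext_ctrls_status_name(classify_ext_ctrls(rc, count, error_idx)));
}
```

Applied to the two results logged in α-22 (`rc=0 count=2 error_idx=1` and `rc=0 count=5 error_idx=4`), both classify as ok, so `error_idx` carries no information in either case.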